HomeAI NewsMicrosoft’s ASSERT is Redefining AI Agent Testing

Microsoft’s ASSERT is Redefining AI Agent Testing

Released at Build 2026, this open-source framework turns your natural language safety policies into executable, trace-grounded evaluations—all while keeping your data strictly local.

  • End-to-End Automation: ASSERT converts plain text behavior policies into executable, multi-turn evaluation test cases without requiring manual scripting.
  • Deep Trace-Grounded Scoring: It evaluates the entire agent journey—including tool calls, retrieved context, and intermediate routing decisions—rather than just grading the final output.
  • Local-First and Ecosystem Agnostic: Released as a free, open-source (MIT) tool, it supports over 100 models and major frameworks like LangChain, CrewAI, and AutoGen, prioritizing data sovereignty by storing all artifacts locally.

Building smart AI agents is no longer the primary hurdle for enterprise teams; proving they are safe is. Engineering teams usually start with clear, plain-text safety policies or product requirements, but manually translating those intentions into executable, maintainable test cases is a massive operational bottleneck. Manual tests drift, and generic benchmarks fail to capture highly specific corporate policies.

Enter ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing). Unveiled at Microsoft Build 2026, this open-source framework bridges the gap between what you want your AI to do and how you prove it does it, allowing developers to test agents simply by describing their desired behavior in natural language.

The Spec-to-Eval Pipeline

Instead of forcing developers to hand-code hundreds of brittle assertions, ASSERT processes your natural language policies through a highly automated, four-stage pipeline:

  1. Systematization: It begins by converting a broad, plain-text rule (e.g., “Do not share confidential data with external email addresses”) into a structured concept specification.
  2. Taxonomization: That concept is translated into a human-editable taxonomy of permissible and impermissible behaviors, giving policy experts a chance to refine the rules before testing begins.
  3. Test Set Generation: ASSERT automatically builds a stratified suite of single- and multi-turn test scenarios, ensuring the agent is probed across edge cases and complex conditions.
  4. Inference and Scoring: The tests run against your agent. An LLM judge evaluates the full operational trace—citing specific tool calls, retrieved context, and routing decisions as evidence—to deliver a detailed, justified verdict.

Privacy and Ecosystem Compatibility

ASSERT champions data sovereignty. It operates with a local-first philosophy, meaning no telemetry is sent to Microsoft by default. All outputs are written as clean, inspectable JSON and JSONL artifacts locally, making them easy to integrate into CI pipelines and version control.

The framework is also fiercely agnostic. Through its LiteLLM and OpenTelemetry integrations, ASSERT seamlessly evaluates LangGraph, CrewAI, AutoGen, and custom setups across more than 100 model endpoints (including Azure, AWS Bedrock, Anthropic, and OpenAI).

The Complete Trust Stack: ASSERT + ACS

ASSERT didn’t launch in a vacuum; it arrived alongside the Agent Control Specification (ACS), an open standard for placing deterministic safety controls within agent workflows.

They are designed to work as a closed loop. While ASSERT tests for policy violations, ACS enforces the rules by placing guardrails (like content filters or LLM judges) at exact workflow checkpoints. Together, they form a continuous trust lifecycle: use ASSERT to identify a vulnerability, patch it with ACS, and re-run ASSERT to verify the fix.

Where ASSERT Fits in 2026

The AI testing landscape is crowded, but ASSERT carves out a vital niche. While tools like Braintrust and Langfuse excel at post-deployment live monitoring, and DeepEval handles native python CI/CD integration, ASSERT owns the pre-production spec-to-eval space.

It is important to note that ASSERT does not eliminate the human element. The framework’s LLM judge is only as accurate as the behavior taxonomy you approve, requiring domain experts to eliminate ambiguity early on. However, if your team needs application-specific testing, trace-grounded evidence, and local-first data storage, ASSERT is the missing link needed to deploy agents with confidence.

Helen
Helen
Lead editor at Neuronad covering AI, machine learning, and emerging tech.

Must Read