What is Microsoft ASSERT?

ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) is an open-source Microsoft framework that converts natural-language policies and requirements into executable evaluation pipelines. It generates test scenarios, runs them against AI agents or models, and scores results with reasoning — so teams see why something passed or failed, not just a boolean flag. It works across LangChain, CrewAI, LiteLLM, and other stacks.
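To make the "scores results with reasoning" idea concrete, here is a minimal sketch of the shape such an evaluation loop might take. It is not ASSERT's API: `Scenario`, `Verdict`, `evaluate`, and `toy_agent` are hypothetical names invented for illustration, and a plain substring check stands in for whatever scoring ASSERT actually performs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    policy: str     # natural-language requirement being checked (hypothetical field)
    prompt: str     # input sent to the agent under test
    forbidden: str  # substring whose presence counts as a violation


@dataclass
class Verdict:
    passed: bool
    reasoning: str  # why the scenario passed or failed, not just the flag


def evaluate(agent: Callable[[str], str], scenarios: List[Scenario]) -> List[Verdict]:
    """Run every scenario against the agent and attach a reason to each result."""
    verdicts = []
    for s in scenarios:
        reply = agent(s.prompt)
        # Assumption: a substring match stands in for a real scorer.
        violated = s.forbidden.lower() in reply.lower()
        if violated:
            reasoning = f"Policy '{s.policy}': reply contained '{s.forbidden}'."
        else:
            reasoning = f"Policy '{s.policy}': reply did not contain '{s.forbidden}'."
        verdicts.append(Verdict(passed=not violated, reasoning=reasoning))
    return verdicts


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent; it wrongly grants any refund request."""
    if "refund" in prompt.lower():
        return "Sure, I have issued a full refund."
    return "I can help you with order status."
```

The point of the `reasoning` field is that a failing run tells you which policy was violated and what in the reply triggered it, which is the difference between an evaluation and a boolean flag.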

What is the Agent Control Specification (ACS)?

ACS is an open, vendor-neutral standard from Microsoft for runtime governance of AI agents. It defines five checkpoints in the agent lifecycle: input, model call, state, tool execution, and output. At each one, deterministic controls (classifiers, LLM judges, content filters) can be applied. Policies are expressed as portable YAML manifests, with the aim of standardizing agent governance much as MCP standardized tool access.
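The checkpoint model is easier to picture in code. The sketch below is an illustration only and does not reflect the ACS manifest schema or any Microsoft API: `Checkpoint`, `Policy`, and `block_if_contains` are hypothetical names, and the policy is built in Python rather than loaded from a YAML manifest.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional


class Checkpoint(Enum):
    """The five lifecycle points named above."""
    INPUT = "input"
    MODEL_CALL = "model_call"
    STATE = "state"
    TOOL_EXECUTION = "tool_execution"
    OUTPUT = "output"


# A control inspects a payload and returns a reason to block it, or None to allow it.
Control = Callable[[str], Optional[str]]


def block_if_contains(term: str) -> Control:
    """Hypothetical rule-based control: block any payload containing `term`."""
    def control(payload: str) -> Optional[str]:
        if term.lower() in payload.lower():
            return f"blocked: payload contains '{term}'"
        return None
    return control


class Policy:
    """Maps each checkpoint to the controls that run there."""

    def __init__(self) -> None:
        self._controls: Dict[Checkpoint, List[Control]] = {c: [] for c in Checkpoint}

    def add(self, checkpoint: Checkpoint, control: Control) -> None:
        self._controls[checkpoint].append(control)

    def check(self, checkpoint: Checkpoint, payload: str) -> Optional[str]:
        """Return the first blocking reason at this checkpoint, or None if allowed."""
        for control in self._controls[checkpoint]:
            reason = control(payload)
            if reason is not None:
                return reason
        return None
```

A control registered at one checkpoint does not fire at another. For example, after `policy.add(Checkpoint.TOOL_EXECUTION, block_if_contains("rm -rf"))`, the same payload is blocked at tool execution but allowed at output.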

Why did Microsoft build ASSERT and ACS?

Because AI agents fail in ways that are hard to see, and generic benchmarks do not catch failures specific to your policies, product, and users. Microsoft recognized that an agent's self-assessment is not sufficient evidence that it behaves as intended — the same structural gap that exists when coding agents merge PRs validated only by tests they wrote themselves.

Does ASSERT test the applications that coding agents build?

No. ASSERT and ACS evaluate agent behavior: did the agent follow policy, stay in scope, and avoid unsafe actions? They do not validate whether the application the agent built actually works end to end in a browser, the way a real user would experience it. That is a different layer of the stack, and it is where DevAssure O2 operates.

Can the same AI agent write code and test it reliably?

No — not as a sole quality gate. A coding agent tests against the same mental model it used to write the code. Independent evaluation is required: either ASSERT-style policy evals for agent behavior, or browser-based PR testing like DevAssure O2 for application behavior. See our post on why coding agents cannot be testing agents.

How do I add independent PR testing for AI-generated code?

Add devassure-ai/devassure-action@v1 to your GitHub workflow. O2 analyzes each PR diff, maps the affected user journeys, generates and runs tests in a real browser, and comments the results on the PR, regardless of whether the code was written by a human, Cursor, Copilot, or another agent. The GitHub Marketplace listing offers $50 in free credits for 30 days.

What is DevAssure O2?

DevAssure O2 is an autonomous testing agent that tests your web application through the browser — the same way a human tester would — but driven by AI instead of scripts. You write test cases in plain English; O2 navigates your application, interacts with UI elements, and validates outcomes.

How is O2 different from Selenium or Playwright?

Scripted tools require selectors that break when the UI changes. O2 uses intent-driven testing: you describe what the user does, and the agent figures out how to execute it using visual reasoning. When a button moves or gets a new CSS class, O2 still finds it because the intent has not changed.

Does O2 require test scripts or selectors?

No. O2 tests are written as plain-English YAML, with no CSS selectors, XPath, or page object models. There is no second codebase of test scripts to maintain when your application UI changes.

Can O2 test Salesforce Lightning or Flutter Web?

Yes. O2 handles Shadow DOM and dynamic IDs in Salesforce Lightning natively, and uses visual reasoning on canvas-rendered Flutter Web UIs where traditional DOM-based tools cannot select elements at all.

How do I run O2 in CI/CD?

Add devassure-ai/devassure-action@v1 to your GitHub Actions workflow for one-line PR testing, or run devassure run-tests from Jenkins, CircleCI, GitLab CI, or any CI system via the DevAssure CLI.
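For CI systems without a ready-made action, the CLI route can be wrapped in a small script. The sketch below is a rough illustration under two assumptions: that the `devassure` binary is already installed and authenticated on the build agent, and that `devassure run-tests` signals failure through a non-zero exit code. No flags are passed because none are documented here. The function name `run_o2_tests` and its `runner` parameter are hypothetical.

```python
import subprocess
from typing import Callable, List


def run_o2_tests(runner: Callable = subprocess.run) -> int:
    """Invoke the DevAssure CLI and return its exit code.

    `runner` defaults to subprocess.run; it is injectable so the wrapper
    can be exercised without the CLI installed.
    """
    command: List[str] = ["devassure", "run-tests"]
    # check=False: return the exit code to the caller instead of raising.
    result = runner(command, check=False)
    return result.returncode
```

In a pipeline step, you would pass the returned code to `sys.exit` so that a failing test run fails the build.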

Is DevAssure SOC 2 certified?

Yes. DevAssure is SOC 2 Type II certified. Test credentials are stored encrypted, session recordings go to your configured archive location, and O2 runs in an isolated browser context per test session.

2 posts tagged with "PR Testing"

Microsoft Just Built a Framework to Test AI Agents.

Divya Manohar

Co-Founder and CEO, DevAssure

Short answer

At Microsoft Build 2026, Microsoft shipped ASSERT (policy-driven agent evaluation) and ACS (runtime agent governance) — because the agent that writes the code cannot be the agent that grades the code. That is the same principle behind DevAssure O2: independent, browser-based testing on every PR, written in plain English, with no scripts to maintain.

At Microsoft Build 2026, Microsoft announced something that quietly confirms the core thesis behind DevAssure: as AI agents take over more of the software development lifecycle, the agent that writes the code cannot be the agent that grades the code.

The announcement was a pair of open-source projects — ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) and the Agent Control Specification (ACS) — designed to give developers a portable, framework-agnostic way to evaluate and govern AI agents before their behavior ships to production. Coming from the company now positioning itself as the "agent-first" platform for enterprise development, this is a meaningful signal about where the industry is heading.

I want to walk through what Microsoft actually shipped, why it matters beyond agent safety, and what it means for teams where 30–40% of code is already AI-generated — because the validation gap Microsoft just named at the agent layer is the same gap most engineering teams still have at the application layer.

Introducing DevAssure O2 | The Autonomous Testing Agent for Every Pull Request

Divya Manohar

Co-Founder and CEO, DevAssure

Software testing has changed. Your tools should too.

Two years ago, the best we could offer developers was a faster way to write test scripts. Record your actions, generate a script, maintain it when the UI changes, debug it when it breaks.

That model is reaching its limits.

In 2026, AI coding tools write 20% or more of new code at companies like Google and Microsoft. Developers ship multiple PRs per day. Release cycles have compressed from weeks to days. And the old approach — write scripts, maintain scripts, fix flaky scripts — can't keep pace with how fast code moves.

DevAssure started as a low-code test automation platform. We built a recorder, a visual test builder, self-healing locators, and test data management. Hundreds of teams used it to automate faster than they could with Selenium or Playwright alone.

But we kept hearing the same thing from our users:

"The automation is faster, but we're still spending 30–40% of our time maintaining tests."

That feedback led us to rethink the problem entirely. The result is O2 Agent — an autonomous testing agent that reads your code changes, generates targeted tests, and executes them in a real browser. No scripts. No selectors. No maintenance.

This post introduces what DevAssure O2 is, how it works, and why we built it.
