Why does asking Cursor to write tests not fix vibe-coded PRs?

The coding agent and the test author share the same assumptions. Tests often assert what the model intended, not what the running app does. PR-level testing needs an independent agent that evaluates the change cold, like a user would.

Is this different from CodeRabbit or AI code review?

Yes. Review tools analyse source. A testing agent exercises the deployed or preview app — catching broken checkout flows, wrong totals, and missing validation that never appear as obvious diff comments.

Does DevAssure O2 work with Cursor and GitHub together?

Yes. Install the Invisible (QA) Agent from Open VSX in Cursor for pre-push feedback, then add devassure-ai/devassure-action@v1 on pull_request. Same O2 agent in the IDE and CI.

What preview URL does O2 need for a vibe-coded PR?

A reachable staging or Vercel preview URL for that PR branch: the same URL you would manually click through. O2 drives the app in a browser, so it cannot test a change without a running environment.

Why Your Vibe-Coded PR Keeps Breaking Production in 2026 (And How to Catch It Before Merge)

Divya Manohar

Co-Founder and CEO, DevAssure

Short answer

Vibe-coded PRs break production because speed outruns validation: the diff looks fine in review, but behaviour regresses. Fix it at the pull request with an agent that tests the running app on each PR — not with more AI-written unit tests or line-by-line review alone.

You merged on Friday. Checkout looked fine in the diff. By Monday, support tickets say promo codes stop applying after a failed payment retry.

The PR was 400 lines, mostly written by Cursor in an afternoon. Two approvals. Green CI — because unit tests passed and nobody had an E2E check for that path.

That handoff — coding agent → pull request → production — is where vibe coding hurts. Not in the editor. Not in the blog post about whether AI is “good at code.” In the moment between open PR and merge.

If you want the toolchain setup for Cursor specifically, read How to Test Cursor-Generated Code first. This post is the sequel: what has to happen on the PR so vibe-coded changes stop slipping through.

The numbers: AI PRs carry more defects before anyone clicks “Merge”

CodeRabbit’s State of AI vs Human Code Generation report (December 2025, 470 open-source PRs) is blunt:

AI-co-authored PRs: 10.83 issues per PR on average
Human-only PRs: 6.45 issues per PR
Roughly 1.7× more issues overall, with higher rates of logic, security, and maintainability findings
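The "roughly 1.7×" figure follows directly from the two averages quoted above. A quick check of the arithmetic, using only the numbers cited from the report:

```python
# Averages as quoted above from the CodeRabbit report.
ai_issues_per_pr = 10.83
human_issues_per_pr = 6.45

ratio = ai_issues_per_pr / human_issues_per_pr
extra_issues = ai_issues_per_pr - human_issues_per_pr

print(f"ratio: {ratio:.2f}x")                      # ratio: 1.68x
print(f"extra issues per PR: {extra_issues:.2f}")  # extra issues per PR: 4.38
```

Put differently, the average AI-co-authored PR in that sample carries a little over four additional findings for a reviewer to work through.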

Code review tools surface problems in the diff. They do not click through checkout. Many of those 10.83 findings never become a failing E2E check — because there is no E2E check wired to the PR.

So teams ship anyway. Velocity wins until production proves the gap.

For the broader “why AI code needs a different validation model” argument, see The Vibe Coding Quality Gap. Here we stay narrow: the PR is the last gate before users see the bug.

Why AI code fails at the PR stage specifically

Vibe coding fails in predictable ways that unit tests and lint rules miss:

Hallucinated APIs and “almost right” integrations

The model wires a webhook handler, a Stripe call, or an internal SDK method that reads plausibly but does not exist, or exists with different parameters. TypeScript might still compile if types are loose. The PR looks professional. The first real request returns a 500.
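A minimal sketch of that failure mode, written in Python for brevity. `PaymentsClient`, `create_refund`, `refund_charge`, and the event fields are all hypothetical names invented for illustration, not any real SDK. The point is only that nothing complains until a request actually arrives:

```python
class PaymentsClient:
    """Hypothetical stand-in for a real SDK; it only offers create_refund()."""

    def create_refund(self, charge_id: str, amount_cents: int) -> dict:
        return {"charge": charge_id, "refunded": amount_cents}


def handle_refund_webhook(client, event: dict) -> dict:
    # Plausible-looking but wrong: the stand-in SDK has no refund_charge().
    # Nothing flags this at import time; it fails only when the handler runs.
    return client.refund_charge(event["charge_id"], amount=event["amount_cents"])


def simulate_request(event: dict) -> int:
    """Return the HTTP status a minimal server would send for this event."""
    try:
        handle_refund_webhook(PaymentsClient(), event)
        return 200
    except AttributeError:
        return 500


status = simulate_request({"charge_id": "ch_123", "amount_cents": 2000})
print(status)  # 500
```

A diff reviewer sees a tidy handler. Only something that sends the request sees the 500.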

Missing edge cases on the happy path

AI excels at the story you prompted: “add discount at checkout.” It often skips retry after decline, expired session, double-submit, empty cart with coupon. Reviewers skim the golden path; users hit the rest.

Silent regressions outside the diff mental model

You changed pricing in checkout/. The agent also “helpfully” refactored a shared hook used on the billing settings page. The PR description does not mention settings. Review focuses on checkout files. Regression lives one import away.

Zero net-new test coverage

Vibe-coded features routinely ship with 0% new E2E coverage. The PR adds behaviour. The test folder does not. CI stays green. Confidence is cosmetic.
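One rough way to spot this pattern is to check whether a PR's changed files include any tests at all. The sketch below assumes a naming convention (test directories called `tests`, `e2e`, or `__tests__`, and `.test.` / `.spec.` file names) and uses made-up file paths; adjust both for your repository.

```python
from pathlib import PurePosixPath

# Assumed convention, not a standard: where tests live in this example repo.
TEST_DIRS = {"tests", "e2e", "__tests__"}


def is_test_file(path: str) -> bool:
    p = PurePosixPath(path)
    in_test_dir = any(part in TEST_DIRS for part in p.parts)
    return in_test_dir or ".test." in p.name or ".spec." in p.name


def adds_behaviour_without_tests(changed_paths: list[str]) -> bool:
    """True when a diff touches source files but no test files."""
    source = [p for p in changed_paths if not is_test_file(p)]
    tests = [p for p in changed_paths if is_test_file(p)]
    return bool(source) and not tests


# Hypothetical file list for a vibe-coded checkout PR.
pr_files = ["checkout/pricing.ts", "checkout/retry.ts", "hooks/useCart.ts"]
print(adds_behaviour_without_tests(pr_files))                             # True
print(adds_behaviour_without_tests(pr_files + ["e2e/checkout.spec.ts"]))  # False
```

A check like this only tells you that coverage did not grow. It says nothing about whether the app still works, which is the gap the rest of this post is about.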

None of this is unique to AI. It is amplified when every developer ships 3× larger PRs 3× faster.

What most teams do today (and why it breaks)

Approach	What it catches	What vibe-coded PRs still ship
Human code review	Obvious mistakes, style, some logic	Behavioural bugs in 400-line AI diffs
Lint / SAST	Patterns, some security smells	Wrong totals, broken flows, racy UI
Unit tests	Functions in isolation	Integration and user journeys
“We’ll test in staging”	Sometimes	Often skipped when staging lags behind
Ask the same AI to write tests	Syntax-level coverage	Same blind spots as production code

Early-2026 posts from vendors like ContextQA, TwoCents, and Testkube on how to test AI-generated code lean on static analysis, unit tests, and review discipline. All useful. None replace “did the app actually work when this PR landed?”

That question is behavioural. It belongs at the PR boundary.

Why human review does not scale when everyone is vibe coding

When one senior engineer reviews occasional AI-assisted PRs, review works.

When every engineer vibe-codes:

PRs get longer (more files, more generated boilerplate)
Reviewers spend time on naming and structure, not journeys
The 90th-percentile AI PR hits 26 review findings in CodeRabbit’s data — review fatigue is real
Approvals become rubber stamps with “LGTM, didn’t run it locally”

You cannot hire reviewers linearly with Copilot output.

The scalable move is not “review harder.” It is to automate behaviour checks on every PR the way you automate lint, without asking humans to author hundreds of Playwright files.

The category you need: an agent that tests the PR

Split the problem cleanly:

Code review answers: Is this diff reasonable source code?
Static analysis answers: Does this match known bad patterns?
Unit tests answer: Do these functions return expected values in isolation?
PR behavioural testing answers: If a user walks the affected flows in a real browser, does the app still work?

That last category is what autonomous PR testing covers. An agent:

Reads the PR diff (not your chat history with Cursor)
Maps blast radius — which flows could this change break?
Generates targeted scenarios for that scope
Executes against your preview or staging URL
Posts a pass/fail check on the PR
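To make the "blast radius" step concrete, here is a minimal sketch of one simple way to map changed files to user flows: walk the import graph backwards from each changed module until you reach a page. The graph, module names, and flow names are hypothetical, and this is not a description of how any particular tool implements the step.

```python
from collections import deque

# Hypothetical import graph: module -> modules it imports.
IMPORTS = {
    "pages/checkout": ["checkout/pricing", "hooks/useCart"],
    "pages/billing-settings": ["hooks/useCart"],
    "pages/onboarding": ["hooks/useProfile"],
    "checkout/pricing": [],
    "hooks/useCart": [],
    "hooks/useProfile": [],
}

# Hypothetical mapping from entry-point pages to user journeys.
FLOWS = {
    "pages/checkout": "checkout",
    "pages/billing-settings": "billing settings",
    "pages/onboarding": "onboarding",
}


def affected_flows(changed: list[str]) -> set[str]:
    """Walk reverse imports from the changed modules up to the pages."""
    importers: dict[str, list[str]] = {}
    for module, deps in IMPORTS.items():
        for dep in deps:
            importers.setdefault(dep, []).append(module)

    seen = set(changed)
    queue = deque(changed)
    while queue:
        module = queue.popleft()
        for parent in importers.get(module, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return {FLOWS[m] for m in seen if m in FLOWS}


flows = affected_flows(["checkout/pricing", "hooks/useCart"])
print(sorted(flows))  # ['billing settings', 'checkout']
```

Note that billing settings shows up even though the PR is nominally about checkout. That is the shared-hook regression described earlier, surfaced before anyone has to remember it.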

No permanent test suite. No “ask Claude to write Jest tests” loop.

This is vibe coding testing done right: the coding agent and the testing agent are separate. Independence is the product.

Compare with the generic PR automation in How to automatically test every pull request in 2026: the mechanics are the same, and this article supplies the vibe-coded why.

How DevAssure O2 tests a vibe-coded PR

DevAssure O2 is built for the handoff: you finished vibing in the IDE, you opened the PR, now validate before merge.

What O2 does not do: re-read your Cursor thread, approve line lengths, or run only tests you checked in last year.

What O2 does:

Diff-first: understands what changed between your branch and main
Impact map: connects files to user journeys (checkout, onboarding, settings)
Generate scenarios: plain-English steps for flows in scope — e.g. apply promo, fail payment, retry, expect discount still applied
Run in Chrome: headless in GitHub Actions against your preview URL
Report on the PR: failed check, failure narrative, screenshots / session replay

Semantic element resolution means a renamed button or a changed data-testid does not brick the run the way a hard-coded selector would.

For concept background on agent-driven validation, see What is Vibe Testing.

Real example: checkout caught at PR vs caught in production

Scenario: Cursor adds “apply discount on payment retry” after a failed card charge.

What code review sees: New branch in handlePaymentRetry, tests for a helper, clean types. LGTM.

What O2 runs on the PR (against https://preview-1842.yourapp.dev):

Scenario: promo persists after failed payment retry
Add item to cart ($100)
Apply code SAVE20
Use test card that declines
Retry payment with valid card
Assert order total is $80

Result: FAIL — total remained $100 after retry
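The scenario above does not show the defect itself, so here is one hypothetical way a bug like this can pass a helper-level unit test and still fail the journey. The `Checkout` class, the `PROMOS` table, and the reset-on-decline behaviour are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

PROMOS = {"SAVE20": 20}  # code -> percent off (hypothetical)


def apply_promo(subtotal: int, code: Optional[str]) -> int:
    """Helper covered by the PR's unit test: discounted total in dollars."""
    percent = PROMOS.get(code, 0) if code else 0
    return subtotal * (100 - percent) // 100


@dataclass
class Checkout:
    subtotal: int
    promo_code: Optional[str] = None
    total: int = 0

    def apply_code(self, code: str) -> None:
        self.promo_code = code
        self.total = apply_promo(self.subtotal, code)

    def pay(self, card_ok: bool) -> bool:
        if not card_ok:
            # Hypothetical bug: a declined charge clears the session's promo.
            self.promo_code = None
        return card_ok

    def handle_payment_retry(self, card_ok: bool) -> bool:
        # Recomputes the total from whatever promo is still on the session.
        self.total = apply_promo(self.subtotal, self.promo_code)
        return self.pay(card_ok)


# Unit test of the helper, in isolation: passes.
assert apply_promo(100, "SAVE20") == 80

# Behavioural scenario, end to end, following the steps listed above.
checkout = Checkout(subtotal=100)
checkout.apply_code("SAVE20")
checkout.pay(card_ok=False)                  # test card declines
checkout.handle_payment_retry(card_ok=True)  # retry with a valid card
print(checkout.total)  # 100, not the expected 80
```

Every function is individually reasonable and the helper test is green. The failure only exists in the sequence, which is why it takes a walk through the flow to find it.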

On the PR: red DevAssure O2 check, comment with screenshot, you fix the idempotency bug before merge. Support never sees it.

Without PR behavioural testing: merge Friday, deploy Sunday, 47 support tickets Monday, hotfix Tuesday, postmortem Wednesday. Same bug. Different calendar.

That is the difference between AI code quality on a PR as a review metric and AI code quality on a PR as proof that a user actually completed checkout.

Static analysis vs behavioural PR testing (quick contrast)

	Static analysis / AI review on diff	O2 on PR
Input	Source files	Running app + diff
Catches	Patterns, many logic smells in code	Broken flows, wrong UI state
Blind spot	“Looks correct” integrations	N/A for behaviour
Maintenance	Low	No script library — agent per PR
Best for	Every commit	Merge gate on vibe-coded volume

Run both. Do not pretend review replaces a browser.

Setup: one step in your Cursor + GitHub workflow

You already have Cursor and GitHub. Add CI validation in one commit.

1. Secret — DEVASSURE_TOKEN in repo settings (sign up if needed).

2. Workflow — .github/workflows/devassure-o2.yml:

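name: DevAssure O2

# Run on pull requests that target main
on:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Full git history instead of a shallow clone
          fetch-depth: 0
      - uses: devassure-ai/devassure-action@v1
        env:
          # Token stored as a repository secret (step 1)
          DEVASSURE_TOKEN: ${{ secrets.DEVASSURE_TOKEN }}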

3. Preview URL — configure your staging/preview base URL the way your team already exposes Vercel/Netlify PR previews (see GitHub Marketplace action docs).

4. (Optional) Cursor sidebar — Invisible (QA) Agent on Open VSX for git-aware runs before you push. Same agent, earlier feedback.

5. Branch protection — require DevAssure O2 before merge.

Detailed PR mechanics: automatically test every pull request. Cursor-specific IDE tips: test Cursor-generated code.

What to do on a failing vibe-coded PR

Treat O2 like a senior QA who actually exercised the feature:

Read the failed scenario — it is written in user language, not stack traces
Reproduce locally if needed (often the agent is right)
Fix the product code — or revert the vibe-coded chunk
Push — O2 re-runs on the new diff

Do not “fix the test.” There is no checked-in spec to patch.

Related reading

How to Test Cursor-Generated Code — companion guide (IDE + Actions)
The Vibe Coding Quality Gap — why independent agents beat more AI tests
How to automatically test every PR in 2026 — Playwright vs autonomous gate
DevAssure O2 on GitHub Marketplace

Frequently asked questions

How do I test AI-generated code on a pull request?

Add a pull_request GitHub Action that validates the diff. Static analysis and unit tests help but miss user-visible regressions. An autonomous agent like DevAssure O2 reads the PR, generates behavioural E2E tests, runs them in a browser, and posts a check, without you authoring Playwright specs.