Your Test Suite Passes. Your Users Found 6 Bugs. What Went Wrong?

Divya Manohar
Co-Founder and CEO, DevAssure

Last month I had a conversation with a CTO that stuck with me.

Their team has 3,400 tests. 94% coverage. A CI pipeline that runs on every PR. Tests pass reliably — less than 2% flaky rate. By every industry metric, this is a well-tested codebase.

They also had 6 production bugs in the past 30 days. All reported by users. All missed by the test suite.

I asked him to send me the bugs. Here's what they were:

  1. A modal didn't close when clicking outside it. Users had to refresh the page to dismiss a confirmation dialog.
  2. A price displayed as $1,299 in the cart but charged $12.99. Decimal formatting inconsistency between the display component and the payment API.
  3. The "Export to CSV" button worked in Chrome but broke in Safari, where it downloaded an empty file.
  4. A newly added field was editable for admins but displayed as read-only for regular users — the opposite of what it should have been. Permission logic was inverted.
  5. A search that returned 0 results showed the previous results instead of an empty state. Stale state from a React component not resetting.
  6. The onboarding flow skipped step 3 entirely when the user's timezone was UTC+0. A conditional that checked for a truthy timezone value — and 0 is falsy in JavaScript.

None of these are exotic edge cases. Every one of them is something a human using the app would hit within 5 minutes.

And none of them were caught by 3,400 tests at 94% coverage.

Why?

The gap between what tests verify and what users experience

I've been thinking about this gap for a long time — first as an engineer at Microsoft and eBay, and now as someone who builds testing tools. And I keep seeing the same pattern.

Most test suites verify that code does what the developer intended. They don't verify that the application behaves the way a user would expect.

This is a subtle but critical distinction. Let me break it down with the actual bugs from above.

Bug 1: Modal doesn't close on outside click

The team had a test for the modal:

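import { render, screen, fireEvent } from '@testing-library/react';
import '@testing-library/jest-dom'; // provides toBeVisible / toBeInTheDocument
import { ConfirmationModal } from './ConfirmationModal'; // path assumed for illustration

test('confirmation modal opens and closes', () => {
  render(<ConfirmationModal />);
  // Open the dialog via its trigger
  fireEvent.click(screen.getByText('Delete'));
  expect(screen.getByRole('dialog')).toBeVisible();
  // Close it via the Cancel button, the only dismissal path this test exercises
  fireEvent.click(screen.getByText('Cancel'));
  expect(screen.queryByRole('dialog')).not.toBeInTheDocument();
});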

The test verifies: modal opens on trigger, closes on Cancel button. Both pass. Coverage counts this as tested.

What the test doesn't verify: clicking outside the modal. Pressing Escape. Clicking the overlay. These are all standard modal interactions that every user expects. But the developer who wrote the modal test was thinking about the modal's API (open/close), not about every way a user might dismiss it.

Bug 3: CSV export works on Chrome, breaks on Safari

The test:

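import { exportToCSV } from './exportToCSV'; // path assumed for illustration

test('export generates CSV file', async () => {
  // Runs under Jest in Node.js: checks the generated Blob, not the browser download
  const blob = await exportToCSV(mockData);
  expect(blob.type).toBe('text/csv');
  expect(blob.size).toBeGreaterThan(0);
});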

This test runs in Node.js via Jest. It tests the data transformation logic — correctly. The CSV content is valid. The test passes.

But in Safari, the browser's Blob constructor handles the type parameter slightly differently when combined with URL.createObjectURL. The download triggers, but the file is empty. This is browser-specific runtime behavior, and a unit test running in Node.js never executes Safari's implementation, so it cannot catch it.

Bug 6: Timezone 0 is falsy in JavaScript

if (user.timezone) {
  showStep3(); // Timezone-specific onboarding
}

The developer's tests used the timezones 'America/New_York' and 'Asia/Kolkata', both truthy strings. The coverage tool reports this branch as tested.

But UTC+0, represented as the number 0, is falsy. The conditional skips step 3 for every user whose timezone is UTC+0. That includes Accra and Reykjavik year-round, and London and Lisbon outside daylight saving time.

The test covered the branch. It didn't cover the boundary.

Why coverage lies

I'm not the first person to say that coverage is a misleading metric. But I want to be specific about how it misleads, because the pattern is consistent.

Coverage measures which lines of code were executed. It doesn't measure which behaviors were validated.

You can have 100% line coverage and zero behavioral coverage. Here's a function with a "perfect" coverage test that validates nothing:

function calculateDiscount(price, code) {
  if (code === 'SAVE20') return price * 0.8;
  if (code === 'HALF') return price * 0.5;
  return price;
}

// "100% coverage" test:
test('calculateDiscount', () => {
  calculateDiscount(100, 'SAVE20');
  calculateDiscount(100, 'HALF');
  calculateDiscount(100, 'NONE');
});

Every line is executed. Coverage is 100%. But there are zero assertions. The function could return undefined for every input and the test would still pass.

This is an extreme example, but the pattern shows up in subtler forms everywhere:

  • Tests that assert the function "doesn't throw" but don't check the return value
  • Tests that verify a component "renders" but don't check what it renders
  • Tests that call an API endpoint and check the status code but don't validate the response body
  • Tests that verify state changes but don't verify the UI reflects those changes

In each case, the coverage tool says the code is tested. The user says otherwise.

The three layers most test suites miss

After looking at hundreds of production bug reports across multiple companies, I've noticed that most escaped bugs fall into three categories:

Layer 1: Interaction patterns

Users don't just click buttons in sequence. They:

  • Click outside modals to dismiss them
  • Double-click submit buttons (especially on slow connections)
  • Use keyboard navigation (Tab, Enter, Escape)
  • Resize their browser window mid-flow
  • Switch between tabs and come back
  • Hit the back button at unexpected moments
  • Copy-paste into input fields (which doesn't trigger onChange in some frameworks)

Most test suites test the "click button, see result" path. They don't test the many other ways a user might interact with the same element.

Layer 2: Cross-boundary behavior

The most costly bugs live at the boundary between two systems:

  • Frontend displays a price; backend charges a different amount (bug #2 above)
  • API returns data in one format; frontend expects another
  • Database stores a value; cache returns a stale version
  • Component A updates state; Component B doesn't re-render

Unit tests, by definition, test units in isolation. Integration tests can catch these, but most integration tests still mock the boundaries — they mock the API response, mock the database, mock the external service. The mock returns what the developer expects. The real system returns something slightly different.

Layer 3: Environment-specific behavior

Code that works in your test environment can break in production because:

  • Different browser engines (Safari's Blob behavior, Firefox's date parsing)
  • Different screen sizes (mobile viewport hides a button behind another element)
  • Different data volumes (the page works with 10 items, hangs with 10,000)
  • Different locales (number formatting, date formats, currency symbols, RTL text)
  • Different timezones (the UTC+0 bug above)

Tests that run in Node.js under Jest or Vitest never touch a real browser engine, so they can't catch browser-specific bugs. Tests running on a developer's MacBook can't catch mobile layout issues. Tests running with seeded data can't catch pagination performance problems.

What actually catches these bugs

I don't think the answer is "write more tests." The CTO I talked to has 3,400 tests. Adding 340 more wouldn't have caught bugs #1–6. The kind of testing matters more than the quantity.

Here's what I've seen work:

1. Test behaviors, not implementations

Instead of testing that handleClose() sets isOpen to false, test that a user can dismiss the modal in every way they'd try: Cancel button, outside click, Escape key, overlay click. The implementation can change; the expected behavior shouldn't.

The shift is from "does the function return the right value?" to "can the user accomplish the task?"

2. Test at the real boundary

If your frontend displays prices and your backend charges them, test the full path: create a real order, check the displayed amount, check the charged amount, verify they match. Don't mock the payment API — call it (in a sandbox). The bug lives at the boundary, so the test must cross the boundary.

3. Test with real browsers, not simulated DOMs

jsdom and happy-dom are great for fast feedback loops. They're terrible for catching browser-specific bugs. If your users use Safari, some tests need to actually run in Safari. Not all of them — but the critical paths (checkout, export, file upload, authentication) should run in real browsers.

4. Test with adversarial inputs

Your happy path tests use "John", "john@example.com", 100. Real users submit:

  • Empty strings
  • Strings with 500 characters
  • Unicode and emoji
  • Negative numbers
  • Zero (the falsy number)
  • null and undefined (from API responses that didn't include an optional field)
  • HTML tags (the accidental XSS test)

Your tests should include at least one adversarial input per field. Not because you expect users to type "<script>alert(1)</script>" — but because the same code path that handles weird input also handles edge cases in normal input.

5. Test the transitions, not just the states

Most tests verify state A (before) and state B (after). They don't test the transition itself:

  • What does the user see during the loading state?
  • What happens if they interact with the page while the API call is in flight?
  • What happens if they navigate away and come back?
  • What happens if the network is slow and they click twice?

Transitions are where flicker, stale state, double-submits, and race conditions live.

The uncomfortable truth

Here's what I've come to believe after years of working on this problem:

A test suite that passes gives you confidence. But confidence and correctness are different things.

A test suite can be perfectly green and your application can be full of bugs — not because the tests are bad, but because the tests are testing the wrong layer of reality. They're testing the code's internal logic. They're not testing the user's external experience.

The CTO's 3,400 tests verify that the codebase works as the developers intended. They don't verify that the application works as users expect. Those are two different things, and closing the gap between them is, I think, the most important unsolved problem in software quality today.

I don't have a clean answer. I'm working on one. But the first step is acknowledging that "all tests pass" is the beginning of quality, not the end.


I'm Divya Manohar. I've spent the last decade thinking about why tested software still breaks — first as an engineer at Microsoft and eBay, now as the co-founder of a testing company. I write about software quality, AI-generated code, and the gap between test coverage and real confidence. Find me on LinkedIn or at divya@devassure.io.