Flaky Test Mitigation

A flaky test is one that passes and fails against identical code, and left unmanaged it corrodes the entire signal your suite produces. Once engineers learn that a red build might just be noise, they start re-running pipelines reflexively, merging through failures, and eventually ignoring the test results altogether — the exact opposite of what a test pyramid strategy is supposed to deliver. This guide treats nondeterminism as an engineering defect with a defined lifecycle: detect it, contain it without hiding real regressions, fix the root cause, and only then return the test to the trusted lane. The techniques here apply across the JavaScript stack, but the worked examples use Vitest as the primary runner and Playwright for browser-level checks, with concrete knobs for retries, quarantine annotations, and seeded fixtures that you can adopt incrementally.

Figure: Flaky test lifecycle and quarantine states. A test moves from Trusted (merge gate) to Suspect (passed on retry) on a flaky failure, then into Quarantine (non-blocking lane) once it is tagged flaky and isolated. From Quarantine it either returns to Trusted after a fix and 50 green runs, or moves to Deleted when the deadline is missed and no owner fixes it.

Architectural Scope & Boundaries

Mitigation work sits at the seam between test authoring and CI orchestration, and it touches all three tiers of the pyramid differently. Unit tests are rarely flaky for environmental reasons; when they are, the cause is almost always shared mutable state or unseeded randomness, which is fixable at the source. Integration tests flake on timing, ordering, and leaked module state. End-to-end and browser tests flake on real network latency, animation timing, and resource contention under parallel load. The strategies in this section are deliberately tiered to match those causes.

This material covers four things and explicitly excludes a fifth. It covers: classifying a failure as a genuine regression versus nondeterminism; bounding retries so they buy stability without masking bugs; isolating unstable specs into a separate lane with measurable exit criteria; and removing the most common source of nondeterminism — unseeded data and uncontrolled time. It does not cover writing the underlying assertions or component harnesses; for that, see Playwright component testing and the broader Component & Integration Testing work. Mitigation assumes the test is correct in intent and only its determinism is in question.

The boundary that matters most is between containment and concealment. A retry that silently turns a real intermittent bug green is concealment; a retry that surfaces the flake in a report while keeping the merge queue moving is containment. Every knob in this section is chosen to stay on the containment side of that line.
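The containment rule can be sketched as a small classifier. This is an illustrative model, not any runner's API: the names `Verdict`, `classify`, and `gate` are hypothetical, and it assumes the runner records one pass/fail flag per attempt.

```typescript
type Verdict = 'passed' | 'flaky' | 'failed';

// attempts[i] is true when attempt i passed (index 0 is the first run).
function classify(attempts: boolean[]): Verdict {
  if (attempts.length === 0) throw new Error('no attempts recorded');
  if (attempts[0]) return 'passed';
  // Failed first, then passed on some retry: flaky. Never passed: failed.
  return attempts.some(Boolean) ? 'flaky' : 'failed';
}

// Containment: a flaky verdict does not block the merge, but it is always reported.
function gate(verdict: Verdict): { blocksMerge: boolean; reported: boolean } {
  return { blocksMerge: verdict === 'failed', reported: verdict !== 'passed' };
}

const cleanPass = gate(classify([true]));
const passedOnRetry = gate(classify([false, true]));
const realFailure = gate(classify([false, false, false]));
```

Concealment would be the variant where `reported` is false for a flaky verdict; the sketch keeps the two flags separate so that case is easy to spot in review.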

Prerequisites

  • A vitest.config.ts (see the Vitest configuration setup baseline).
  • @playwright/test installed, with a playwright.config.ts.
  • @faker-js/faker (v8+) if your fixtures generate synthetic data.
  • A cost-benefit analysis of test layers, so quarantine decisions weigh the value of each test honestly.

Step-by-Step Implementation

The lifecycle below moves a suspect test from detection through to either a clean fix or a justified removal. Each step has a focused, runnable configuration.

Step 1 — Make flakes observable before you react to them. You cannot manage what you cannot count. Configure Playwright to retain a trace on the first retry so every flake produces a forensic artifact rather than a vanished failure.

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  reporter: [['list'], ['json', { outputFile: 'results.json' }]],
  use: {
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});
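To turn results.json into a count, a script can walk the report and list the specs that flaked. The sketch below assumes the JSON reporter nests suites and specs and gives each test a status of 'expected', 'unexpected', 'flaky', or 'skipped'; check that shape against the report your Playwright version emits. The function name and the sample data are hypothetical.

```typescript
// Minimal structural types for the parts of the report this sketch reads.
type TestStatus = 'expected' | 'unexpected' | 'flaky' | 'skipped';
interface ReportTest { status: TestStatus }
interface ReportSpec { title: string; tests: ReportTest[] }
interface ReportSuite { title: string; specs?: ReportSpec[]; suites?: ReportSuite[] }
interface Report { suites: ReportSuite[] }

// Walk nested suites depth-first and collect titles of specs with a flaky test.
function collectFlakyTitles(report: Report): string[] {
  const flaky: string[] = [];
  const visit = (suite: ReportSuite): void => {
    for (const spec of suite.specs ?? []) {
      if (spec.tests.some((t) => t.status === 'flaky')) flaky.push(spec.title);
    }
    for (const child of suite.suites ?? []) visit(child);
  };
  report.suites.forEach(visit);
  return flaky;
}

// Hand-written sample standing in for the parsed contents of results.json.
const sampleReport: Report = {
  suites: [{
    title: 'checkout.spec.ts',
    specs: [
      { title: 'checkout completes', tests: [{ status: 'flaky' }] },
      { title: 'cart renders', tests: [{ status: 'expected' }] },
    ],
    suites: [{
      title: 'payments',
      specs: [{ title: 'card declined', tests: [{ status: 'unexpected' }] }],
    }],
  }],
};

const flakyTitles = collectFlakyTitles(sampleReport);
console.log(flakyTitles); // [ 'checkout completes' ]
```

In a real pipeline the sample object would be replaced by the parsed report file, and the resulting list stored per run so the trend is visible over time.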

Step 2 — Bound retries with a deliberate budget, not an open door. Two retries is a common ceiling: it absorbs genuine one-in-a-thousand environmental blips while keeping a persistently failing test visibly red. Set retries to 0 locally so authors feel their own flakes immediately.

// vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Vitest retries individual tests; keep it low and CI-only.
    retry: process.env.CI ? 2 : 0,
    reporters: ['default', 'json'],
    outputFile: { json: './vitest-results.json' },
  },
});
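A little arithmetic shows what a two-retry budget does and does not absorb. The sketch assumes every attempt fails independently with the same probability, which real flakes only approximate; the function name is hypothetical.

```typescript
// Probability that a test ends red, which requires every attempt to fail.
// Assumption: attempts fail independently with the same probability.
function probabilityStillRed(failureRate: number, retries: number): number {
  const attempts = retries + 1; // the first run plus each retry
  return Math.pow(failureRate, attempts);
}

const blip = probabilityStillRed(0.001, 2);   // rare environmental blip: about 1e-9
const broken = probabilityStillRed(1, 2);     // persistently failing test: stays red
const realBug = probabilityStillRed(0.02, 2); // genuine 1-in-50 defect: about 8e-6
```

The last case is the uncomfortable one: under these assumptions a real one-in-fifty defect almost always ends green, which is why the retry budget has to be paired with a visible flaky count rather than trusted on its own.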

Step 3 — Tag the unstable specs. Annotate suspect tests so tooling can route them. Playwright supports tags directly in the title; Vitest uses a custom annotation convention you can filter on.

// example.spec.ts (Playwright)
import { test, expect } from '@playwright/test';

test('checkout completes @flaky', async ({ page }) => {
  await page.goto('/checkout');
  await expect(page.getByRole('status')).toHaveText('Order placed');
});
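
// example.test.ts (Vitest): name-based tagging filtered in CI
import { test, expect } from 'vitest';
// Hypothetical module under test; adjust the path to your project.
import { computePrice } from './price';

test('[quarantine] settles async price calc', async () => {
  expect(await computePrice()).toBe(4200);
});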

Step 4 — Route tagged tests into a non-blocking lane. The main job excludes the quarantine tag and stays a hard merge gate; a second, non-blocking job runs only the tagged tests and reports trends without breaking the build.

# .github/workflows/test.yml (excerpt)
jobs:
  trusted:
    runs-on: ubuntu-latest
    steps:
      - run: npx playwright test --grep-invert @flaky
  quarantine:
    runs-on: ubuntu-latest
    continue-on-error: true   # informational, never blocks merge
    steps:
      - run: npx playwright test --grep @flaky

Step 5 — Attack the root cause with determinism. Most flakes that survive into quarantine are data- or time-driven. Freeze the clock and seed every random source so a fixture that fails on Tuesday at 23:59 UTC also fails on your laptop at noon.

// vitest.setup.ts
import { beforeEach, afterEach, vi } from 'vitest';
import { faker } from '@faker-js/faker';

beforeEach(() => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date('2026-06-21T12:00:00Z'));
  faker.seed(20260621); // identical synthetic data every run
});

afterEach(() => {
  vi.useRealTimers();
});
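What seeding buys can be shown without any library. The sketch below uses mulberry32, a small public-domain generator, purely as a stand-in; it is not the generator faker uses, and the `sample` helper is hypothetical.

```typescript
// mulberry32: a small 32-bit seeded generator, used here only to illustrate
// seeding. It is not the generator faker uses internally.
function mulberry32(seed: number): () => number {
  let state = seed | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // scale to [0, 1)
  };
}

// Draw `count` values from a fresh generator seeded with `seed`.
function sample(seed: number, count: number): number[] {
  const next = mulberry32(seed);
  return Array.from({ length: count }, () => next());
}

const firstRun = sample(20260621, 5);
const secondRun = sample(20260621, 5);
// Same seed, same sequence: the property a seeded fixture relies on.
const sameSequence = firstRun.every((value, i) => value === secondRun[i]);
```

Any randomness that bypasses the seeded generator, such as a direct Math.random() call inside a fixture, breaks this property, so it is worth searching the fixtures for unseeded sources.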

Step 6 — Promote a test back to the trusted lane on evidence, not hope. Once a fix lands, the test must demonstrate stability — for example, 50 consecutive green runs in the quarantine lane — before its tag is removed. This exit criterion is what separates mitigation from sweeping the problem under the rug.
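The exit criterion is easy to automate once run history is recorded. The following is a minimal sketch that assumes each quarantine-lane run is stored as 'pass', 'fail', or 'flaky' with the newest run last; the type and function names are hypothetical, and 50 is the example threshold from the text.

```typescript
type RunOutcome = 'pass' | 'fail' | 'flaky';

// Count consecutive clean passes at the end of the history (newest run last).
// A run that only passed on retry ('flaky') breaks the streak like a failure.
function trailingGreenStreak(history: RunOutcome[]): number {
  let streak = 0;
  for (let i = history.length - 1; i >= 0 && history[i] === 'pass'; i--) streak++;
  return streak;
}

function readyForPromotion(history: RunOutcome[], requiredGreens = 50): boolean {
  return trailingGreenStreak(history) >= requiredGreens;
}

// Fix landed, then 50 clean runs: eligible.
const afterFix: RunOutcome[] = ['fail', 'flaky', ...new Array<RunOutcome>(50).fill('pass')];
// 49 clean runs, one retry-pass, one clean run: the streak restarts at 1.
const tooEarly: RunOutcome[] = [...new Array<RunOutcome>(49).fill('pass'), 'flaky', 'pass'];
```

Counting a retry-pass as a streak breaker is a deliberate choice in this sketch: a test that still needs retries has not yet demonstrated the stability the criterion asks for.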

Configuration Reference Table

| Knob | Tool | Type | Default | Effect |
| --- | --- | --- | --- | --- |
| retries | Playwright | number | 0 | Re-runs a failed test up to N times; a test that passes on retry is reported as “flaky”, not “passed”. |
| retry | Vitest | number | 0 | Re-runs a failing test up to N times before marking it failed. |
| trace | Playwright | string | 'off' | 'on-first-retry' captures a full trace only when a test flakes, keeping artifacts cheap. |
| --grep / --grep-invert | Playwright | regex | none | Includes or excludes tests by title tag; the basis for the quarantine lane. |
| continue-on-error | CI job | boolean | false | Lets the quarantine job report without blocking the merge. |
| faker.seed(n) | faker | number | random | Pins the PRNG so generated fixtures are byte-identical across runs. |
| vi.setSystemTime(date) | Vitest | Date | system clock | Freezes Date.now() and timers for time-dependent assertions. |
| maxFailures | Playwright | number | 0 | Bails the run early after N failures to shorten feedback on broken builds. |
| fullyParallel | Playwright | boolean | false | Higher parallelism increases contention-driven flakes; tune per-suite. |
| flake budget | policy | percent | team-set | The retry-success rate above which a test is auto-quarantined. |
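The flake budget in the last row is a policy rather than a tool setting, so it needs a small amount of code to enforce. The sketch below assumes you already aggregate, per test, how many runs were observed and how many went green only after a retry; the names and the 2% budget are illustrative, not recommendations.

```typescript
interface TestStats {
  name: string;
  runs: number;          // total CI runs observed in the window
  passedOnRetry: number; // runs that went green only after a retry
}

// Retry-success rate: the share of runs that needed a retry to pass.
function retrySuccessRate(stats: TestStats): number {
  return stats.runs === 0 ? 0 : stats.passedOnRetry / stats.runs;
}

// Names of tests whose rate exceeds the budget (a fraction, e.g. 0.02 for 2%).
function overBudget(all: TestStats[], budget: number): string[] {
  return all.filter((s) => retrySuccessRate(s) > budget).map((s) => s.name);
}

const windowStats: TestStats[] = [
  { name: 'checkout completes', runs: 200, passedOnRetry: 9 }, // 4.5%
  { name: 'cart renders', runs: 200, passedOnRetry: 1 },       // 0.5%
];
const toQuarantine = overBudget(windowStats, 0.02);
console.log(toQuarantine); // [ 'checkout completes' ]
```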

Verification & Assertions

Confirm the machinery works before trusting it. After enabling retries with tracing, force a known intermittent failure and check that the report distinguishes a flaky outcome from a clean pass. Playwright’s summary will read something like 1 flaky rather than 1 passed, and a trace.zip will appear under test-results/. That distinction is the whole point: a green build with zero flaky entries is trustworthy, while a green build with a rising flaky count is a warning you can act on.

For seeded determinism, assert reproducibility directly. Generate a fixture twice within the same seeded context and assert deep equality; then run the file in isolation versus inside the full suite and confirm identical output. A divergence proves state is leaking across files — the signature failure mode that quarantining alone would only hide.

import { test, expect, beforeEach } from 'vitest';
import { faker } from '@faker-js/faker';

beforeEach(() => faker.seed(42));

test('seeded fixture is reproducible', () => {
  const a = faker.person.fullName();
  faker.seed(42);
  const b = faker.person.fullName();
  expect(a).toBe(b);
});

The quarantine lane is verified by inspecting CI: the trusted job must turn red on a real regression while the quarantine job stays informational. Open a PR that breaks a quarantined test and confirm the merge button remains enabled; break a trusted test and confirm it blocks.

Edge Cases & Failure Modes

Retries that mask a real intermittent bug. If a feature genuinely fails one request in fifty, retries will paper over it and ship the defect. Guard against this by treating a rising flaky rate as a regression signal in its own right — track the count, alert on growth, and never let “it passed on retry two” close an investigation. The companion guide on retrying flaky Playwright tests without masking bugs covers the trace-driven triage that keeps retries honest.

Quarantine becoming a graveyard. Tests dumped into the quarantine lane with no exit criteria accumulate forever, and coverage silently rots. Every quarantined test needs an owner and a deadline; if neither materializes, deleting the test is more honest than pretending it guards anything.

Order-dependent failures that seeding cannot fix. Seeding randomness and freezing time will not save a test that depends on another test having run first. Detect these by shuffling execution order (--sequence.shuffle in Vitest) and isolating the failures; the fix is proper teardown, not retries.
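A toy model makes the mechanism concrete. It is plain TypeScript with no test runner, and the names (`cart`, `addsItem`, `readsItem`, `failuresFor`) are hypothetical; the `resetBetween` flag stands in for a per-test teardown.

```typescript
// Shared module-level state that both "tests" touch.
let cart: string[] = [];

type TestCase = { name: string; run: () => boolean };

// Passes alone, but leaves an item behind because it has no teardown.
const addsItem: TestCase = {
  name: 'adds item',
  run: () => { cart.push('book'); return cart.length === 1; },
};
// Wrongly assumes an earlier test already populated the cart.
const readsItem: TestCase = {
  name: 'reads item',
  run: () => cart[0] === 'book',
};

// Run the tests in the given order and return the names of the failures.
function failuresFor(order: TestCase[], resetBetween: boolean): string[] {
  cart = [];
  const failed: string[] = [];
  for (const t of order) {
    if (!t.run()) failed.push(t.name);
    if (resetBetween) cart = []; // plays the role of afterEach teardown
  }
  return failed;
}

const original = failuresFor([addsItem, readsItem], false); // dependency hidden
const shuffled = failuresFor([readsItem, addsItem], false); // dependency exposed
const isolated = failuresFor([addsItem, readsItem], true);  // fails in any order
```

With teardown in place the dependent test fails in every order, which converts an intermittent failure into a deterministic one; the remaining repair is for that test to arrange its own data instead of inheriting it.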

Shared singletons across parallel workers. Module-level caches, a single MSW server, or a shared database connection will produce contention flakes that scale with fullyParallel. Scope state to the worker or reset it per-file, mirroring the reset discipline used in external service simulation.
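One way to scope state to the worker is to derive resource names from a worker identifier. The sketch below assumes the runner exposes that identifier through an environment variable such as TEST_WORKER_INDEX (Playwright) or VITEST_POOL_ID (Vitest); treat both variable names as assumptions to verify against your runner version. The function name and the database name are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Derive a per-worker resource name so parallel workers never share one database.
// Falls back to worker '0' when no identifier is present (e.g. a serial local run).
function workerScopedName(base: string, env: Env): string {
  const workerId = env['TEST_WORKER_INDEX'] ?? env['VITEST_POOL_ID'] ?? '0';
  return `${base}_w${workerId}`;
}

const playwrightWorker = workerScopedName('shop_test', { TEST_WORKER_INDEX: '3' });
const vitestWorker = workerScopedName('shop_test', { VITEST_POOL_ID: '2' });
const serialRun = workerScopedName('shop_test', {});
console.log(playwrightWorker, vitestWorker, serialRun);
// shop_test_w3 shop_test_w2 shop_test_w0
```

In a real setup file the second argument would be process.env. Taking the environment as a parameter keeps the helper pure and easy to unit test.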

Performance & CI Impact

Retries trade wall-clock time for stability, and the trade is asymmetric: a two-retry ceiling adds latency only to tests that actually fail, so a healthy suite pays almost nothing while a sick one pays loudly — which is the correct incentive. Tracing on-first-retry keeps artifact storage proportional to flake volume rather than total test count, avoiding the gigabytes that trace: 'on' would generate.
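The asymmetry can be put in numbers. This sketch computes the expected number of executions per test under a retry ceiling, assuming each attempt fails independently with the same probability; the function name and the sample rates are illustrative.

```typescript
// Expected number of executions for one test under a retry ceiling.
// Attempt k+1 only happens if the first k attempts all failed, so its
// probability of occurring is failureRate^k.
function expectedAttempts(failureRate: number, retries: number): number {
  let expected = 0;
  for (let k = 0; k <= retries; k++) expected += Math.pow(failureRate, k);
  return expected;
}

const healthy = expectedAttempts(0.001, 2); // about 1.001 runs per test
const sick = expectedAttempts(0.3, 2);      // 1 + 0.3 + 0.09 = about 1.39 runs per test
```

Under these assumptions a healthy test costs roughly a tenth of a percent extra, while a test that fails 30% of the time costs nearly 40% extra, before counting the trace artifacts its retries produce.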

The quarantine lane’s biggest performance win is psychological and structural: by removing unstable tests from the merge gate, you stop the cascade of full-pipeline re-runs that flakes provoke, which is often the single largest source of wasted CI minutes. Deterministic seeding has near-zero runtime cost and often lowers total run time by eliminating the retry rounds that data-driven flakes would otherwise trigger. When you measure the impact, fold it into the same ledger you use for balancing speed and coverage in monorepo testing so flake-mitigation spend is weighed against the feedback-loop time it buys back.

In-Depth Guides