Retrying Flaky Playwright Tests Without Masking Bugs

Retries are the most misused tool in browser testing. Bolt retries: 3 onto a noisy Playwright suite and the build turns green, but you have not fixed anything — you have only changed how often the underlying defect ships unnoticed. This guide is for QA engineers and full-stack developers running Playwright 1.4x in CI who want retries to buy stability while keeping every intermittent failure visible and triaged. The goal is a configuration where a retry that succeeds is treated as a yellow flag worth investigating, never a green checkmark that closes the case. We lean on trace-on-retry artifacts to distinguish a genuine race condition in product code from environmental noise like a cold cache or a contended CI agent.

Root Cause Analysis

A Playwright test fails intermittently for one of two fundamentally different reasons, and conflating them is what makes retries dangerous. The first is test-side nondeterminism: an expect that races a network response, a hard-coded waitForTimeout, an animation that has not settled, or a selector that matches a transiently duplicated element. The second is product-side nondeterminism: the application itself genuinely fails one request in N because of a backend race, an unhandled rejection, or a caching bug. Retries are a legitimate remedy only for the first category.

The danger is that retries treat both categories identically: they re-run and report success, so a product bug that surfaces 2% of the time is laundered into a passing build. The symptom is a suite that is “green but slow”, with a quietly growing count of flaky outcomes that nobody reads. The deeper cause is usually that the suite leans on E2E checks for logic that a test pyramid strategy would push to a lower tier, so timing-sensitive flows are over-exercised at the most fragile tier. The fix is twofold: bound retries so they cannot hide a persistent failure, and capture enough forensic data on each flake to classify it correctly.

Reproducible Setup

Start from a minimal Playwright project and a deliberately flaky test so you can see the machinery work.

npm init -y
npm install -D @playwright/test
npx playwright install chromium

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  // The JSON report feeds the flaky-count gate in Step 3
  reporter: [['list'], ['html', { open: 'never' }], ['json', { outputFile: 'results.json' }]],
  use: {
    // Assumption: the app under test is served at this URL (BASE_URL is a hypothetical
    // variable name); relative paths passed to page.goto() need a baseURL to resolve
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000',
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});

// tests/checkout.spec.ts: intentionally racy to demonstrate triage
import { test, expect } from '@playwright/test';

test('order confirmation appears', async ({ page }) => {
  await page.goto('/checkout');
  await page.getByRole('button', { name: 'Place order' }).click();
  // BAD: textContent() reads the status once, so it races the async confirmation render
  expect(await page.getByRole('status').textContent()).toBe('Order placed');
});

Running npx playwright test locally with retries: 0 makes the author feel the flake immediately, which is the correct incentive. In CI the two-retry ceiling keeps the merge queue moving while still flagging the instability.
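
The spec above navigates to /checkout, which presumes an application is being served. For a self-contained reproduction, a stand-in page can be sketched with Node's built-in http module. Everything here is hypothetical scaffolding rather than part of any real product: the function names, the port, and the flakeRate parameter are assumptions chosen for illustration. The page updates the status immediately on most clicks and after a short delay on the rest, which is one way to make a one-shot read fail intermittently.

```typescript
import { createServer, type Server } from 'node:http';

// Page whose confirmation is usually rendered synchronously on click, but is
// deferred by 200 ms on a fraction of clicks (flakeRate), so a one-shot read races it.
export function checkoutHtml(flakeRate: number): string {
  return `<!doctype html>
<button id="place">Place order</button>
<div role="status"></div>
<script>
  const show = () => {
    document.querySelector('[role=status]').textContent = 'Order placed';
  };
  document.getElementById('place').addEventListener('click', () => {
    if (Math.random() < ${flakeRate}) setTimeout(show, 200);
    else show();
  });
</script>`;
}

// Builds the server without starting it; the caller decides when and where to listen.
export function createCheckoutServer(flakeRate = 0.2): Server {
  return createServer((req, res) => {
    if (req.url === '/checkout') {
      res.writeHead(200, { 'content-type': 'text/html; charset=utf-8' });
      res.end(checkoutHtml(flakeRate));
    } else {
      res.writeHead(404).end();
    }
  });
}

// To serve it locally: createCheckoutServer().listen(3000);
```

With the server listening on port 3000 and use.baseURL pointing at http://localhost:3000, an assertion that reads the status text exactly once should fail on roughly the configured fraction of runs, while an auto-retrying assertion should pass.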

Implementation

Step 1: Set a retry budget, not an open door. Two is a sound default. It absorbs rare infrastructure blips while keeping a test that fails three times in a row unambiguously red. Avoid retries: 3 or higher unless a specific suite has a documented justification. A test that fails intermittently with per-attempt probability p turns the build red only when every attempt fails, so each additional retry shrinks the chance of catching a real bug by another factor of p.

// playwright.config.ts (per-project override is also possible)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  maxFailures: process.env.CI ? 10 : 0, // bail early on a broken build; 0 means no limit
});
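
To put a number on what a retry budget costs in detection power, the arithmetic can be sketched as below. It assumes each attempt fails independently with the same probability p, which is a simplification (real flakes often correlate with agent load), and redBuildProbability is a name invented for this illustration.

```typescript
// Probability that a test failing independently with rate p on each attempt
// fails every attempt (1 initial run + `retries` re-runs), i.e. turns the build red.
export function redBuildProbability(p: number, retries: number): number {
  if (p < 0 || p > 1) throw new RangeError('p must be in [0, 1]');
  if (!Number.isInteger(retries) || retries < 0) {
    throw new RangeError('retries must be a non-negative integer');
  }
  return Math.pow(p, retries + 1);
}

// Print the red-build rate for a 2% intermittent failure at several retry budgets.
for (const retries of [0, 1, 2, 3]) {
  const red = redBuildProbability(0.02, retries);
  console.log(`retries=${retries}: red build probability ${red.toExponential(2)}`);
}
```

Under that independence assumption, a 2% product bug with two retries turns the build red about 8 times per million runs, which is why the flaky count, not the red/green status, has to carry the signal.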

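Step 2: Capture a trace only when it matters. trace: 'on-first-retry' records a full action-by-action timeline during the first retry of a failed test, so a flake leaves a replayable record of what the browser did without paying the recording and storage cost on every green run. The trace belongs to the retry, not to the original failing attempt; when the retry passes, it shows the passing run, which is still useful for comparison against the failure's screenshot and error output.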

use: {
  trace: 'on-first-retry',
},

Open the artifact with npx playwright show-trace test-results/.../trace.zip and step through DOM snapshots, network calls, and console output at the point of failure.

Step 3: Read the flake count as a first-class signal. Playwright’s JSON reporter (enabled by adding ['json', { outputFile: 'results.json' }] to the reporter list) records a flaky status distinct from expected (passed) and unexpected (failed). Surface it. A build with 0 flaky is trustworthy; a build with a non-zero and rising flaky count is a defect report, even though it is technically green.

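// scripts/flaky-gate.ts: fail CI if flaky count exceeds budget
// Assumes the JSON reporter wrote results.json and the script runs from the project root
import { readFileSync } from 'node:fs';

const results = JSON.parse(readFileSync('results.json', 'utf8'));

// Suites nest (file, then describe blocks), so collect specs recursively
const specsOf = (suite: any): any[] => [
  ...(suite.specs ?? []),
  ...(suite.suites ?? []).flatMap(specsOf),
];

const flaky = results.suites
  .flatMap(specsOf)
  .filter((spec: any) => spec.tests.some((t: any) => t.status === 'flaky')).length;

const BUDGET = Number(process.env.FLAKY_BUDGET ?? 0);
if (flaky > BUDGET) {
  console.error(`Flaky budget exceeded: ${flaky} > ${BUDGET}`);
  process.exit(1);
}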
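
A count tells you the gate tripped; triage needs to know which specs flaked. The sketch below walks the same report structure and returns a readable list. The interfaces cover only the fields this example reads, and the sample object is hand-written for illustration, not output from a real run.

```typescript
// Minimal shape of the fields this sketch reads from a Playwright JSON report.
interface ReportTest { status: 'expected' | 'unexpected' | 'flaky' | 'skipped'; }
interface ReportSpec { title: string; file: string; tests: ReportTest[]; }
interface ReportSuite { title: string; specs?: ReportSpec[]; suites?: ReportSuite[]; }

// Walk nested suites and return "file > title" for every spec with a flaky test.
export function flakySpecs(suites: ReportSuite[]): string[] {
  return suites.flatMap((suite) => [
    ...(suite.specs ?? [])
      .filter((spec) => spec.tests.some((t) => t.status === 'flaky'))
      .map((spec) => `${spec.file} > ${spec.title}`),
    ...flakySpecs(suite.suites ?? []),
  ]);
}

// Hand-written sample data, not output from a real run.
const sample: ReportSuite[] = [{
  title: 'checkout.spec.ts',
  specs: [{ title: 'order confirmation appears', file: 'checkout.spec.ts', tests: [{ status: 'flaky' }] }],
  suites: [{
    title: 'cart',
    specs: [{ title: 'updates total', file: 'checkout.spec.ts', tests: [{ status: 'expected' }] }],
  }],
}];

console.log(flakySpecs(sample));
```

Printing the list in the CI log next to the budget message gives whoever picks up the failure a starting point without opening the HTML report.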

Step 4: Classify each flake from its trace. This is the human step that keeps retries honest. Open the trace and ask: did the failure originate in the test’s timing assumptions or in the application’s behavior?

  • Test-side: the assertion fired before the UI settled; the network call in the trace completed after the failed expect. Fix the test with web-first assertions so it no longer relies on the retry.
  • Product-side: the trace shows a 500 response, an unhandled promise rejection in the console, or genuinely inconsistent server output. This is a bug ticket, not a retry candidate. If the test has to leave the merge gate in the meantime, handle that through the quarantining flaky tests in CI workflow, never through a higher retry count.
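
The two bullets above amount to a small decision rule, and writing it down keeps triage consistent between reviewers. The sketch below is one possible encoding: the field names and the 'unclassified' fallback are assumptions made for this example, and the inputs are what a person records after reading the trace, not something Playwright produces.

```typescript
// Observations a reviewer records while reading one trace (all field names are hypothetical).
interface TraceObservations {
  serverErrorStatus?: number;          // e.g. a 5xx seen in the network panel
  unhandledRejection: boolean;         // console shows an unhandled promise rejection
  responseSettledAfterAssert: boolean; // the relevant response finished after the failed expect
}

type FlakeClass = 'product-side' | 'test-side' | 'unclassified';

export function classifyFlake(obs: TraceObservations): FlakeClass {
  // Product evidence wins: a server error or rejection is a bug even if timing was also loose.
  if ((obs.serverErrorStatus ?? 0) >= 500 || obs.unhandledRejection) return 'product-side';
  if (obs.responseSettledAfterAssert) return 'test-side';
  return 'unclassified';
}

console.log(classifyFlake({ serverErrorStatus: 500, unhandledRejection: false, responseSettledAfterAssert: true }));
```

Checking product-side evidence first is a deliberate choice here: when both signals are present, fixing only the test's timing would hide the server error again.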

Step 5: Fix test-side flakes with web-first assertions and controlled time. Replace fixed sleeps and one-shot assertions with auto-retrying expectations, and pin nondeterministic inputs such as the clock. Clock and randomness control draw on the same time and date control strategies used elsewhere in the suite.

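import { test, expect } from '@playwright/test';

test('order confirmation appears', async ({ page }) => {
  // Start the page clock at a known instant before navigating (page.clock needs Playwright 1.45+);
  // it keeps ticking from there, so use page.clock.setFixedTime() to freeze Date.now() entirely
  await page.clock.install({ time: new Date('2026-06-21T12:00:00Z') });
  await page.goto('/checkout');
  await page.getByRole('button', { name: 'Place order' }).click();
  // Corrected: web-first assertion retries until the text matches or the timeout elapses
  await expect(page.getByRole('status')).toHaveText('Order placed', { timeout: 10_000 });
});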
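
Two other inputs named in this step, network timing and randomness, can be pinned in a similar way. The following is a sketch rather than a prescription: '/api/orders' is a hypothetical endpoint standing in for whatever request the button triggers, and 0.42 is an arbitrary constant.

```typescript
import { test, expect } from '@playwright/test';

test('order confirmation appears after the order request settles', async ({ page }) => {
  // Replace Math.random before any page script runs (0.42 is an arbitrary fixed value).
  await page.addInitScript(() => { Math.random = () => 0.42; });
  await page.goto('/checkout');

  // Start waiting before the click so the response cannot slip past unobserved.
  // '/api/orders' is a hypothetical endpoint; substitute the app's real one.
  const orderResponse = page.waitForResponse(
    (res) => res.url().includes('/api/orders') && res.request().method() === 'POST',
  );
  await page.getByRole('button', { name: 'Place order' }).click();
  expect((await orderResponse).ok()).toBe(true);

  await expect(page.getByRole('status')).toHaveText('Order placed');
});
```

Asserting on the response status before the UI check means a server error fails the test with a product-side message instead of surfacing as a vague text-mismatch timeout.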

Verification

After the changes, run the suite enough times to expose residual instability. A single green pass proves nothing about a flake; loop it.

# Repeat the suite to surface intermittent failures
npx playwright test --repeat-each=20 tests/checkout.spec.ts

A correctly fixed test reports 20 passed with 0 flaky across the repeated runs. The flaky label only appears when retries are enabled (in CI, or locally with --retries=2); with retries: 0 the same instability shows up as a plain failure. If you see entries like 1 flaky, the retry rescued the test: open the retained trace and continue triage. The HTML report’s per-test view labels retried-then-passed runs explicitly, so a stakeholder scanning the report can see the difference between “stable” and “passed on the second try”. The flaky-budget gate from Step 3 then turns a rising flake rate into a hard CI failure, which is the guarantee that retries are containing noise rather than concealing regressions.
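
How many repeats count as "enough" depends on how rare the flake is. As a rough guide, and assuming runs fail independently at a constant rate p (an idealization), the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The helper names below are invented for this sketch.

```typescript
// Chance that at least one of `repeats` independent runs fails, given per-run flake rate p.
export function detectionProbability(p: number, repeats: number): number {
  return 1 - Math.pow(1 - p, repeats);
}

// Smallest repeat count whose detection probability reaches `confidence`.
export function repeatsNeeded(p: number, confidence: number): number {
  if (p <= 0 || p >= 1 || confidence <= 0 || confidence >= 1) {
    throw new RangeError('p and confidence must be strictly between 0 and 1');
  }
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - p));
}

console.log(detectionProbability(0.02, 20).toFixed(2)); // prints "0.33"
console.log(repeatsNeeded(0.02, 0.95));                 // prints 149
```

For a 2% flake, 20 repeats expose it only about one time in three, so a clean --repeat-each=20 run is weak evidence for rare failures; raise the repeat count for specs with a history of low-rate flakes.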

Troubleshooting

Symptom: tests pass on retry but the feature is genuinely broken in production. This is retries masking a product bug. Diagnosis: inspect the failed attempt and look for server errors or console rejections rather than timing gaps. Because trace: 'on-first-retry' records the retry rather than the first (failed) attempt, set trace: 'retain-on-failure' for that spec when you need a trace of the failing run itself. Fix: file the defect, lower or remove retries for that spec while it is investigated, and add a lower-tier test that reproduces the failure deterministically.
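
One way to lower retries for a single spec without touching the global config is a per-file override. The sketch below reuses the checkout spec from the setup section as the example file; treat it as an illustration of the override calls rather than a required layout.

```typescript
// tests/checkout.spec.ts: per-file overrides while the defect is under investigation
import { test, expect } from '@playwright/test';

// No retries for this file, so the intermittent failure stays red in CI.
test.describe.configure({ retries: 0 });
// Keep a trace of any failing attempt, not only of a retry.
test.use({ trace: 'retain-on-failure' });

test('order confirmation appears', async ({ page }) => {
  await page.goto('/checkout');
  await page.getByRole('button', { name: 'Place order' }).click();
  await expect(page.getByRole('status')).toHaveText('Order placed');
});
```

Remove both overrides once the defect is fixed so the file returns to the shared retry budget.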

Symptom: traces are missing even though a test flaked. Diagnosis: trace is set to 'off', or it is set to 'on-first-retry' in a run where retries is 0, so the first-retry capture never triggers. Fix: confirm retries is at least 1 and trace: 'on-first-retry' is set; verify the run actually executed in CI where retries are enabled, since the local config uses retries: 0.

Symptom: the suite is slow because many tests retry. Diagnosis: a high baseline flake rate is multiplying wall-clock time. Fix: do not raise the retry count to “stabilize” — that hides more bugs. Instead route the worst offenders into a separate lane and budget the cost against feedback time as described in balancing speed and coverage in monorepo testing.
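
A separate lane can be expressed with Playwright projects that filter on a tag. In this sketch the tag @quarantine-candidate and the project names are hypothetical; the point is that the noisy lane keeps the same retry ceiling and only isolates the cost.

```typescript
// playwright.config.ts: split known-noisy tests into their own project
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      name: 'stable',
      grepInvert: /@quarantine-candidate/, // hypothetical tag; everything untagged runs here
      retries: process.env.CI ? 2 : 0,
    },
    {
      name: 'noisy-lane',
      grep: /@quarantine-candidate/,
      retries: process.env.CI ? 2 : 0, // same ceiling: the lane isolates cost, it does not loosen the budget
    },
  ],
});
```

Running npx playwright test --project=stable on the merge path and the noisy lane in a parallel or scheduled job keeps feedback fast while the tagged tests are being repaired.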

FAQ

Should I just set retries to zero to force every flake to be fixed?

Zero retries is the right default locally because it gives authors immediate feedback, but in CI a small budget of one or two is pragmatic. A shared pipeline runs on contended agents where rare, genuinely environmental blips are unavoidable, and zero retries would turn those into spurious red builds that erode trust as badly as flakes do. The discipline that keeps a CI retry budget safe is the rising-flake-count gate, not the absence of retries.

How is a “flaky” result different from a “passed” result in Playwright?

Playwright marks a test flaky when it failed at least once and then passed on a retry within the same run, whereas passed means it succeeded on the first attempt. The distinction is reported separately precisely so you can treat flaky as a warning rather than a success. Build your CI gate to read the flaky count and alert or fail when it grows, instead of collapsing flaky into passed.
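
The rule in that answer can be written as a small function over a test's attempts. This is a simplified model for illustration (it ignores skipped tests and expected failures), and the type and function names are made up for the sketch.

```typescript
type AttemptStatus = 'passed' | 'failed';
type Outcome = 'passed' | 'flaky' | 'failed';

// Classify one test from its attempts in run order (first run, then retries).
export function outcomeOf(attempts: AttemptStatus[]): Outcome {
  if (attempts.length === 0) throw new Error('a test needs at least one attempt');
  const last = attempts[attempts.length - 1];
  if (last === 'failed') return 'failed'; // retries exhausted without a pass
  // The final attempt passed: any earlier failure makes the result flaky.
  return attempts.some((a) => a === 'failed') ? 'flaky' : 'passed';
}

console.log(outcomeOf(['failed', 'passed']));
```

A gate that maps 'flaky' to a warning or a budgeted failure, rather than folding it into 'passed', is what keeps the distinction visible.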

Can trace-on-retry tell me whether the bug is in my test or my app?

Yes, and that is its primary value. The trace captures network requests, DOM snapshots, and console output aligned to each action, so you can see whether the failed assertion fired before a still-pending response settled (test-side timing) or whether the server returned an error or inconsistent data (product-side). That classification is what determines whether the correct fix is a web-first assertion or a bug ticket.

Does this approach work for Playwright component tests too?

The retry and trace configuration applies identically to component tests, since they run through the same test runner. The triage differs slightly because component tests isolate the UI from real backends, so most residual flakes are timing- or state-related rather than network races. See Playwright component testing for the component harness specifics that pair with these retry settings.