Quarantining Flaky Tests in CI

When a test fails intermittently and a fix is not immediately available, you face a bad trilemma: leave it in the merge gate and let it block unrelated work, delete it and lose coverage, or raise retries and risk hiding a real bug. Quarantine is the fourth option — move the unstable test into a separate, non-blocking lane where it keeps running and reporting, but can no longer veto a merge. This guide is for tech leads and platform engineers running Vitest and Playwright 1.4x in CI who need a quarantine workflow that is auditable rather than a black hole. We cover tagging and annotation, the separate CI job, a trend dashboard, and — most importantly — the exit criteria that bring a test back or retire it for good, so quarantine never becomes a place tests go to be forgotten.

Root Cause Analysis

Quarantine exists because the lifecycle of a flaky test rarely aligns with the lifecycle of a pull request. A nondeterministic failure surfaces on someone else’s unrelated change, blocks their merge, and the person best placed to fix it is not the person blocked. Without a containment mechanism, teams resolve this socially — re-running the build, merging through red, or muting the test in place — and every one of those habits degrades the signal of the whole suite. Once a green build is no longer believed, the suite stops being a gate and becomes theater.

The structural cause is that the merge gate and the flake-investigation timeline are coupled when they should be decoupled. Quarantine decouples them: the gate stays strict and fast, while flake remediation proceeds on its own schedule in a lane that reports without blocking. The risk to manage is that decoupling removes pressure, so quarantined tests stagnate and coverage silently erodes — which is why this workflow is built around ownership, visibility, and a hard exit criterion rather than just a tag. Deciding whether a given flaky test is even worth keeping draws on the same reasoning as the cost-benefit analysis of test layers: a high-value end-to-end check earns patient investigation, while a redundant one is better deleted than quarantined.

Reproducible Setup

Establish a tagging convention both runners can filter on, then split the pipeline.

npm install -D vitest @playwright/test

// playwright tag convention: append @flaky to the test title
import { test, expect } from '@playwright/test';

test('dashboard renders live totals @flaky', async ({ page }) => {
  await page.goto('/dashboard');
  await expect(page.getByTestId('total')).toBeVisible();
});

// vitest convention: a [quarantine] prefix, filtered by -t
import { test, expect } from 'vitest';
import { aggregate } from './aggregate'; // hypothetical module under test

test('[quarantine] aggregates streamed events', async () => {
  expect(await aggregate()).toBe(42);
});

// vitest.config.ts — emit machine-readable results for the dashboard
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    reporters: ['default', 'json'],
    outputFile: { json: './vitest-results.json' },
  },
});
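
The upload glob used later in the workflow (*-results.json) presupposes a JSON report from each runner, and the Vitest config above only covers one of them. A minimal sketch of the Playwright counterpart might look like this; the output file name playwright-results.json is an assumption chosen to match that glob, not something the pipeline prescribes.

```typescript
// playwright.config.ts (sketch; the output file name is an assumed convention)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  reporter: [
    ['list'], // human-readable output in the CI log
    ['json', { outputFile: 'playwright-results.json' }], // machine-readable output for the dashboard
  ],
});
```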

Implementation

Step 1 — Tag and annotate, do not comment out. A test.skip stops the test from executing, so it yields no pass/fail signal and shows up only as a skipped entry; a tag keeps it running where you can watch it. Use the title tag for Playwright and a title prefix for Vitest. Add a code comment linking the tracking issue so the reason and owner travel with the test.

// link the owner and ticket inline so quarantine is never anonymous
// QUARANTINE owner:@checkout-team issue:#4821 since:2026-06-21
test('checkout completes under load @flaky', async ({ page }) => {
  /* ... */
});
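
An inline annotation is only auditable if tooling can read it back. The following is a minimal sketch of a parser for the comment convention shown above, assuming the three fields always appear in the order owner, issue, since; the function and type names are hypothetical.

```typescript
// Parses the QUARANTINE comment convention from Step 1.
// Field names follow the example comment; everything else is illustrative.
type QuarantineNote = { owner: string; issue: string; since: string };

function parseQuarantineNote(line: string): QuarantineNote | null {
  const match = line.match(
    /QUARANTINE\s+owner:(\S+)\s+issue:(\S+)\s+since:(\d{4}-\d{2}-\d{2})/,
  );
  if (!match) return null; // not an annotation line
  const [, owner = '', issue = '', since = ''] = match;
  return { owner, issue, since };
}

const note = parseQuarantineNote(
  '// QUARANTINE owner:@checkout-team issue:#4821 since:2026-06-21',
);
console.log(note);
```

A scheduled job could run this over every test file and fail when a tagged test has no matching annotation, which keeps anonymous quarantine entries from slipping in.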

Step 2 — Split the pipeline into a strict lane and an informational lane. The trusted job excludes quarantined tests and remains a hard merge gate. The quarantine job runs only the tagged tests with continue-on-error, so it reports trends without ever blocking a merge.

# .github/workflows/test.yml
jobs:
  trusted:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
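      - run: npm ci
      - run: npx playwright install --with-deps   # npm ci does not download browsers
      - run: npx playwright test --grep-invert @flaky
      # the lookahead scans the whole name, so a describe title cannot hide the prefix
      - run: npx vitest run -t '^(?!.*\[quarantine\])'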

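  quarantine:
    runs-on: ubuntu-latest
    continue-on-error: true   # never blocks the merge
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx playwright install --with-deps   # npm ci does not download browsers
      - run: npx playwright test --grep @flaky --retries=2
      # if: always() keeps the remaining steps running after an earlier step fails,
      # which is exactly when the results are most needed
      - if: always()
        run: npx vitest run -t '\[quarantine\]'
      - if: always()
        uses: actions/upload-artifact@v4
        with:
          name: quarantine-results
          path: '*-results.json'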

Step 3 — Build a trend dashboard from the artifacts. A static count is not enough; you need the direction. Parse each run’s JSON, append the pass/fail/flaky counts to a history file, and chart it. Even a committed JSON ledger on a metrics branch is a start.

// scripts/quarantine-trend.ts
import { readFileSync, writeFileSync, existsSync } from 'node:fs';

type Snapshot = { date: string; total: number; failed: number; flaky: number };

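const results = JSON.parse(readFileSync('./vitest-results.json', 'utf8'));
// Count only quarantined tests; tests excluded by -t may still be listed as skipped.
const tests = (results.testResults?.flatMap((f: any) => f.assertionResults) ?? [])
  .filter((t: any) => String(t.fullName ?? '').includes('[quarantine]'));

const snapshot: Snapshot = {
  date: new Date().toISOString(),
  total: tests.length,
  failed: tests.filter((t: any) => t.status === 'failed').length,
  // Vitest's JSON reporter is not expected to emit a 'flaky' status, so this stays 0
  // for Vitest-only input; it becomes meaningful once Playwright results are merged in.
  flaky: tests.filter((t: any) => t.status === 'flaky').length,
};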

const history: Snapshot[] = existsSync('./quarantine-history.json')
  ? JSON.parse(readFileSync('./quarantine-history.json', 'utf8'))
  : [];
history.push(snapshot);
writeFileSync('./quarantine-history.json', JSON.stringify(history, null, 2));
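
The script above reads only the Vitest report. A rough sketch of the Playwright side could derive the same snapshot from the summary counts in Playwright's JSON report. The stats shape used here (expected, unexpected, flaky, skipped) is an assumption to verify against the report your Playwright version produces, and the function name is hypothetical.

```typescript
type Snapshot = { date: string; total: number; failed: number; flaky: number };

// Assumed shape of the top-level `stats` object in Playwright's JSON report.
type PlaywrightStats = {
  expected: number;
  unexpected: number;
  flaky: number;
  skipped: number;
};

function snapshotFromPlaywright(stats: PlaywrightStats, date: string): Snapshot {
  return {
    date,
    // Skipped tests did not run, so they are left out of the total.
    total: stats.expected + stats.unexpected + stats.flaky,
    failed: stats.unexpected,
    flaky: stats.flaky,
  };
}

// Made-up numbers standing in for report.stats
const example = snapshotFromPlaywright(
  { expected: 3, unexpected: 1, flaky: 2, skipped: 0 },
  '2026-06-21T00:00:00.000Z',
);
console.log(example);
```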

The dashboard answers the only two questions that matter: is the quarantine lane growing, and is each resident getting more or less stable over time?
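
The first of those questions can be answered from the history file alone. As a minimal sketch, assuming the Snapshot shape from the trend script, the function below compares the mean lane size of the most recent runs against the runs just before them. The window size and the function name are arbitrary illustrative choices.

```typescript
type Snapshot = { date: string; total: number; failed: number; flaky: number };
type Trend = 'growing' | 'shrinking' | 'flat' | 'insufficient-data';

// Compares the mean `total` of the latest `window` snapshots with the
// `window` snapshots immediately before them.
function laneTrend(history: Snapshot[], window = 5): Trend {
  if (history.length < window * 2) return 'insufficient-data';
  const mean = (xs: Snapshot[]) =>
    xs.reduce((sum, s) => sum + s.total, 0) / xs.length;
  const previous = mean(history.slice(-2 * window, -window));
  const latest = mean(history.slice(-window));
  if (latest > previous) return 'growing';
  if (latest < previous) return 'shrinking';
  return 'flat';
}

// Illustrative history: the lane grows from 4 to 6 residents.
const sampleHistory: Snapshot[] = [4, 4, 6, 6].map((total, i) => ({
  date: `2026-06-2${i}T00:00:00.000Z`,
  total,
  failed: 0,
  flaky: 0,
}));
console.log(laneTrend(sampleHistory, 2));
```

The second question, per-resident stability, needs per-test history rather than lane-wide counts, which this snapshot format does not record.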

Step 4 — Define and enforce exit criteria. A test leaves quarantine in exactly one of two ways. It is promoted when its root cause is fixed and it then passes a defined run of consecutive green executions — 50 is a common bar — proving stability. Or it is retired: if no owner has fixed it by its deadline, deleting it is more honest than pretending a never-trusted test guards anything. Encode the promotion check so it is mechanical, not a judgment call.

// scripts/promotion-gate.ts — must be 100% green over the last N runs to promote
import { readFileSync } from 'node:fs';

// Each history entry is a lane-wide snapshot, so this gate clears the whole lane at once.
const history = JSON.parse(readFileSync('./quarantine-history.json', 'utf8'));
const REQUIRED_GREEN = 50;
const recent = history.slice(-REQUIRED_GREEN);

const eligible =
  recent.length >= REQUIRED_GREEN &&
  recent.every((s: any) => s.failed === 0 && s.flaky === 0);

console.log(eligible ? 'PROMOTE: remove the tag' : 'HOLD: not yet stable');
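
Promotion is described per test, while the history file records lane-wide counts. A per-test variant might look like the sketch below, assuming a hypothetical ledger that stores each test's outcomes oldest first; that ledger format is not produced by the trend script above and would need its own collection step.

```typescript
// Hypothetical per-test ledger: one outcome per quarantine-lane run, oldest first.
type Outcome = 'passed' | 'failed' | 'flaky';
type TestHistory = Record<string, Outcome[]>;

// Number of consecutive 'passed' outcomes at the end of the list.
function greenStreak(outcomes: Outcome[]): number {
  let streak = 0;
  for (let i = outcomes.length - 1; i >= 0 && outcomes[i] === 'passed'; i--) {
    streak++;
  }
  return streak;
}

// Names of tests whose trailing green streak meets the bar.
function promotable(history: TestHistory, requiredGreen = 50): string[] {
  return Object.entries(history)
    .filter(([, outcomes]) => greenStreak(outcomes) >= requiredGreen)
    .map(([name]) => name);
}

const ledger: TestHistory = {
  'checkout completes under load @flaky': ['failed', 'passed', 'passed', 'passed'],
  'dashboard renders live totals @flaky': ['passed', 'passed', 'flaky', 'passed'],
};
// A bar of 3 keeps the example short; the guide's suggested bar is 50.
const ready = promotable(ledger, 3);
console.log(ready);
```

Counting only the trailing streak means a single red or flaky run resets the clock, which matches the "consecutive green" wording of the exit criterion.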

Step 5 — Keep the lanes honest with the rest of the suite. The trusted lane must still catch real regressions instantly, so do not let quarantine become a dumping ground for tests that are merely slow or poorly written. Pair this with the retry discipline in retrying flaky Playwright tests without masking bugs so a test enters quarantine only after triage confirms it is genuinely nondeterministic and not a disguised product defect.

Verification

Prove the gate behaves correctly by exercising both lanes deliberately. Open a pull request that breaks a trusted test and confirm the merge button is disabled — the strict lane must veto. Then open a PR that breaks a quarantined test and confirm the merge stays enabled while the quarantine job reports the failure in its summary. This pair of checks is the contract that makes quarantine safe: regressions still block, flakes never do.

# Local sanity check: confirm the filters partition the suite correctly
npx playwright test --grep-invert @flaky --list   # trusted set
npx playwright test --grep @flaky --list           # quarantine set

The lists must be disjoint and together cover every test. For the dashboard, run the trend script across several CI runs and confirm the history file accumulates snapshots; a flat or shrinking flaky count over time is the evidence that the workflow is actually retiring instability rather than hoarding it.
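
The disjoint-and-complete requirement can be checked mechanically rather than by eye. The sketch below assumes you have already extracted three lists of test names (the full suite, the trusted set, and the quarantine set) from the --list output; parsing that output is left out, and the function name is hypothetical.

```typescript
type PartitionReport = { overlap: string[]; missing: string[] };

// overlap: tests present in both lanes; missing: tests present in neither.
function checkPartition(
  all: string[],
  trusted: string[],
  quarantine: string[],
): PartitionReport {
  const trustedSet = new Set(trusted);
  const quarantineSet = new Set(quarantine);
  return {
    overlap: trusted.filter((name) => quarantineSet.has(name)),
    missing: all.filter(
      (name) => !trustedSet.has(name) && !quarantineSet.has(name),
    ),
  };
}

const allTests = ['login works', 'dashboard renders live totals @flaky'];
const report = checkPartition(
  allTests,
  ['login works'],
  ['dashboard renders live totals @flaky'],
);
console.log(report); // both arrays are empty when the filters partition the suite
```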

Troubleshooting

Symptom: quarantined tests never leave the lane. Diagnosis: no exit criterion is enforced, so there is no mechanical signal to promote or retire. Fix: run the promotion gate on a schedule and require every quarantine entry to carry an owner and a deadline; auto-retire entries past their deadline by deleting the test and closing the ticket.

Symptom: the quarantine job’s failure accidentally blocks merges. Diagnosis: continue-on-error: true is missing on the job, or a required-status-check rule lists the quarantine job. Fix: confirm the flag is set and remove the quarantine job from the branch protection required checks so only the trusted lane gates merges.

Symptom: a test passes in quarantine but the same flow is failing for users. Diagnosis: the higher retries in the quarantine lane are masking a genuine product bug. Fix: this is not a flake — drop it from quarantine, file a defect, and reproduce it deterministically at a lower tier. The judgment of whether a flow even belongs at the end-to-end tier should follow the test pyramid strategy.
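
For the first symptom, the deadline check is simple enough to automate. This is a minimal sketch assuming each entry carries the since date from the Step 1 annotation and that the deadline is a fixed maximum age; the 30-day default, the second team name, and the function name are illustrative assumptions, not values the workflow prescribes.

```typescript
type Entry = { test: string; owner: string; since: string }; // since: YYYY-MM-DD

const DAY_MS = 24 * 60 * 60 * 1000;

// Entries that have been in quarantine longer than maxDays as of `today`.
function overdue(entries: Entry[], today: Date, maxDays = 30): Entry[] {
  return entries.filter((e) => {
    const ageDays = (today.getTime() - Date.parse(e.since)) / DAY_MS;
    return ageDays > maxDays;
  });
}

const entries: Entry[] = [
  { test: 'checkout completes under load @flaky', owner: '@checkout-team', since: '2026-06-21' },
  { test: 'dashboard renders live totals @flaky', owner: '@dashboard-team', since: '2026-07-15' },
];
const late = overdue(entries, new Date('2026-07-25'));
console.log(late.map((e) => `${e.test} (${e.owner})`));
```

Running this on a schedule and posting the result to the owning team turns the deadline from a convention into a prompt someone actually sees.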

FAQ

How is quarantining different from just using test.skip?

test.skip stops the test from executing, so reports show it only as skipped: you lose all visibility into whether the underlying problem is getting better or worse, and you have no signal to ever re-enable it. Quarantine keeps the test running in a non-blocking lane, so it continues to generate pass/fail/flaky data you can chart and act on. The difference is the entire point: skip forgets, quarantine remembers.

Does this workflow apply to Vitest unit tests or only Playwright E2E?

It applies to both, and the mechanics are nearly identical — the only difference is the filter syntax, since Playwright matches title tags with --grep while Vitest matches test names with -t. Unit-level flakes are rarer and usually stem from shared state or unseeded randomness, which is often cheaper to fix outright than to quarantine. Reserve the lane for the genuinely hard-to-reproduce cases regardless of tier.

What’s a reasonable exit criterion for promoting a test back?

A common bar is 50 consecutive green runs in the quarantine lane after the root-cause fix lands, with zero flaky outcomes among them. The exact number matters less than making it mechanical and non-negotiable, so promotion is a measured event rather than someone optimistically removing a tag. Tie the gate to your trend history file so the check is automatic.

Won’t quarantine just hide problems like aggressive retries do?

It can, if you run it without ownership and exit criteria, which is exactly the failure mode this workflow is designed to prevent. The safeguards are that every quarantined test carries an owner and a deadline, the lane is dashboarded so a growing population is visible, and tests past their deadline are deleted rather than left to rot. Unlike a silent retry bump, quarantine keeps the failure in plain sight while it is contained.

Retrying flaky Playwright tests without masking bugs — the triage that decides whether a test belongs in quarantine at all.
Deterministic seeding for test data in Vitest — often the fix that earns a test its way out of quarantine.
Cost-Benefit Analysis of Test Layers — decide whether a flaky test is worth saving or better deleted.