<header>
<h1>Catch Flakes On Main</h1>
<time class="meta" datetime="2026-05-14">May 14, 2026</time>
</header>
<p>A small <a href="https://matklad.github.io/2025/12/06/mechanical-habits.html"><em>Mechanical Habit</em></a> today:</p>
<figure class="blockquote">
<blockquote><p>When using the not-rocket-science rule / a merge queue, continue to redundantly run the full test suite
on main. Maintain an easily accessible list of recent main failures: these are the flaky tests
to eradicate.</p>
</blockquote>
</figure>
<p>For an example, see the “Flakes” link on
<a href="https://devhub.tigerbeetle.com" class="display url">https://devhub.tigerbeetle.com</a></p>
<figure>
<img alt="" src="https://github.com/user-attachments/assets/09dffa9e-cdae-48e2-a4ed-80278da53bb2" width="2310" height="1746">
</figure>
<p>Flaky tests are tests that fail intermittently, say, once in a thousand runs. This might be due to a
genuine bug (assumptions about scheduling that <em>mostly</em> hold) or due to instability of the underlying
infrastructure (e.g., inability to download a release from GitHub, or to delete a folder on Windows).
In either case, flaky tests are a huge productivity drain: as the size and complexity of the test
suite grow, more and more CI runs fail spuriously, even as each individual test almost always
passes.</p>
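<p>To put rough numbers on that drain, here is a minimal sketch. It assumes, purely for illustration, that flaky tests fail independently of one another and that each fails once in a thousand runs; the function name is hypothetical.</p>

```python
def spurious_failure_rate(flaky_tests: int, per_test_failure_rate: float) -> float:
    """Chance that at least one flaky test fails in a single CI run.

    Assumes the flaky tests fail independently of one another, which is a
    simplification: real flakes caused by shared infrastructure tend to cluster.
    """
    return 1.0 - (1.0 - per_test_failure_rate) ** flaky_tests


if __name__ == "__main__":
    # Each test alone passes 99.9% of the time, yet the suite as a whole degrades.
    for count in (1, 10, 100, 1000):
        rate = spurious_failure_rate(count, 1 / 1000)
        print(f"{count:>5} flaky tests: {rate:.1%} of CI runs fail spuriously")
```

<p>Under these assumptions, a hundred such tests already sink roughly one CI run in ten, and a thousand sink well over half of them.</p>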
<p>Flaky tests are challenging to deal with — if you are working on landing a PR and your CI fails
due to an obvious flake, the temptation to just re-run the test suite is enormous, especially if
there’s a certain background dissatisfaction with infrastructure stability.</p>
<p>If you are of a mind to do some flake squashing, then your PRs will be green just to spite you! And
working off others’ PRs would first require separating flakes from genuine failures.</p>
<p>This is why the merge queue is powerful: if there’s a guarantee that every commit on the main branch
has already passed the tests before landing, then every failure of the redundant run on main is a
flake, by definition. Collecting all such failures into a single list compresses time, lets you
prioritize the most impactful sources of instability, and reveals correlations between failures.</p>
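<p>A minimal sketch of what such a list could compute. The <code>MainRun</code> record, the function names, and the sample test names are all hypothetical; this is not how devhub.tigerbeetle.com is implemented, only an illustration of ranking flakes by frequency and spotting tests that fail together.</p>

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class MainRun:
    """One redundant full-suite run on main (hypothetical record shape)."""
    commit: str
    failed_tests: tuple[str, ...] = ()  # empty when the run was green


def rank_flakes(runs: list[MainRun]) -> list[tuple[str, int]]:
    """Count how many main runs each test failed in, most frequent first."""
    counts = Counter(
        test for run in runs for test in sorted(set(run.failed_tests))
    )
    return counts.most_common()


def co_failures(runs: list[MainRun]) -> list[tuple[tuple[str, str], int]]:
    """Count pairs of tests that failed in the same run, hinting at a shared cause."""
    pairs: Counter = Counter()
    for run in runs:
        pairs.update(combinations(sorted(set(run.failed_tests)), 2))
    return pairs.most_common()


# Made-up sample data: every failure here is a flake, because each commit
# already passed the suite in the merge queue.
runs = [
    MainRun("a1", ("test_download_release",)),
    MainRun("b2"),
    MainRun("c3", ("test_download_release", "test_delete_dir_windows")),
    MainRun("d4", ("test_scheduling",)),
    MainRun("e5", ("test_download_release",)),
]

if __name__ == "__main__":
    for test, count in rank_flakes(runs):
        print(f"{count:>3}  {test}")
```

<p>The frequency ranking is what lets you prioritize, and the pair counts are one simple way to surface correlated failures.</p>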
<p>Source: <a href="https://matklad.github.io/2026/05/14/catch-flakes-on-main.html">https://matklad.github.io/2026/05/14/catch-flakes-on-main.html</a></p>