We've all been there. You push a commit, the pipeline goes green. You push the exact same commit again — red. No code changes, no config changes, just the CI gods deciding today isn't your day.
I spent the better part of last month chasing down intermittent CI failures across three different projects, and what I found was a pattern. These "random" failures are almost never random. They have root causes, and once you know where to look, they're surprisingly fixable.
The Three Usual Suspects
After years of debugging flaky pipelines, I've found that roughly 90% of intermittent CI failures fall into three buckets:
- Race conditions in tests — tests that depend on timing, ordering, or shared state
- Resource exhaustion — the runner ran out of memory, disk, or hit a CPU ceiling
- External dependency flakiness — a registry, API, or DNS lookup that occasionally times out
Let's dig into each one, because the debugging approach is different for all three.
Race Conditions in Tests
This is the big one. You write a test that passes locally every single time, but in CI it fails maybe 1 in 10 runs. Classic.
The problem is usually that your local machine is fast enough to paper over a timing issue. CI runners are shared, throttled, and generally slower — which exposes the race.
Here's a pattern I see constantly in integration tests:
```python
# BAD: assumes the async operation completes within some magic window
import time

def test_user_creation():
    create_user_async("testuser")
    time.sleep(2)  # "should be enough" — famous last words
    user = get_user("testuser")
    assert user is not None
```

The fix is to poll with a timeout instead of sleeping for a fixed duration:
```python
# GOOD: poll until ready, fail with a clear timeout
import time

def wait_for(predicate, timeout=10, interval=0.5):
    """Poll a condition instead of guessing how long to sleep."""
    start = time.time()
    while time.time() - start < timeout:
        if predicate():
            return True
        time.sleep(interval)
    raise TimeoutError(f"Condition not met within {timeout}s")

def test_user_creation():
    create_user_async("testuser")
    wait_for(lambda: get_user("testuser") is not None)
    user = get_user("testuser")
    assert user is not None
```

Another huge source of test flakiness: shared state between tests. If test A writes to a database and test B reads from it, you're at the mercy of execution order. Run your tests in randomized order locally (for pytest, the pytest-randomly plugin does this; most frameworks have an equivalent shuffle option). If things break, you found your problem.
Resource Exhaustion
This one is sneaky because the error messages are often misleading. Your test doesn't fail with "out of memory" — it fails with some cryptic segfault, a killed process, or a container that just... stops.
The first thing I do when I suspect resource issues is add monitoring to the pipeline itself:
```yaml
# GitHub Actions example: log resource usage at each step
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check available resources
        run: |
          echo "=== Memory ==="
          free -h
          echo "=== Disk ==="
          df -h
          echo "=== CPU ==="
          nproc
      - name: Run tests
        run: |
          # Run tests with a memory limit to catch leaks early
          # instead of letting the OOM killer surprise you
          ulimit -v 4000000  # ~4GB virtual memory cap
          make test
      - name: Post-test resources
        if: always()  # run even if tests fail
        run: |
          free -h
          df -h
```

Common culprits here:
- Docker layer caching gone wrong — your image cache fills the disk over time
- Test parallelism set too high — 8 parallel test workers on a 2-core runner is a recipe for OOM kills
- Log accumulation — tests that write verbose logs to disk without cleanup
A quick win: if you're running parallel tests, try cutting the parallelism in half. I know it feels wrong to make CI slower, but a pipeline that takes 8 minutes and passes is infinitely better than one that takes 5 minutes and fails 30% of the time.
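If you'd rather derive that worker count than hardcode it, computing it from the runner's actual core count keeps the heuristic valid when you move between runner sizes. A small sketch; `safe_worker_count` is my name for it, not a standard API:

```python
import os

def safe_worker_count(reserve_ratio=0.5, minimum=1):
    """Use roughly half the available cores for test workers,
    leaving headroom for the app under test and the OS itself."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(minimum, int(cpus * reserve_ratio))
```

On a 2-core runner this yields 1 worker; on an 8-core machine, 4. The output can be fed straight into whatever controls your parallelism, e.g. pytest-xdist's `-n` flag.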
External Dependency Failures
This is the one that makes me want to flip a table. Your pipeline fails because registry.npmjs.org had a 3-second hiccup, or because the Docker Hub rate limit kicked in, or because some DNS resolution was briefly flaky.
The fix here is layers of defense:
First, add retries to your package installation steps. Most package managers support this natively:

```yaml
# GitHub Actions with retry config for npm
- name: Install dependencies
  run: npm ci
  env:
    # Retry flaky registry fetches instead of failing the job
    NPM_CONFIG_FETCH_RETRIES: 5
    # Increase timeout for slow registries
    NPM_CONFIG_FETCH_TIMEOUT: 60000

# Or for pip
- name: Install Python deps
  run: pip install -r requirements.txt --retries 5 --timeout 60
```

Second, cache your dependencies. If you're not caching node_modules, pip packages, or Docker layers in CI, you're downloading the entire internet on every build. Every network call is a potential failure point.
Third, consider a pull-through cache or mirror for Docker images if you're hitting rate limits. Tools like Harbor or a simple registry mirror can save you from Docker Hub's pull limits.
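The same retry-with-backoff idea helps inside tests that genuinely must call an external service. Here's a minimal hand-rolled helper (not a library API), sketched under the assumption that the failure is transient and worth retrying:

```python
import random
import time

def retry(fn, attempts=3, base_delay=0.5, exceptions=(Exception,)):
    """Call fn, retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            # Exponential backoff with jitter, so parallel CI jobs don't
            # all hammer the recovering service in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay * random.uniform(0.5, 1.5))
```

Usage would look like `manifest = retry(fetch_manifest, attempts=5)`, where `fetch_manifest` is whatever flaky network call you're wrapping. Be deliberate about the `exceptions` tuple: retrying on an assertion error just hides a real bug three times before reporting it.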
The Debugging Workflow
When you hit a flaky failure and don't know which category it falls into, here's my go-to process:
```bash
# Run a single test 50 times to catch intermittent failures
for i in $(seq 1 50); do
  echo "Run $i"
  pytest tests/test_users.py::test_user_creation -x || { echo "FAILED on run $i"; exit 1; }
done
```

Prevention: Stop Flaky Tests Before They Merge
The best flaky test is the one that never reaches your main branch. A few things that have genuinely helped me:
- Quarantine known flaky tests — mark them, track them, and don't let them block deployments. Fix them, but don't let them hold the team hostage in the meantime.
- Run tests with randomized ordering in CI — catches hidden ordering dependencies before they become a problem.
- Set memory limits explicitly — don't rely on the runner's defaults. If your tests need 4GB, say so. If they're using more, that's a bug.
- Pin your dependencies — a floating version range in your manifest, or anything not captured by a lockfile, can pull in a new transitive dependency that changes behavior between runs.
- Track flaky test rates over time — even a simple script that counts how often each test fails over the last 100 runs will tell you where to focus.
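That last one really can be a few lines of code. A sketch of the counting step, assuming you can export each CI run's results as a mapping of test name to pass/fail (the `history` data below is made up for illustration):

```python
from collections import Counter

def flaky_rates(runs):
    """runs: list of dicts mapping test name -> bool (passed).
    Returns (test, failure_rate) pairs, worst offenders first."""
    failures = Counter()
    seen = Counter()
    for run in runs:
        for name, passed in run.items():
            seen[name] += 1
            if not passed:
                failures[name] += 1
    rates = {name: failures[name] / seen[name] for name in seen}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative history: test_user_creation failed 2 of 4 runs
history = [
    {"test_user_creation": False, "test_login": True},
    {"test_user_creation": True,  "test_login": True},
    {"test_user_creation": False, "test_login": True},
    {"test_user_creation": True,  "test_login": True},
]
```

With that history, `flaky_rates(history)` puts test_user_creation at the top with a 0.5 failure rate, which is your "where to focus" list for free.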
The Uncomfortable Truth
Here's what nobody wants to hear: most flaky CI pipelines are a symptom of flaky tests, and flaky tests are a symptom of code that's hard to test deterministically. The real fix is often to refactor the code under test, not just patch the test.
But that's a longer conversation, and sometimes you just need the pipeline to stop failing at 5 PM on a Friday. Start with the debugging workflow above, fix the immediate issue, and then — when you have time — look at whether the underlying code needs some love too.
Flaky pipelines are solvable. They're annoying, they're time-consuming, but they're not mysterious. Every "random" failure has a cause. You just have to be stubborn enough to find it.
