Seven principles for fixing a flaky E2E suite
A field report on rebuilding the E2E test automation around one feature of the PayPal Mobile App — what broke in the harness, what I changed, and the principles I'd reach for again.
The pipeline took 7–13 minutes to fail. The same build returned different results on consecutive runs. Engineers stopped trusting it; failures got re-run instead of investigated; "fixed" bugs kept coming back. A test suite that nobody listens to isn't a slow signal — it's no signal.
This wasn't a testing problem. It was an observability and feedback-loop problem. The suite was blind, slow, and mute: it couldn't say where it was, each debugging cycle burned minutes, and failures arrived as walls of noise instead of actionable reports. Once I framed it that way, the path forward became clear — and produced the seven principles below.
The same pipeline, before and after applying the seven principles.
Why it broke: two anti-patterns I had to kill first
Before explaining how I fixed it, it's worth naming what caused the mess. These two behaviors were quietly destroying the pipeline:
The timeout hammer
Every time a test got flaky, someone increased the timeout. It became the default fix for everything. The result: timeouts stacking on top of timeouts, single scenarios ballooning to 7–13 minutes, and the root cause never actually addressed.
Selector cascade chains
When a selector wasn't found, developers added fallback selectors — alternates that would sometimes show up. This felt like a fix, but it solved the problem at the wrong abstraction layer. It made tests brittle, hard to read, and completely masked the real problem: the test had no idea where in the app it actually was.
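In code, a cascade looked something like this. A minimal sketch with hypothetical selectors, assuming a WebdriverIO-style API:

```typescript
import { $ } from '@wdio/globals';

// Anti-pattern sketch: fallback selectors layered on top of each other.
// The selector strings are hypothetical; the shape is what matters.
async function findLoginButton() {
  const candidates = [
    '~login-button',                           // accessibility id
    '~btn-login',                              // older id that "sometimes shows up"
    '//XCUIElementTypeButton[@name="Log In"]', // XPath of last resort
  ];
  for (const selector of candidates) {
    const el = await $(selector);
    if (await el.isExisting()) return el;
  }
  // Still the only thing this can say: no idea which screen we are even on.
  throw new Error('cannot find selector');
}
```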
Both patterns shared a root cause — the suite had no way to explain what was happening, so engineers patched around the symptoms. The fix required giving it eyes and a voice.
Act 1 — Speed the loop
Tighten the feedback loop. This is the #1 priority.
Before anything else, cycle time is the enemy. When you're debugging a flaky pipeline, you're going to run it dozens of times. The math makes this non-negotiable.
Let's assume it takes 50 cycles to reach a stable solution.
Stack that up. At the baseline, 150 seconds × 50 cycles is 125 minutes — roughly two hours of waiting on the same problem. With preflight checks, that same loop is 7 seconds × 50 cycles, or 5 minutes 50 seconds. Same number of debugging attempts, an order of magnitude less time staring at the screen.
Per failure, that's three steps down: 2.5 minutes → 25 seconds → 7 seconds. Multiplied across 50 cycles, that's two hours collapsing to under six minutes.
Two hours versus six minutes. The speed of your loop determines how fast you can learn.
I attacked this in two moves:
Fail fast. I removed retries on failure classes that a retry could never fix. If the simulator isn't found, retrying three times doesn't help. Time to failure dropped from 2.5 minutes → 25 seconds.
Add preflight checks. I moved known failure conditions to the very start of the run — things like "simulator not found" or "app not installed." If the run is going to fail, it fails immediately. Time to failure: 25 seconds → 7 seconds.
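A sketch of what that preflight phase can look like. The specific checks, commands, and paths here are illustrative, not the actual harness:

```typescript
import { execSync } from 'node:child_process';
import { existsSync } from 'node:fs';

// Cheap, known-fatal checks that run before anything expensive boots.
// The exact commands and paths are assumptions (iOS simulator, local app bundle).
const preflights = [
  {
    name: 'simulator booted',
    hint: 'boot the simulator before starting the run',
    check: () => execSync('xcrun simctl list devices booted').toString().includes('(Booted)'),
  },
  {
    name: 'app bundle present',
    hint: 're-run the build/install step',
    check: () => existsSync(process.env.APP_PATH ?? './build/App.app'),
  },
];

export function runPreflights(): void {
  for (const p of preflights) {
    if (!p.check()) {
      // Fail in seconds, not minutes. No retries: a retry cannot fix a broken environment.
      throw new Error(`Preflight failed: ${p.name}. Hint: ${p.hint}`);
    }
  }
}
```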
That single principle — minimize cycle time — unlocked everything else.
Act 2 — Make the system legible
Once the loop was fast enough to iterate, I needed to actually understand what I was looking at.
Break runs into clearly marked phases
You can't categorize and solve systemic problems if you have no sense of structure. Different phases have different failure profiles. I drew hard lines:
- Preflight checks — fails in 7s if env is broken
- Hardware / env setup
- Test setup (users, login, session)
- Test execution
- Teardown
When something fails, you immediately know where it failed. That matters enormously for triage.
Heavily simplified — a real run is ~10× longer.
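In code, the hard lines can be as simple as a phase wrapper. A minimal sketch, not the actual harness:

```typescript
// Every run is a fixed sequence of named phases with hard boundaries.
// A failure carries the phase it happened in, which is what makes triage fast.
type Phase = 'preflight' | 'env-setup' | 'test-setup' | 'execution' | 'teardown';

async function runPhase<T>(phase: Phase, body: () => Promise<T>): Promise<T> {
  const start = Date.now();
  console.log(`[${new Date().toISOString()}] >> ${phase}`);
  try {
    return await body();
  } catch (err) {
    // Tag the error with the phase so the report can say where it broke, not just what broke.
    throw new Error(`[${phase}] ${(err as Error).message}`);
  } finally {
    console.log(`[${new Date().toISOString()}] << ${phase} (${Date.now() - start} ms)`);
  }
}
```

A run then becomes five runPhase calls in sequence, and every failure already carries the phase it belongs to.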
Invest heavily in reporting. Timestamps everywhere.
When I arrived, every run produced a wall of undifferentiated logs. There was no way to correlate what I saw on the simulator with what the logs said.
I wired in live step reporting — timestamped, human-readable logs that narrate exactly what's happening on screen in real time. This solved two problems: debugging live runs, and reconstructing what happened after a failure. Bonus: it's also clean input for an LLM to reason about.
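The reporter itself doesn't have to be elaborate. A sketch of the idea, with the output format as my own assumption:

```typescript
// Narrates what the test is doing, in real time, with timestamps.
// The same records later become the TIMELINE summary.
const steps: { name: string; startedAt: number; durationMs: number }[] = [];

export async function step<T>(name: string, body: () => Promise<T>): Promise<T> {
  const startedAt = Date.now();
  console.log(`[${new Date(startedAt).toISOString()}] STEP ${name}`);
  try {
    return await body();
  } finally {
    steps.push({ name, startedAt, durationMs: Date.now() - startedAt });
  }
}

// Printed at the end of every run, and on failure: the TIMELINE block.
export function printTimeline(): void {
  console.log('TIMELINE');
  for (const s of steps) {
    console.log(`  ${new Date(s.startedAt).toISOString()}  ${String(s.durationMs).padStart(7)} ms  ${s.name}`);
  }
}
```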
The TIMELINE block is the headline. At a glance you can see which stage burned the time, which stage failed, and how the rest compared. That's a TIMELINE doing the job a wall of [CONFIG] lines never could.
Build the right primitives — the "where am I" assertion
The most common failure message in native mobile testing is: cannot find selector.
That error is useless. It's like getting a call from someone who is blindfolded saying, "I can't find the door handle". You don't know what room they're in, let alone whether they're in the right house. Helping them becomes almost impossible without that extra context.
I fixed this by building a page/view registry:
- Every screen in the app is registered.
- After every navigation event (button tap, back swipe, deep link), assert: "Am I on the page I'm supposed to be on?"
- On every test failure, print full "where am I" context.
Now when a test fails looking for a login button, I know immediately whether it's on the login screen, an error page, or somewhere completely unexpected. cannot find selector becomes expected: LoginScreen, actual: ErrorPage — which is actually actionable.
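Here's a minimal sketch of the registry and the assertion. The screen names and marker selectors are hypothetical, and the real registry covers every screen in the app:

```typescript
import { $ } from '@wdio/globals';

// Every screen registers one marker selector that uniquely identifies it.
// "Where am I" = find which registered screen is currently visible.
const screens = {
  LoginScreen: '~login-screen',
  HomeScreen:  '~home-screen',
  ErrorPage:   '~error-banner',
} as const;

type ScreenName = keyof typeof screens;

async function currentScreen(): Promise<ScreenName | 'Unknown'> {
  for (const [name, marker] of Object.entries(screens)) {
    if (await $(marker).isDisplayed()) return name as ScreenName;
  }
  return 'Unknown';
}

// Called after every navigation event: button tap, back swipe, deep link.
export async function assertOnScreen(expected: ScreenName): Promise<void> {
  const actual = await currentScreen();
  if (actual !== expected) {
    throw new Error(`Wrong screen. expected: ${expected}, actual: ${actual}`);
  }
}
```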
This also made the selector cascade anti-pattern unnecessary. You don't need fallback selectors if you know exactly where you are before you reach for one.
Expected/actual page, hint, and nav path collapse triage from 'find the cause' to 'check the auth flow'.
Act 3 — Make failures actionable, track everything
Now that the system is legible, failures need to mean something.
Failures should light a path to resolution
A failure report isn't done when it tells you what broke. It's done when it tells you where to look next.
In an LLM-assisted workflow, that means:
- An "investigate this" prompt the developer can paste directly into an LLM.
- Logs, screenshots, and error messages attached as context.
- A full view-hierarchy dump (the mobile equivalent of a DOM dump) at point of failure.
- Network requests and payloads, so an LLM can identify whether the failure was a frontend issue or a backend response problem.
The failure becomes a briefing, not a mystery.
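Assembling that briefing can be mostly mechanical. A sketch assuming a WebdriverIO-style driver; the network capture and the prompt wording are placeholders:

```typescript
import { browser } from '@wdio/globals';

// Everything an engineer (or an LLM) needs to start investigating, captured at the moment of failure.
interface FailureBriefing {
  error: string;
  screenshotPath: string;
  viewHierarchy: string;     // the mobile equivalent of a DOM dump
  recentRequests: unknown[]; // network traffic, if a capturing proxy is in place
  investigatePrompt: string; // paste-ready prompt for an LLM
}

export async function buildBriefing(err: Error): Promise<FailureBriefing> {
  const screenshotPath = `./artifacts/failure-${Date.now()}.png`;
  await browser.saveScreenshot(screenshotPath);
  const viewHierarchy = await browser.getPageSource();
  return {
    error: err.message,
    screenshotPath,
    viewHierarchy,
    recentRequests: [], // assumption: populated by whatever request capture the harness runs
    investigatePrompt:
      `A mobile E2E test failed with "${err.message}". ` +
      `Using the attached view hierarchy, screenshot, and network log, determine whether this is a ` +
      `frontend issue, a backend response problem, or a test-harness issue, and say where to look next.`,
  };
}
```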
Name and eradicate anti-patterns actively
Once the system is legible, the damaging patterns become visible. Name them. Block them in code review. I'm formalizing explicit team rules now:
- No timeout hammer. Increasing a timeout is not a bug fix — it's a smell. PRs that bump a timeout without an attached root cause require justification.
- No selector cascades. Fallback-selector chains are banned. If a selector is unstable, fix the page registry, not the call site.
The goal: bake these into how the team talks about test quality, not just what reviewers look for in a diff.
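The rules can also get a little tooling behind them. A rough sketch of a CI diff check that flags both patterns for extra review; the regexes are heuristics of mine, not the team's actual gate:

```typescript
// Scans a PR diff and flags the two banned patterns for human attention.
// It does not block merges on its own; it forces the conversation.
const bannedPatterns: { name: string; regex: RegExp }[] = [
  { name: 'timeout hammer',   regex: /^\+.*timeout:\s*\d{5,}/m },                       // a newly added 10s+ timeout
  { name: 'selector cascade', regex: /^\+.*(fallbackSelectors|selectorCandidates)/m },  // illustrative names
];

export function reviewDiff(diff: string): string[] {
  return bannedPatterns
    .filter(p => p.regex.test(diff))
    .map(p => `Possible ${p.name}: attach a root cause or fix the page registry instead.`);
}
```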
Keep a ledger
Before this work, I had no way to track failure trends over time. No categories, no stage tagging, no history.
Now every run is logged with stage, category, reason, and links to logs and screenshots. That turns anecdotal "it feels flaky" into "68% of failures happen in test setup, and half of those are auth-related." The difference between guessing and diagnosing.
Scenario: checkout-flow happy-path — captured during one debugging session.
Nine of ten runs failed. Three failure categories all surfaced the same root cause (auth/session in test setup) — invisible in the old wall-of-logs world.
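The ledger itself can stay simple. A sketch of the record shape and an append-only log; the field names are mine, not the actual schema:

```typescript
import { appendFileSync } from 'node:fs';

// One record per run: enough structure to answer "what fails, where, and how often".
interface RunRecord {
  runId: string;
  scenario: string;
  result: 'passed' | 'failed';
  failedStage?: 'preflight' | 'env-setup' | 'test-setup' | 'execution' | 'teardown';
  category?: string;      // e.g. "auth/session", "selector", "backend-5xx"
  reason?: string;        // one-line, human-readable cause
  logsUrl?: string;
  screenshotUrl?: string;
  timestamp: string;
}

export function recordRun(record: RunRecord): void {
  // Append-only JSONL: trivially greppable, trivially chartable.
  appendFileSync('./artifacts/ledger.jsonl', JSON.stringify(record) + '\n');
}
```

A claim like "68% of failures happen in test setup" then comes from one pass over a JSONL file instead of an impression.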
The future: self-healing pipelines
Here's what becomes possible once failures are structured, contextualized, and LLM-readable: an agent that scans failed jobs and proposes fixes.
Not autonomously. The guardrails really matter here.
The incentives have to stay aligned.
An AI that makes tests green by deleting the assertions that catch real bugs is worse than no AI at all — it's an actively dangerous pipeline that projects false confidence.
The right model:
- Agent proposes fixes via PR. Human approves.
- Known failure patterns get auto-remediation unlocked over time.
- Novel failure signatures get flagged for human review.
The foundation I built — phases, structured logs, page registry, contextual error reports — makes this possible. Without it, you're feeding an LLM noise. With it, you're giving it a clear brief.
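The dispatch logic that keeps those guardrails in place stays deliberately boring. A sketch with illustrative signature names and thresholds:

```typescript
// Known, well-understood failure signatures can earn auto-remediation over time.
// Everything else goes to a human, and nothing merges without review.
type Disposition = 'auto-remediate-via-pr' | 'propose-fix-via-pr' | 'flag-for-human';

const knownSignatures = new Set(['simulator-not-found', 'stale-test-user-session']); // illustrative

export function dispatch(signature: string, confidence: number): Disposition {
  if (knownSignatures.has(signature) && confidence > 0.9) return 'auto-remediate-via-pr';
  if (confidence > 0.5) return 'propose-fix-via-pr';
  return 'flag-for-human';
}
```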
Before / After: a real run
Both runs heavily redacted. The shape is what matters: undifferentiated noise vs. phased, named, contextual output.
The Before column above isn't a strawman. It's a real day, working a real flake — the wall of [CONFIG] and INFO @wdio lines hiding the one thing I actually needed. The After is the same scenario, same failure, after seven principles' worth of work.
Conclusion
The headline number — 13 minutes to 7 seconds — is real, and it's the wrong thing to focus on. Faster tests didn't fix the pipeline. They were a side effect of the actual fix: a system I could see into.
That's the lesson worth keeping. When you can't fix something, it's usually because you can't see it. The cycle of "increase the timeout, add a fallback selector, hope for the best" is what happens when a system has no way to tell you what it's doing. Give it phases, timestamps, page context, structured failure reports — and the same problems that felt unfixable last quarter become tractable engineering work this quarter.
A pipeline that communicates is one you can actually trust. Build that first; the speed and the AI assistance will follow.