Seven principles for fixing a flaky E2E suite

April 28, 2026


A field report on rebuilding the E2E test automation around one feature of the PayPal Mobile App — what broke in the harness, what I changed, and the principles I'd reach for again.

The pipeline took 7–13 minutes to fail. The same build returned different results on consecutive runs. Engineers stopped trusting it; failures got re-run instead of investigated; "fixed" bugs kept coming back. A test suite that nobody listens to isn't a slow signal — it's no signal.

This wasn't a testing problem. It was an observability and feedback-loop problem. The suite was blind, slow, and mute: it couldn't say where it was, each debugging cycle burned minutes, and failures arrived as walls of noise instead of actionable reports. Once I framed it that way, the path forward became clear — and produced the seven principles below.

Before / After

The same pipeline, before and after applying the seven principles.

Metric                         Before                   After
Time to failure (worst case)   7–13 min                 7s (preflight)
Cycle time for debugging       ~150s                    ~7s
Failure signal quality         Wall of logs             Staged · timestamped · contextual
Failure actionability          "Cannot find selector"   Page context + LLM prompt
Trend tracking                 None                     Per-stage ledger with full history

Why it broke: two anti-patterns I had to kill first

Before explaining how we fixed it, it's worth naming what caused the mess. These two behaviors were quietly destroying the pipeline:

The timeout hammer

Every time a test got flaky, someone increased the timeout. It became the default fix for everything. The result: timeouts stacking on top of timeouts, single scenarios ballooning to 7–13 minutes, and the root cause never actually addressed.

Selector cascade chains

When a selector wasn't found, developers added fallback selectors — alternates that would sometimes show up. This felt like a fix but solved at the wrong abstraction layer. It made tests brittle, hard to read, and completely masked the real problem: the test had no idea where in the app it actually was.

Both patterns shared a root cause — the suite had no way to explain what was happening, so engineers patched around the symptoms. The fix required giving it eyes and a voice.


Act 1 — Speed the loop

Principle 01

Tighten the feedback loop. This is the #1 priority.

Before anything else, cycle time is the enemy. When you're debugging a flaky pipeline, you're going to run it dozens of times. The math makes this non-negotiable.

Let's assume it takes 50 cycles to reach a stable solution.

Each run shows the time until failure after one more lever was applied:

  01 · 2 min 30s · Baseline
  02 · 25s · + Fast fail (no retries · fast failures)
  03 · 7s · + Preflight checks (preflight on guaranteed failures)

Stack that up. At the baseline, 150 seconds × 50 cycles is 125 minutes — roughly two hours of waiting on the same problem. With preflight checks, that same loop is 7 seconds × 50 cycles, or 5 minutes 50 seconds. Same number of debugging attempts, an order of magnitude less time staring at the screen.

Why cycle time compounds (log scale · cumulative time)

  Runs   Before        After
  ×1     2 min 30s     7s
  ×10    25 min        1 min 10s
  ×25    1 hr 3 min    2 min 55s
  ×50    2 hr 5 min    5 min 50s

Per failure, that's three steps down: 2 min 30s → 25s → 7s. Multiplied across 50 cycles, two hours collapse to under six minutes.

Two hours versus six minutes. The speed of your loop determines how fast you can learn.

I attacked this in two moves:

Fail fast. I removed retries on failure classes that a retry could never fix. If the simulator isn't found, retrying three times doesn't help. Time to failure dropped from 2.5 minutes to 25 seconds.

Add preflight checks. I moved known failure conditions to the very start of the run — things like "simulator not found" or "app not installed." If the run is going to fail, it fails immediately. Time to failure: 25 seconds → 7 seconds.
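As a minimal sketch of what that preflight gate might look like (the check names and the `runPreflight` helper are illustrative, not the real harness API): each check is a cheap probe that either passes or aborts the entire run immediately, with no retries.

```typescript
// Illustrative preflight runner. Each check is a fast probe for a failure
// condition that a retry could never fix (missing simulator, app not
// installed). The first failing check aborts the whole run.
type Preflight = { name: string; check: () => boolean };

function runPreflight(checks: Preflight[]): string[] {
  const passed: string[] = [];
  for (const { name, check } of checks) {
    let ok = false;
    try {
      ok = check();
    } catch {
      ok = false; // a throwing check counts as a failed check
    }
    if (!ok) {
      // Fail immediately: a missing simulator will not appear on attempt three.
      throw new Error(`[preflight] ${name} failed — aborting run`);
    }
    passed.push(name);
  }
  return passed;
}
```

The design choice that matters is the early `throw`: a guaranteed failure surfaces in seconds, before any expensive setup runs.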

That single principle — minimize cycle time — unlocked everything else.


Act 2 — Make the system legible

Once the loop was fast enough to iterate, I needed to actually understand what I was looking at.

Principle 02

Break runs into clearly marked phases

You can't categorize and solve systemic problems if you have no sense of structure. Different phases have different failure profiles. I drew hard lines:

  1. Preflight checks — fails in 7s if env is broken
  2. Hardware / env setup
  3. Test setup (users, login, session)
  4. Test execution
  5. Teardown

When something fails, you immediately know where it failed. That matters enormously for triage.
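A sketch of how a stage boundary can be enforced in code (the `runStage` helper and `StageResult` shape are my illustration, not the harness itself): every phase records its own duration, and a failure is re-thrown with the stage name attached so triage starts from "where", not "what log line".

```typescript
// Illustrative stage wrapper: wraps a phase body, times it, and tags any
// failure with the stage it happened in.
type StageResult = { stage: string; ok: boolean; ms: number };

function runStage(stage: string, body: () => void, results: StageResult[]): void {
  const start = Date.now();
  try {
    body();
    results.push({ stage, ok: true, ms: Date.now() - start });
  } catch (err) {
    results.push({ stage, ok: false, ms: Date.now() - start });
    // Re-throw with the stage name attached so the report can say
    // "FAIL @ execution" instead of just "cannot find selector".
    throw new Error(`FAIL @ ${stage}: ${(err as Error).message}`);
  }
}
```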

Heavily simplified — a real run is ~10× longer.

Before:

  ❯ ./run-e2e.sh
  starting tests...
  INFO: launching simulator
  WARN: retry 1/3
  WARN: retry 2/3
  ERROR: cannot find selector
  FAIL (after 7m 41s)

After:

  ❯ ./run-e2e.sh
  [00:00] [preflight]  ✓ simulator available
  [00:01] [preflight]  ✓ app installed
  [00:02] [env setup]  ✓ device booted
  [00:08] [test setup] ✓ test user provisioned
  [00:14] [test setup] ✓ session authenticated
  [00:15] [execution]  → tap "Send"
  [00:16] [execution]  ✗ expected: SendScreen, actual: ErrorPage
  FAIL @ execution (16s)
Principle 03

Invest heavily in reporting. Timestamps everywhere.

When I arrived, every run produced a wall of undifferentiated logs. There was no way to correlate what I saw on the simulator with what the logs said.

I wired in live step reporting — timestamped, human-readable logs that narrate exactly what's happening on screen in real time. This solved two problems: debugging live runs, and reconstructing what happened after a failure. Bonus: it's also clean input for an LLM to reason about.
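The core of such a reporter is tiny. A minimal sketch, with the `StepReporter` name and formatting being my illustration: elapsed time since run start is printed as `[mm:ss]` so the logs can be lined up against simulator video.

```typescript
// Illustrative timestamped step logger. Every step prints the elapsed time
// since the run started, the phase it belongs to, and a human-readable
// message, and keeps a transcript for post-failure reconstruction.
class StepReporter {
  private readonly t0 = Date.now();
  private readonly lines: string[] = [];

  step(phase: string, message: string): string {
    const s = Math.floor((Date.now() - this.t0) / 1000);
    const mm = String(Math.floor(s / 60)).padStart(2, "0");
    const ss = String(s % 60).padStart(2, "0");
    const line = `[${mm}:${ss}] [${phase}] ${message}`;
    this.lines.push(line);
    console.log(line); // narrate live, for watching alongside the simulator
    return line;
  }

  // Full transcript, ready to attach to a failure report or an LLM prompt.
  transcript(): string[] {
    return [...this.lines];
  }
}
```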

The TIMELINE block is the headline: which stage burned the time, which one failed.

Before:

  ❯ npm test
  [DEBUG] TAG_EXPRESSION: <test-id>
  [DEBUG] isNoAppTest: false
  [CONFIG] Running tests for platform: ios on simulator
  [CONFIG] BrowserStack mode: DISABLED
  [CONFIG] Release target: all
  [CONFIG] Video recording: ENABLED
  [CONFIG] Restart emulator per scenario: DISABLED
  INFO @wdio/cli:launcher: Run onPrepare hook
  WARN @wdio/config:ConfigParser: pattern did not match
  WARN @wdio/config:ConfigParser: pattern did not match
  [0-0] [DEBUG] TAG_EXPRESSION: <test-id>
  [0-0] [DEBUG] isNoAppTest: false
  [0-0] [CONFIG] Running tests for platform: ios on simulator
  [0-0] [CONFIG] BrowserStack mode: DISABLED
  [0-0] [CONFIG] Release target: all
  [0-0] [CONFIG] Video recording: ENABLED
  […] (~280 more lines of similar config / init noise)
  FAIL "the user is redirected"
  Error: cannot find element

After:

  ❯ npm test
  ✓ preflight (8.2s)
  ✓ SETUP (32.7s)
  ✓ APP-ENV (47.6s)
  ✓ LOGIN (5.9s)
  ✓ MERCHANT-CONFIG (30.3s)
  ✗ TEST (41.7s) app switch failed

  ━━━━━━━━━━━━━━ TIMELINE ━━━━━━━━━━━━━━
  SETUP            32.7s ██████
  APP-ENV          47.6s █████████
  LOGIN             5.9s █
  MERCHANT-CONFIG  30.3s ██████
  TEST             41.7s ████████
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  FAIL @ TEST · run-id <run-id>

At a glance: which stage burned the time, which stage failed, and how the rest compared. That's a TIMELINE doing the job a wall of [CONFIG] lines never could.

Principle 04

Build the right primitives — the "where am I" assertion

The most common failure message in native mobile testing is: cannot find selector.

That error is useless. It's like getting a call from someone blindfolded saying, "I can't find the door handle." You don't know what room they're in, let alone whether they're in the right house. Helping them is almost impossible without that extra context.

I fixed this by building a page/view registry:

  • Every screen in the app is registered.
  • After every navigation event (button tap, back swipe, deep link), assert: "Am I on the page I'm supposed to be on?"
  • On every test failure, print full "where am I" context.

Now when a test fails looking for a login button, I know immediately whether it's on the login screen, an error page, or somewhere completely unexpected. cannot find selector becomes expected: LoginScreen, actual: ErrorPage — which is actually actionable.

This also made the selector cascade anti-pattern unnecessary. You don't need fallback selectors if you know exactly where you are before you reach for one.
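A sketch of the registry idea in TypeScript (the `PageRegistry` class, marker selectors, and the `isVisible` probe are illustrative stand-ins for the real driver queries): each screen registers a marker that uniquely identifies it, `identify()` answers "where am I?", and `assertOn()` turns a missing selector into an expected/actual mismatch.

```typescript
// Illustrative page/view registry. Each screen is registered with a marker
// selector that uniquely identifies it; isVisible is a stand-in for the
// real driver's visibility query.
type Screen = { name: string; marker: string };

class PageRegistry {
  private screens: Screen[] = [];

  register(name: string, marker: string): void {
    this.screens.push({ name, marker });
  }

  // "Where am I?" — the first registered screen whose marker is visible.
  identify(isVisible: (marker: string) => boolean): string {
    const hit = this.screens.find(s => isVisible(s.marker));
    return hit ? hit.name : "UnknownScreen";
  }

  // Run after every navigation event: fail with expected/actual context
  // instead of "cannot find selector".
  assertOn(expected: string, isVisible: (marker: string) => boolean): void {
    const actual = this.identify(isVisible);
    if (actual !== expected) {
      throw new Error(`PageMismatch: expected: ${expected}, actual: ${actual}`);
    }
  }
}
```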

Expected/actual page, hint, and nav path collapse triage from 'find the cause' to 'check the auth flow'.

Before:

  ❯ test "send money"   run #142
  […]
  AssertionError: cannot find selector "send-button"
  timeout: 30000ms
  ¯\_(ツ)_/¯

After:

  ❯ test "send money"   run #143
  […]
  PageMismatch:
    expected: SendMoneyScreen
    actual:   SessionExpiredErrorPage
  hint: auth flow likely failed in test setup
  context: /screens/send → /errors/session_expired

Act 3 — Make failures actionable, track everything

Now that the system is legible, failures need to mean something.

Principle 05

Failures should light a path to resolution

A failure report isn't done when it tells you what broke. It's done when it tells you where to look next.

In an LLM-assisted workflow, that means:

  • An "investigate this" prompt the developer can paste directly into an LLM.
  • Logs, screenshots, and error messages attached as context.
  • A full view-hierarchy dump (the mobile equivalent of a DOM dump) at point of failure.
  • Network requests and payloads, so an LLM can identify whether the failure was a frontend issue or a backend response problem.

The failure becomes a briefing, not a mystery.
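A minimal sketch of assembling that briefing (the `Failure` shape, field names, and attachment paths are illustrative): the report generator takes the structured failure context and emits a prompt a developer can paste straight into an LLM.

```typescript
// Illustrative briefing generator: structured failure context in,
// paste-ready LLM prompt out.
type Failure = {
  test: string;
  phase: string;
  step: string;
  expected: string;
  actual: string;
  attachments: string[]; // logs, screenshots, view hierarchy, HAR
};

function briefing(f: Failure): string {
  return [
    "Investigate this E2E failure.",
    `Test: ${f.test}`,
    `Failed at: [${f.phase}] ${f.step}`,
    `Expected: ${f.expected}`,
    `Actual: ${f.actual}`,
    "Attached:",
    ...f.attachments.map(a => `- ${a}`),
    "What's the most likely cause, and what would you check next?",
  ].join("\n");
}
```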

auto-generated · paste into LLM

  Investigate this E2E failure.

  Test: send_money_happy_path
  Failed at: [execution] step 4 — tap "Send"
  Expected: SendMoneyScreen
  Actual: SessionExpiredErrorPage
  Phase: execution

  Attached:
  - logs/run-1283.log (timestamped, per-stage)
  - screenshots/run-1283/before-tap.png
  - screenshots/run-1283/after-tap.png
  - view-hierarchy/run-1283.json
  - network/run-1283.har

  Hypothesis surface area:
  - test setup auth (token expiry?)
  - backend session refresh
  - frontend nav guard

  What's the most likely cause, and what would you check next?
Principle 06

Name and eradicate anti-patterns actively

Once the system is legible, the damaging patterns become visible. Name them. Block them in code review. I'm formalizing explicit team rules now:

  • No timeout hammer. Increasing a timeout is not a bug fix — it's a smell. PRs that bump a timeout must attach a root-cause analysis or an explicit justification.
  • No selector cascades. Fallback-selector chains are banned. If a selector is unstable, fix the page registry, not the call site.

The goal: bake these into how the team talks about test quality, not just what reviewers look for in a diff.

Principle 07

Keep a ledger

Before this work, I had no way to track failure trends over time. No categories, no stage tagging, no history.

Now every run is logged with stage, category, reason, and links to logs and screenshots. That turns anecdotal "it feels flaky" into "68% of failures happen in test setup, and half of those are auth-related." The difference between guessing and diagnosing.
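A sketch of the rollup that makes that sentence possible (the `RunRecord` shape and `failureShare` helper are illustrative): group failures by stage and report each stage's share of all failures.

```typescript
// Illustrative ledger rollup: one structured record per run, aggregated
// into "what share of failures happened in each stage".
type RunRecord = { runId: string; stage: string; category: string; pass: boolean };

function failureShare(ledger: RunRecord[]): Map<string, number> {
  const fails = ledger.filter(r => !r.pass);
  const byStage = new Map<string, number>();
  for (const r of fails) {
    byStage.set(r.stage, (byStage.get(r.stage) ?? 0) + 1);
  }
  // Convert raw counts into a fraction of all failures.
  byStage.forEach((n, stage) => byStage.set(stage, n / fails.length));
  return byStage;
}
```

The same grouping by `category` is what turns "it feels flaky" into "most failures are auth-related", and it only works because every run writes a structured record.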

Run history · ten most recent

Scenario: checkout-flow happy-path — captured during one debugging session.

   #   Date          Dur     Status  Build    Scenario                  Failure
   1   Apr 28 20:37  2m 32s  PASS    9.4.444  checkout-flow happy-path
   2   Apr 28 20:20  2m 42s  FAIL    9.4.444  checkout-flow happy-path  Wrong Screen · Unknown screen
   3   Apr 28 20:13  3m 01s  FAIL    9.4.444  checkout-flow happy-path  Uncategorized · Checkout Screen
   4   Apr 28 20:05  2m 29s  FAIL    9.4.444  checkout-flow happy-path  Error Page · Non-Recoverable Error
   5   Apr 28 19:51  3m 09s  FAIL             checkout-flow happy-path  Element Not Found · Unknown screen
   6   Apr 28 19:44  2m 51s  FAIL             checkout-flow happy-path  Payment sheet didn't load · Unknown screen
   7   Apr 28 19:26          CRASH            (no scenario)
   8   Apr 28 19:20  3m 19s  FAIL             checkout-flow happy-path  Element Not Found · Checkout Screen
   9   Apr 28 19:12  2m 08s  FAIL             checkout-flow happy-path  Error Page · Blank WebView
  10   Apr 28 18:38  2m 18s  FAIL             checkout-flow happy-path  Error Page · Blank WebView

Nine of ten runs failed. Three failure categories all surfaced the same root cause (auth/session in test setup) — invisible in the old wall-of-logs world.


The future: self-healing pipelines

Here's what becomes possible once failures are structured, contextualized, and LLM-readable: an agent that scans failed jobs and proposes fixes.

Not autonomously. The guardrails really matter here.

The incentives have to stay aligned.

An AI that makes tests green by deleting the assertions that catch real bugs is worse than no AI at all — it's an actively dangerous pipeline that projects false confidence.

The right model:

  • Agent proposes fixes via PR. Human approves.
  • Known failure patterns get auto-remediation unlocked over time.
  • Novel failure signatures get flagged for human review.
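The routing logic those guardrails imply can be sketched in a few lines (the signature format, remediation names, and `route` helper are all hypothetical): only failure signatures that have been seen and remediated before are eligible for a proposed fix, and even then a human approves the PR.

```typescript
// Illustrative guardrail: known failure signatures map to a named
// remediation; anything novel is flagged for human review. In both cases
// a human stays in the loop — the agent never merges its own fix.
const KNOWN_REMEDIATIONS = new Map<string, string>([
  ["App Switch Failure/Switch Not Triggered", "restart-app-and-retry"],
  ["Error Page/Blank WebView", "clear-webview-cache"],
]);

function route(
  category: string,
  subcategory: string
): { action: string; needsHuman: boolean } {
  const sig = `${category}/${subcategory}`;
  const fix = KNOWN_REMEDIATIONS.get(sig);
  return fix
    ? { action: `propose-pr:${fix}`, needsHuman: true } // human approves the PR
    : { action: "flag-for-review", needsHuman: true };  // novel: human investigates
}
```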

The foundation I built — phases, structured logs, page registry, contextual error reports — makes this possible. Without it, you're feeding an LLM noise. With it, you're giving it a clear brief.


Before / After: a real run

Both runs heavily redacted. The shape is what matters: undifferentiated noise vs. phased, named, contextual output.

Before:

  ❯ ./run-e2e.sh --test <test-id>
  ==========================================
  Test Configuration:
    Flow: <flow>
    Integration: API
  ==========================================
  Checking iOS simulator status...
  Found booted simulator: iPhone 17 Pro Max

  > @org/e2e-runner@1.3.0 test
  > wdio run ./wdio.conf.ts

  [DEBUG] TAG_EXPRESSION: <test-id>
  [DEBUG] isNoAppTest: false
  [CONFIG] Running tests for platform: ios on simulator
  [CONFIG] BrowserStack mode: DISABLED
  [CONFIG] Release target: all
  [CONFIG] Video recording: ENABLED
  [CONFIG] Restart emulator per scenario: DISABLED
  INFO @wdio/cli:launcher: Run onPrepare hook
  WARN @wdio/config:ConfigParser: pattern did not match
  WARN @wdio/config:ConfigParser: pattern did not match
  Execution of 30 workers started
  [0-0] [DEBUG] TAG_EXPRESSION: <test-id>
  [0-0] [DEBUG] isNoAppTest: false
  [0-0] [CONFIG] Running tests for platform: ios on simulator
  [0-0] [CONFIG] BrowserStack mode: DISABLED
  [0-0] [CONFIG] Release target: all
  [0-0] [CONFIG] Video recording: ENABLED
  [0-0] INFO @wdio/cli: SKIPPED iOS(26.2)
  [0-0] INFO Run onWorkerEnd hook
  […] (~600 more lines: workers, retries, configs)
  [FAIL] 1 of 1 scenarios FAILED
  Error: cannot find element

After:

  ❯ npm test
  Capturing environment snapshot...
    Xcode 26.4.1
    Target: iPhone 17 Pro Max · iOS 26.1
    App: app.example.checkout
  Running preflight checks...
  ✓ 9/9 checks passed
  Environment ready (8.2s)

  ━━━ SETUP ━━━
  ▸ Session starting
    iPhone 17 Pro Max · iOS 26.1
    Driver: XCUITest
  ━━━ SETUP complete (32.7s) ━━━

  ━━━ APP-ENV ━━━
  ▸ [action] configure app environment
  ✓ Already configured — verified
  ━━━ APP-ENV complete (47.6s) ━━━

  ━━━ LOGIN ━━━
  ▸ [action] ensure user is logged in
  ✓ Already logged in — skipping
  ━━━ LOGIN complete (5.9s) ━━━

  ━━━ MERCHANT-CONFIG ━━━
  ▸ [action] navigate to checkout
  ✓ Already Stage — verified
  ━━━ MERCHANT-CONFIG complete (30.3s) ━━━

  ━━━ TEST ━━━
  ▸ [action] tap pay button
  ✓ tapped (1.3s)
  ▸ [action] expect app foreground (app switch)
  ✗ Switch Not Triggered — expected: PayPal, actual: Merchant

  ──── FAILURE ────
  Stage: TEST
  Category: App Switch Failure › Switch Not Triggered
  Owner: us
  Screen: Checkout (Merchant)
  Duration: 34.1s

  ──── COPY FOR LLM ────
  /triage-test-failure
  <failure>
  run_id: <run-id>
  category: App Switch Failure › Switch Not Triggered
  attached:
  - screenshot, page_source.xml, video.mp4
  - curated_log (app errors, GraphQL, auth flow)
  - trace_log (chronological actions)
  </failure>

  ✓ SETUP (32.7s) → ✓ APP-ENV (47.6s) → ✓ LOGIN (5.9s) → ✓ MERCHANT-CONFIG (30.3s) → ✗ TEST (41.7s)
  [FAIL] checkout-flow happy-path FAILED (run-id <run-id>)

The Before column above isn't a strawman. It's a real day, working a real flake — the wall of [CONFIG] and INFO @wdio lines hiding the one thing I actually needed. The After is the same scenario, same failure, after seven principles' worth of work.


Conclusion

The headline number — 13 minutes to 7 seconds — is real, and it's the wrong thing to focus on. Faster tests didn't fix the pipeline. They were a side effect of the actual fix: a system I could see into.

That's the lesson worth keeping. When you can't fix something, it's usually because you can't see it. The cycle of "increase the timeout, add a fallback selector, hope for the best" is what happens when a system has no way to tell you what it's doing. Give it phases, timestamps, page context, structured failure reports — and the same problems that felt unfixable last quarter become tractable engineering work this quarter.

A pipeline that communicates is one you can actually trust. Build that first; the speed and the AI assistance will follow.