Seven principles for fixing a flaky E2E suite
A field report on rebuilding the E2E test automation around one feature of the PayPal Mobile App — what broke in the harness, what I changed, and the principles I'd reach for again.
The pipeline took 7–13 minutes to fail. The same build returned different results on consecutive runs. Engineers stopped trusting it; failures got re-run instead of investigated; "fixed" bugs kept coming back. A test suite that nobody listens to isn't a slow signal — it's no signal.
This wasn't a testing problem. It was an observability and feedback-loop problem. The suite was blind, slow, and mute: it couldn't say where it was, each debugging cycle burned minutes, and failures arrived as walls of noise instead of actionable reports. Once I framed it that way, the path forward became clear — and produced the seven principles below.
The same pipeline, before and after applying the seven principles.
Why it broke: two anti-patterns I had to kill first
Before explaining how I fixed it, it's worth naming what caused the mess. These two behaviors were quietly destroying the pipeline:
The timeout hammer
Every time a test got flaky, someone increased the timeout. It became the default fix for everything. The result: timeouts stacking on top of timeouts, single scenarios ballooning to 7–13 minutes, and the root cause never actually addressed.
Selector cascade chains
When a selector wasn't found, developers added fallback selectors — alternates that would sometimes show up. This felt like a fix, but it solved the problem at the wrong abstraction layer. It made tests brittle, hard to read, and completely masked the real problem: the test had no idea where in the app it actually was.
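In code, a cascade looked something like this. A minimal sketch with hypothetical selectors, assuming a WebdriverIO-style API:

```typescript
import { $ } from '@wdio/globals';

// Anti-pattern sketch: fallback selectors layered on top of each other.
// The selector strings are hypothetical; the shape is what matters.
async function findLoginButton() {
  const candidates = [
    '~login-button',                           // accessibility id
    '~btn-login',                              // older id that "sometimes shows up"
    '//XCUIElementTypeButton[@name="Log In"]', // XPath of last resort
  ];
  for (const selector of candidates) {
    const el = await $(selector);
    if (await el.isExisting()) return el;
  }
  // Still the only thing this can say: no idea which screen we are even on.
  throw new Error('cannot find selector');
}
```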
Both patterns shared a root cause — the suite had no way to explain what was happening, so engineers patched around the symptoms. The fix required giving it eyes and a voice.
Act 1 — Speed the loop
Tighten the feedback loop. This is the #1 priority.
Before anything else, cycle time is the enemy. When you're debugging a flaky pipeline, you're going to run it dozens of times. The math makes this non-negotiable.
Let's assume it takes 50 cycles to reach a stable solution.
Stack that up. At the baseline, 150 seconds × 50 cycles is 125 minutes — roughly two hours of waiting on the same problem. With preflight checks, that same loop is 7 seconds × 50 cycles, or 5 minutes 50 seconds. Same number of debugging attempts, an order of magnitude less time staring at the screen.
Per failure, that's three steps down: 2.5 minutes → 25 seconds → 7 seconds. Multiplied across 50 cycles, that's two hours collapsing to under six minutes.
Two hours versus six minutes. The speed of your loop determines how fast you can learn.
I attacked this in two moves:
Fail fast. I removed retries on failure classes that a retry could never fix. If the simulator isn't found, retrying three times doesn't help. Time to failure dropped from 2.5 minutes → 25 seconds.
Add preflight checks. I moved known failure conditions to the very start of the run — things like "simulator not found" or "app not installed." If the run is going to fail, it fails immediately. Time to failure: 25 seconds → 7 seconds.
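A sketch of what that preflight phase can look like. The specific checks, commands, and paths here are illustrative, not the actual harness:

```typescript
import { execSync } from 'node:child_process';
import { existsSync } from 'node:fs';

// Cheap, known-fatal checks that run before anything expensive boots.
// The exact commands and paths are assumptions (iOS simulator, local app bundle).
const preflights = [
  {
    name: 'simulator booted',
    hint: 'boot the simulator before starting the run',
    check: () => execSync('xcrun simctl list devices booted').toString().includes('(Booted)'),
  },
  {
    name: 'app bundle present',
    hint: 're-run the build/install step',
    check: () => existsSync(process.env.APP_PATH ?? './build/App.app'),
  },
];

export function runPreflights(): void {
  for (const p of preflights) {
    if (!p.check()) {
      // Fail in seconds, not minutes. No retries: a retry cannot fix a broken environment.
      throw new Error(`Preflight failed: ${p.name}. Hint: ${p.hint}`);
    }
  }
}
```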
That single principle — minimize cycle time — unlocked everything else.
Act 2 — Make the system legible
Once the loop was fast enough to iterate, I needed to actually understand what I was looking at.
Break runs into clearly marked phases
You can't categorize and solve systemic problems if you have no sense of structure. Different phases have different failure profiles. I drew hard lines:
- Preflight checks — fails in 7s if env is broken
- Hardware / env setup
- Test setup (users, login, session)
- Test execution
- Teardown
When something fails, you immediately know where it failed. That matters enormously for triage.
Heavily simplified — a real run is ~10× longer.
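In code, the hard lines can be as simple as a phase wrapper. A minimal sketch, not the actual harness:

```typescript
// Every run is a fixed sequence of named phases with hard boundaries.
// A failure carries the phase it happened in, which is what makes triage fast.
type Phase = 'preflight' | 'env-setup' | 'test-setup' | 'execution' | 'teardown';

async function runPhase<T>(phase: Phase, body: () => Promise<T>): Promise<T> {
  const start = Date.now();
  console.log(`[${new Date().toISOString()}] >> ${phase}`);
  try {
    return await body();
  } catch (err) {
    // Tag the error with the phase so the report can say where it broke, not just what broke.
    throw new Error(`[${phase}] ${(err as Error).message}`);
  } finally {
    console.log(`[${new Date().toISOString()}] << ${phase} (${Date.now() - start} ms)`);
  }
}
```

A run then becomes five runPhase calls in sequence, and every failure already carries the phase it belongs to.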
Invest heavily in reporting. Timestamps everywhere.
When I arrived, every run produced a wall of undifferentiated logs. There was no way to correlate what I saw on the simulator with what the logs said.
I wired in live step reporting — timestamped, human-readable logs that narrate exactly what's happening on screen in real time. This solved two problems: debugging live runs, and reconstructing what happened after a failure. Bonus: it's also clean input for an LLM to reason about.
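The reporter itself doesn't have to be elaborate. A sketch of the idea, with the output format as my own assumption:

```typescript
// Narrates what the test is doing, in real time, with timestamps.
// The same records later become the TIMELINE summary.
const steps: { name: string; startedAt: number; durationMs: number }[] = [];

export async function step<T>(name: string, body: () => Promise<T>): Promise<T> {
  const startedAt = Date.now();
  console.log(`[${new Date(startedAt).toISOString()}] STEP ${name}`);
  try {
    return await body();
  } finally {
    steps.push({ name, startedAt, durationMs: Date.now() - startedAt });
  }
}

// Printed at the end of every run, and on failure: the TIMELINE block.
export function printTimeline(): void {
  console.log('TIMELINE');
  for (const s of steps) {
    console.log(`  ${new Date(s.startedAt).toISOString()}  ${String(s.durationMs).padStart(7)} ms  ${s.name}`);
  }
}
```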
The TIMELINE block is the headline. At a glance you can see which stage burned the time, which stage failed, and how the rest compared. That's a TIMELINE doing the job a wall of [CONFIG] lines never could.
Build the right primitives — the "where am I" assertion
The most common failure message in native mobile testing is: cannot find selector.
That error is useless. It's like getting a call from someone who is blindfolded saying, "I can't find the door handle". You don't know what room they're in, let alone whether they're in the right house. Helping them becomes almost impossible without that extra context.
I fixed this by building a page/view registry:
- Every screen in the app is registered.
- After every navigation event (button tap, back swipe, deep link), assert: "Am I on the page I'm supposed to be on?"
- On every test failure, print full "where am I" context.
Now when a test fails looking for a login button, I know immediately whether it's on the login screen, an error page, or somewhere completely unexpected. cannot find selector becomes expected: LoginScreen, actual: ErrorPage — which is actually actionable.
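Here's a minimal sketch of the registry and the assertion. The screen names and marker selectors are hypothetical, and the real registry covers every screen in the app:

```typescript
import { $ } from '@wdio/globals';

// Every screen registers one marker selector that uniquely identifies it.
// "Where am I" = find which registered screen is currently visible.
const screens = {
  LoginScreen: '~login-screen',
  HomeScreen:  '~home-screen',
  ErrorPage:   '~error-banner',
} as const;

type ScreenName = keyof typeof screens;

async function currentScreen(): Promise<ScreenName | 'Unknown'> {
  for (const [name, marker] of Object.entries(screens)) {
    if (await $(marker).isDisplayed()) return name as ScreenName;
  }
  return 'Unknown';
}

// Called after every navigation event: button tap, back swipe, deep link.
export async function assertOnScreen(expected: ScreenName): Promise<void> {
  const actual = await currentScreen();
  if (actual !== expected) {
    throw new Error(`Wrong screen. expected: ${expected}, actual: ${actual}`);
  }
}
```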
This also made the selector cascade anti-pattern unnecessary. You don't need fallback selectors if you know exactly where you are before you reach for one.
Expected/actual page, hint, and nav path collapse triage from 'find the cause' to 'check the auth flow'.
Act 3 — Make failures actionable, track everything
Now that the system is legible, failures need to mean something.
Failures should light a path to resolution
A failure report isn't done when it tells you what broke. It's done when it tells you where to look next.
In an LLM-assisted workflow, that means:
- An "investigate this" prompt the developer can paste directly into an LLM.
- Logs, screenshots, and error messages attached as context.
- A full view-hierarchy dump (the mobile equivalent of a DOM dump) at point of failure.
- Network requests and payloads, so an LLM can identify whether the failure was a frontend issue or a backend response problem.
The failure becomes a briefing, not a mystery.
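Assembling that briefing can be mostly mechanical. A sketch assuming a WebdriverIO-style driver; the network capture and the prompt wording are placeholders:

```typescript
import { browser } from '@wdio/globals';

// Everything an engineer (or an LLM) needs to start investigating, captured at the moment of failure.
interface FailureBriefing {
  error: string;
  screenshotPath: string;
  viewHierarchy: string;     // the mobile equivalent of a DOM dump
  recentRequests: unknown[]; // network traffic, if a capturing proxy is in place
  investigatePrompt: string; // paste-ready prompt for an LLM
}

export async function buildBriefing(err: Error): Promise<FailureBriefing> {
  const screenshotPath = `./artifacts/failure-${Date.now()}.png`;
  await browser.saveScreenshot(screenshotPath);
  const viewHierarchy = await browser.getPageSource();
  return {
    error: err.message,
    screenshotPath,
    viewHierarchy,
    recentRequests: [], // assumption: populated by whatever request capture the harness runs
    investigatePrompt:
      `A mobile E2E test failed with "${err.message}". ` +
      `Using the attached view hierarchy, screenshot, and network log, determine whether this is a ` +
      `frontend issue, a backend response problem, or a test-harness issue, and say where to look next.`,
  };
}
```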
Name and eradicate anti-patterns actively
Once the system is legible, the damaging patterns become visible. Name them. Block them in code review. I'm formalizing explicit team rules now:
- No timeout hammer. Increasing a timeout is not a bug fix — it's a smell. PRs that bump a timeout without an attached root cause require justification.
- No selector cascades. Fallback-selector chains are banned. If a selector is unstable, fix the page registry, not the call site.
The goal: bake these into how the team talks about test quality, not just what reviewers look for in a diff.
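The rules can also get a little tooling behind them. A rough sketch of a CI diff check that flags both patterns for extra review; the regexes are heuristics of mine, not the team's actual gate:

```typescript
// Scans a PR diff and flags the two banned patterns for human attention.
// It does not block merges on its own; it forces the conversation.
const bannedPatterns: { name: string; regex: RegExp }[] = [
  { name: 'timeout hammer',   regex: /^\+.*timeout:\s*\d{5,}/m },                       // a newly added 10s+ timeout
  { name: 'selector cascade', regex: /^\+.*(fallbackSelectors|selectorCandidates)/m },  // illustrative names
];

export function reviewDiff(diff: string): string[] {
  return bannedPatterns
    .filter(p => p.regex.test(diff))
    .map(p => `Possible ${p.name}: attach a root cause or fix the page registry instead.`);
}
```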
Keep a ledger
Before this work, I had no way to track failure trends over time. No categories, no stage tagging, no history.
Now every run is logged with stage, category, reason, and links to logs and screenshots. That turns anecdotal "it feels flaky" into "68% of failures happen in test setup, and half of those are auth-related." The difference between guessing and diagnosing.
Scenario: checkout-flow happy-path — captured during one debugging session.
Nine of ten runs failed. Three failure categories all surfaced the same root cause (auth/session in test setup) — invisible in the old wall-of-logs world.
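The ledger itself can stay simple. A sketch of the record shape and an append-only log; the field names are mine, not the actual schema:

```typescript
import { appendFileSync } from 'node:fs';

// One record per run: enough structure to answer "what fails, where, and how often".
interface RunRecord {
  runId: string;
  scenario: string;
  result: 'passed' | 'failed';
  failedStage?: 'preflight' | 'env-setup' | 'test-setup' | 'execution' | 'teardown';
  category?: string;      // e.g. "auth/session", "selector", "backend-5xx"
  reason?: string;        // one-line, human-readable cause
  logsUrl?: string;
  screenshotUrl?: string;
  timestamp: string;
}

export function recordRun(record: RunRecord): void {
  // Append-only JSONL: trivially greppable, trivially chartable.
  appendFileSync('./artifacts/ledger.jsonl', JSON.stringify(record) + '\n');
}
```

A claim like "68% of failures happen in test setup" then comes from one pass over a JSONL file instead of an impression.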
The future: self-healing pipelines
Here's what becomes possible once failures are structured, contextualized, and LLM-readable: an agent that scans failed jobs and proposes fixes.
Not autonomously. The guardrails really matter here.
The incentives have to stay aligned.
An AI that makes tests green by deleting the assertions that catch real bugs is worse than no AI at all — it's an actively dangerous pipeline that projects false confidence.
The right model:
- Agent proposes fixes via PR. Human approves.
- Known failure patterns get auto-remediation unlocked over time.
- Novel failure signatures get flagged for human review.
The foundation I built — phases, structured logs, page registry, contextual error reports — makes this possible. Without it, you're feeding an LLM noise. With it, you're giving it a clear brief.
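The dispatch logic that keeps those guardrails in place stays deliberately boring. A sketch with illustrative signature names and thresholds:

```typescript
// Known, well-understood failure signatures can earn auto-remediation over time.
// Everything else goes to a human, and nothing merges without review.
type Disposition = 'auto-remediate-via-pr' | 'propose-fix-via-pr' | 'flag-for-human';

const knownSignatures = new Set(['simulator-not-found', 'stale-test-user-session']); // illustrative

export function dispatch(signature: string, confidence: number): Disposition {
  if (knownSignatures.has(signature) && confidence > 0.9) return 'auto-remediate-via-pr';
  if (confidence > 0.5) return 'propose-fix-via-pr';
  return 'flag-for-human';
}
```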
Before / After: a real run
Both runs heavily redacted. The shape is what matters: undifferentiated noise vs. phased, named, contextual output.
The Before column above isn't a strawman. It's a real day, working a real flake — the wall of [CONFIG] and INFO @wdio lines hiding the one thing I actually needed. The After is the same scenario, same failure, after seven principles' worth of work.
Conclusion
The headline number — 13 minutes to 7 seconds — is real, and it's the wrong thing to focus on. Faster tests didn't fix the pipeline. They were a side effect of the actual fix: a system I could see into.
That's the lesson worth keeping. When you can't fix something, it's usually because you can't see it. The cycle of "increase the timeout, add a fallback selector, hope for the best" is what happens when a system has no way to tell you what it's doing. Give it phases, timestamps, page context, structured failure reports — and the same problems that felt unfixable last quarter become tractable engineering work this quarter.
A pipeline that communicates is one you can actually trust. Build that first; the speed and the AI assistance will follow.