Stop evaluating browser stacks by asking "does it support CDP?" or "can it click?"
Ask this instead: Can this system execute reliably when you have real load, real state, and real failure modes?
That's the line between a browser automation demo and a production agent workflow.
CDP and WebDriver are not the story
Yes, WebDriver or CDP-adjacent signals can factor into bot detection. But treating detection as a single flag is a category error.
Anti-bot systems are layered. They combine network reputation, fingerprints, behavior, and challenge flows. You can "fix" one surface signal and still fail because your traffic looks wrong, your timing is unnatural, or your recovery behavior is inconsistent.
CDP is not a hack. It's a control surface used by real tooling. Playwright and Puppeteer speak to browsers through CDP under the hood. The protocol you use is usually not the root cause of production flakiness.
Execution is.
The commodity layer: actions
The feature checklist most teams start with is now table stakes:
Open, click, fill, screenshot
Navigation + waits
File download/upload
Headless or remote modes
Most open source stacks and hosted browsers can do this. If that's all you need, use the simplest tool that fits your language and team.
The hard layer: scale
At low volume, you can brute-force success: rerun failures manually, patch selectors on the fly, babysit logins, accept flaky waits.
At 100, 1,000, or 10,000 concurrent runs, tiny issues compound:
Session churn: Logins expire. Cookies drift. Storage corrupts. Targets deploy UI changes mid-run.
Flaky recovery: A retry that works once can amplify failures under concurrency.
Long-tail latency: Throughput and cost are dominated by p95 and p99 step times, not averages.
Debugging at 2 a.m.: If you can't reproduce what happened, you can't fix it quickly.
Multi-agent workflows raise the stakes. When many agents run in parallel, queuing and latency variance become correctness problems. One slow step cascades into timeouts, missed checkpoints, and session resets across the fleet.
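A retry loop that re-fires immediately is exactly the failure amplifier described above: under concurrency, every transient error multiplies load on the target. A minimal sketch of the alternative, with hypothetical names (`backoff_delay`, `run_step_with_retry`, and the constants are ours, not any library's API): cap attempts, add jittered exponential backoff, and gate runs behind a semaphore so retries queue instead of stampeding.

```python
import random
import threading

MAX_ATTEMPTS = 4
BASE_DELAY_S = 0.5
MAX_DELAY_S = 8.0
FLEET_LIMIT = threading.Semaphore(10)  # at most 10 runs touch the target at once

def backoff_delay(attempt: int) -> float:
    """Full jitter: random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(MAX_DELAY_S, BASE_DELAY_S * 2 ** attempt))

def run_step_with_retry(step):
    """Run `step` up to MAX_ATTEMPTS times, backing off between failures.

    Returns (result, delays) so callers can log how much backoff was spent.
    """
    delays = []
    for attempt in range(MAX_ATTEMPTS):
        with FLEET_LIMIT:  # backpressure: a burst of retries waits its turn
            try:
                return step(), delays
            except Exception:
                if attempt == MAX_ATTEMPTS - 1:
                    raise  # out of attempts: fail loudly, don't loop forever
        delays.append(backoff_delay(attempt))
        # a real runner would time.sleep(delays[-1]) here
```

The jitter matters as much as the exponent: without it, a fleet of agents that failed together retries together, and the second wave looks identical to the first.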
This is why the space feels commoditized from the outside. The primitives look the same. The operational behavior is not.
What this looks like in practice
A 12-step workflow: log in, navigate a dashboard, hit a challenge gate, wait on a slow-loading report, extract data, and continue. It fails at step 7.
In a commodity stack, you restart the entire run. The agent re-does the login, re-navigates, re-waits, and might fail at step 7 again for the same reason.
In an execution layer, you inspect the session artifacts to see why step 7 failed, resume from preserved session state, and skip the six steps that already succeeded. This becomes especially powerful when agents are driving the browser. Steel exposes session-level logs and preserved session state, so an agent can detect failures, choose a recovery path, and continue without a full restart.
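The resume logic above can be sketched in a few lines. This is an illustration under assumed names (`Checkpoint` and `run_resumable` are ours, not Steel's SDK surface): persist the index of the last completed step plus whatever session state it produced, so a rerun after a failure at step 7 skips steps 1 through 6 instead of redoing the login and navigation.

```python
class Checkpoint:
    """Minimal checkpoint store; a real one would live outside the process."""
    def __init__(self):
        self.completed = 0   # number of steps already finished
        self.state = {}      # preserved session state (cookies, ids, ...)

    def save(self, step_index, state):
        self.completed = step_index
        self.state = dict(state)

def run_resumable(steps, checkpoint):
    """Run `steps` (callables taking and returning a state dict),
    starting from the checkpoint and updating it after every success."""
    state = dict(checkpoint.state)
    for i in range(checkpoint.completed, len(steps)):
        state = steps[i](state)          # a failure here leaves the
        checkpoint.save(i + 1, state)    # checkpoint at the last good step
    return state
```

The contract is what matters, not the code: each step is a pure function of the preserved state, so "resume from step 7" is just "start the loop at index 6 with the saved state."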
What to buy (and what not to)
You're not buying "browser control." You're buying an execution layer.
A production-grade agent browser platform should give you:
Deterministic session lifecycle: Create, attach, hand off, resume, terminate. Cleanly and predictably.
Resumability: Recover from partial failure without redoing the entire workflow.
Reviewability: Traces, artifacts, and structured outputs so humans can verify outcomes.
Explicit contracts: Tools that fail loudly with typed errors and predictable retries.
Concurrency you can reason about: Backpressure, limits, and stable performance under load.
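"Deterministic lifecycle" and "fail loudly" are concrete properties, not slogans. One way to picture them, using purely illustrative names (`SessionLifecycle` and `LifecycleError` are not Steel's actual API): the legal transitions form an explicit table, and anything off the table raises a typed error instead of leaving a session in limbo.

```python
class LifecycleError(Exception):
    """Raised on an illegal transition, e.g. resuming a terminated session."""

TRANSITIONS = {  # state -> actions that are legal from it
    "created":    {"attach", "terminate"},
    "attached":   {"handoff", "suspend", "terminate"},
    "handoff":    {"attach", "terminate"},
    "suspended":  {"resume", "terminate"},
    "terminated": set(),  # terminal: nothing is legal
}

TARGET = {  # action -> state it lands in
    "attach": "attached", "handoff": "handoff", "suspend": "suspended",
    "resume": "attached", "terminate": "terminated",
}

class SessionLifecycle:
    def __init__(self):
        self.state = "created"

    def apply(self, action: str) -> str:
        if action not in TRANSITIONS[self.state]:
            raise LifecycleError(f"cannot {action} from {self.state}")
        self.state = TARGET[action]
        return self.state
```

The point of the table is that it is exhaustive: callers can enumerate every reachable state, which is what "attach, hand off, resume, terminate, cleanly and predictably" has to mean in code.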
That's the layer Steel is built around.
Steel provides cloud Sessions with persistent state, resumability, isolation, and observability so agents can recover, hand off work, and run predictably under concurrency. The goal isn't "more APIs." It's fewer 3 a.m. pages and faster recovery when the web changes.
Honest limitations
Steel won't magically solve detection if your workflow looks robotic. It won't bypass captchas without challenge-solving infra. It won't turn bad selectors into resilient ones.
What it will do:
When a site changes, you can see exactly what broke.
When a step slows down, you can measure it.
When a run fails halfway through, you can resume instead of restart.
When you scale concurrency, behavior remains predictable.
If you're running a handful of workflows, a standard automation library run locally is often the right call. You'll move fast and manual reruns are fine. Steel's open source browser and cloud platform share the same API, so you can start local and scale to managed sessions when the constraint shifts to session churn, concurrency, and latency. No rewrite required.
Next step
Before choosing a browser stack, take one production workflow and run it 100 times at your target concurrency. Measure p95/p99 step latency, restart rate, manual interventions, and time-to-diagnose.
That's where the differences show up.
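A minimal harness for that benchmark, under assumptions: each run reports its per-step latencies and whether it needed a restart (the `summarize` shape is ours). Percentiles here use the nearest-rank method; swap in your metrics stack if you already have one.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of the sample."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

def summarize(runs):
    """`runs` is a list of dicts: {"step_latencies": [...], "restarted": bool}."""
    latencies = [t for r in runs for t in r["step_latencies"]]
    return {
        "p95_step_s": percentile(latencies, 95),
        "p99_step_s": percentile(latencies, 99),
        "restart_rate": sum(r["restarted"] for r in runs) / len(runs),
    }
```

Track the same three numbers across candidate stacks at identical concurrency; averages will look similar, and the tails will not.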