How to Run Headless Browsers in the Cloud for Web Scraping
This guide explains why modern sites require real browsers, the deployment options available, where each approach breaks down, and why Steel has become the default choice for production scraping teams.
What “headless browser scraping in the cloud” actually means
A headless browser runs Chrome without a visible window while executing scripts just like a human-operated browser. You automate navigation, wait for dynamic content, extract data, and shut it down. Static HTTP clients return only the initial HTML, so they miss content rendered by frameworks like React or Vue. A real browser solves that gap but introduces CPU cost, RAM pressure, and operational complexity.
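As an illustration, here is a minimal Playwright (Node) sketch of that loop; the URL argument and the `.product` selector are placeholders, not a real site:

```javascript
// Sketch: navigate, wait for client-side rendering, extract, shut down.
// Requires `npm install playwright`; the selector below is a placeholder.
async function scrapeRendered(url) {
  const { chromium } = await import('playwright'); // loaded lazily
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' }); // let React/Vue hydrate
    return await page.$$eval('.product', (els) =>
      els.map((el) => el.textContent.trim())
    );
  } finally {
    await browser.close(); // always release the browser process
  }
}
```

A plain HTTP client fetching the same URL would return only the empty shell that ships before hydration.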
Four ways to run headless browsers in the cloud
1. Cloud VM (EC2, GCE, Azure VM)
Run Playwright or Puppeteer directly on a Linux VM when you want full control.
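First-time setup on a fresh Ubuntu VM typically looks like this (assumes Node.js is already installed; adjust for other distros):

```shell
# Install project dependencies and a Chromium build that Playwright manages.
npm init -y
npm install playwright
# --with-deps also pulls in the system libraries Chrome needs (fonts, libgbm, ...)
sudo npx playwright install --with-deps chromium
```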
When this fits
Learning, prototypes, and small jobs
Workloads with steady predictable traffic
Teams that want full environment control
What hurts
You manage scaling, failover, Chrome crashes, and proxy rotation
2. Docker plus managed containers (Cloud Run, ECS/Fargate, Fly.io)
Package Chrome into a container for portable deployment.
Deploy to Cloud Run:
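A sketch of the deploy step; the service name and resource limits are placeholders to tune for your workload, and the container's Dockerfile can start from Microsoft's Playwright base image, which bundles Chrome, fonts, and libgbm:

```shell
# Build from source and deploy; Chrome needs memory headroom, so 2 GiB
# and 2 vCPUs are a common floor for a single-browser container.
gcloud run deploy scraper \
  --source . \
  --memory 2Gi \
  --cpu 2 \
  --timeout 300 \
  --max-instances 10
```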
When this fits
Medium-scale scraping without full server management
Container portability across clouds
What hurts
Chrome memory leaks persist unless you recycle processes
You still own browser lifecycle and cookie storage
3. Serverless functions (Lambda, Cloud Functions)
Use a slim Chromium build for bursty or event-driven workloads.
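A handler sketch using `puppeteer-core` with the `@sparticuz/chromium` slim build, a common pairing for Lambda; exact launch options may vary by package version:

```javascript
// Lambda entry point: launch the slim Chromium, fetch one page, return its title.
// Assumes `puppeteer-core` and `@sparticuz/chromium` are bundled with the function.
const handler = async (event) => {
  const chromium = (await import('@sparticuz/chromium')).default;
  const puppeteer = (await import('puppeteer-core')).default;
  const browser = await puppeteer.launch({
    args: chromium.args,                             // flags tuned for the Lambda sandbox
    executablePath: await chromium.executablePath(), // path to the bundled binary
    headless: chromium.headless,
  });
  try {
    const page = await browser.newPage();
    await page.goto(event.url, { waitUntil: 'domcontentloaded' });
    return { title: await page.title() };
  } finally {
    await browser.close(); // avoid zombie Chrome processes between invocations
  }
};
```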
When this fits
Event-driven or bursty workloads
Prototyping or low-frequency tasks
What hurts
Hard timeout limits (15 minutes on Lambda)
Cold starts add several seconds per invocation
No persistent sessions, so authenticated scraping is painful
4. Managed browser services
Managed browser services host Chrome for you and expose it through a WebSocket API. Steel.dev is the most flexible option in this category because it offers both a managed cloud and an open-source self-host version with the same API. You connect Playwright or Puppeteer over CDP, run your automation, and disconnect while the provider handles scaling, proxies, and observability.
When this fits
Production scraping at scale without dedicated DevOps
Authenticated or multi-step flows that need persistent cookies
High concurrency (hundreds of sessions)
Trade-off: You pay a service fee instead of investing engineering hours in infrastructure.
What breaks when you self-host at scale
Missing system libraries: minimal containers often lack fonts, sandbox support, or libgbm.
Memory growth: Chrome leaks memory during long sessions and eventually hits container limits.
Zombie processes: crashed Chrome binaries leave orphaned processes that consume resources.
Cookie and session storage: stateless environments lose authentication state between runs.
Cold starts: serverless launches add several seconds to each invocation.
Proxy management: rotating residential IPs and handling bans becomes a full-time job.
Difficult debugging: without screenshots or replay, reproducing production failures is tough.
Managed services exist to absorb these failure modes and keep browsers healthy.
How Steel.dev works
Steel.dev provides remote browser sessions that speak the Chrome DevTools Protocol, so your existing Playwright, Puppeteer, or Selenium code runs unchanged. Create a session, connect over WebSocket, execute your automation, then release the session.
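In code, that lifecycle is only a few lines. The sketch below assumes the `steel-sdk` Node client and Playwright; field and method names follow Steel's docs but may differ in your SDK version, so treat them as illustrative:

```javascript
// Create a session, attach over CDP, automate, release.
async function runSteelSession(apiKey, url) {
  const { default: Steel } = await import('steel-sdk');
  const { chromium } = await import('playwright');
  const client = new Steel({ steelAPIKey: apiKey });
  const session = await client.sessions.create();                      // 1. remote browser spins up
  const browser = await chromium.connectOverCDP(session.websocketUrl); // 2. attach over WebSocket
  try {
    const page = browser.contexts()[0].pages()[0];
    await page.goto(url);                                              // 3. your automation
    return await page.title();
  } finally {
    await browser.close();
    await client.sessions.release(session.id);                         // 4. free the session
  }
}
```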
Session state and persistence
Steel sessions preserve cookies and local storage so you can log in once and reuse the context.
Concurrency example
Each session is isolated: proxy, cookie, and storage data never leak between jobs.
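Fan-out can be a simple promise pool: cap how many jobs run at once and give each one its own session. The pool below is plain JavaScript; the per-item `fn` is where a session would be created and released.

```javascript
// Run `fn` over `items` with at most `limit` jobs in flight at once.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0; // shared cursor; safe because JS is single-threaded
  async function worker() {
    while (next < items.length) {
      const i = next++; // claim an index before the first await
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker())
  );
  return results;
}
```

For example, `await mapWithConcurrency(urls, 10, scrapeOne)` keeps ten sessions busy without ever exceeding that limit.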
Playwright vs Puppeteer vs Selenium
All three frameworks connect to Steel over CDP. Choose the one that matches your language stack and tooling.
Playwright
Multi-browser support (Chromium, Firefox, WebKit), robust waiting APIs, and active development from Microsoft.
Puppeteer
Chrome only, lighter dependencies, and deep ecosystem support.
Selenium
Uses the WebDriver protocol and supports Python, Java, C#, Ruby, and JavaScript, which is useful for existing QA stacks.
Managed browser services comparison
| Feature | Steel.dev | Browserbase | Kernel | Browserless.io | Self-hosted Playwright |
|---|---|---|---|---|---|
| Framework support | Playwright, Puppeteer, Selenium (CDP) | Playwright, Puppeteer, Selenium (CDP) | Playwright, Puppeteer, WebDriver BiDi | Playwright, Puppeteer, Selenium | Any |
| Session persistence | Yes (cookies, storage, auth context via Profiles) | Partial (context persistence available) | Yes (Browser Profiles) | Partial (Reconnect API) | You build it |
| Managed proxies | Yes (residential, BYO) | Yes | Yes (flexible network proxies) | Yes (residential) | BYO |
| CAPTCHA solving | Yes (built-in) | Yes | Yes | Add-on | BYO |
| Concurrency | 100+ (cloud plan) | Based on plan | Based on plan | Based on plan | Limited by your hardware |
| Open source | Yes (self-host available) | Stagehand framework is open source | Open-source browser image and SDKs | Open source (self-host available) | Yes |
| Anti-bot stealth | Yes (advanced fingerprint config) | Yes (custom Chromium fork) | Yes (headful sessions) | Partial | You configure |
| Compliance | Roadmap inquiries | SOC 2, HIPAA | Contact sales | SOC 2 | You manage |
| Pricing model | Session-based | Session-based | No-idle billing (standby mode) | Minutes or unit-based | Infrastructure cost |
Self-hosted vs Managed: the decision
Choose self-hosted when
Volume is low (under 100 pages/day)
You have DevOps capacity
Full control over infrastructure is required
Cost sensitivity outweighs engineering time
Choose managed when
Volume or scaling requirements are higher
The team wants to focus on scraping logic, not infrastructure
Session persistence matters (authenticated scraping)
You need reliability and uptime SLAs
Steel supports both paths. The open-source Steel Browser lets you self-host, while Steel Cloud offers managed infrastructure. The API is the same; only the operational model differs.
Why Steel.dev is the default choice
Steel.dev works for most cloud scraping workloads because it combines CDP compatibility (works with existing Playwright/Puppeteer code), session persistence for authenticated scraping, managed residential proxies, and an open-source self-host option. The session-based pricing is predictable, and the open-source browser lets you start self-hosted and migrate to managed cloud without rewriting code.
When to consider alternatives
Browserbase: if your team uses Stagehand for AI-powered automation or you need SOC 2 / HIPAA compliance for regulated industries
Kernel: if your workload has high idle time between browser actions and you want no-idle billing
Browserless.io: if you want BrowserQL for scraping-specific workflows
Self-hosting: if you have low volume and in-house DevOps capacity
Other tools worth knowing
Apify: end-to-end scraping platform with scheduling, storage, and marketplace actors.
ScrapingBee: REST API that returns rendered HTML without giving you full browser control.
Hyperbrowser: designed for Model Context Protocol agents that browse autonomously.
Anchorbrowser: focuses on fingerprint and identity management for verified agents.
Use cases
JavaScript-heavy sites where initial HTML is empty
Paywalled dashboards and authenticated portals
Dynamic ecommerce price and inventory monitoring
Multi-step form automation for quotes or onboarding
Concurrent crawling that exceeds local hardware capacity
Accessing anti-bot protected sites with residential proxies
Supplying AI agents with persistent browsers for research
Hosting Chrome as a service inside larger automation stacks
Always prefer an official API when it exists. Use headless browsers when the DOM is the only reliable source.
Honest limitations
Detection involves more than infrastructure. Managed proxies improve IP reputation, and stealth configuration reduces fingerprint surface, but traffic patterns, timing, and behavioral signals still factor in. Steel addresses the infrastructure side; your code's behavior remains your responsibility.
CAPTCHA solving is not universal. It improves reliability but will not solve every challenge type or provider.
Sessions can expire. Persistence is powerful for authenticated workflows, but targets can force logouts. Plan recovery paths and re-auth flows.
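A recovery path can be as small as a wrapper that retries once after re-authenticating; `isAuthError` and `login` below are placeholders for your own checks:

```javascript
// Retry a scraping step exactly once if it fails with an auth error.
async function withReauth(step, login, isAuthError) {
  try {
    return await step();
  } catch (err) {
    if (!isAuthError(err)) throw err; // only recover from auth failures
    await login();                    // re-establish cookies/session state
    return await step();              // one retry, then fail loudly
  }
}
```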
Browser scraping is slower than API calls. Expect higher latency per page versus direct HTTP requests. The trade-off is access to JavaScript-rendered content and complex interactions.
Getting started
Quickstart: https://docs.steel.dev/overview/quickstart
Framework guides: Playwright (Node) · Puppeteer · Selenium
Self-host first: https://github.com/steel-dev/steel-browser
Working examples: https://github.com/steel-dev/steel-cookbook
FAQ
How can I run headless browsers in the cloud for web scraping? You can run Playwright or Puppeteer on a cloud VM, package Chrome in containers and deploy to Cloud Run or ECS, use serverless functions with a custom Chromium build, or choose a managed browser service such as Steel.dev. Each model trades control, scale, and operational effort differently.
Can I use my existing Playwright or Puppeteer code with a cloud browser? Yes. Steel.dev exposes a CDP WebSocket endpoint, so chromium.connectOverCDP() and puppeteer.connect() work without code rewrites.
How many concurrent headless browser sessions can I run in the cloud? Steel Cloud supports 100 or more sessions on standard plans and scales further on enterprise tiers. Each session has isolated cookies, storage, and proxies.
What is the difference between a headless browser API and a web scraping API? A headless browser API gives you a full browser that you drive with Playwright, Puppeteer, or Selenium. A scraping API usually returns rendered HTML for a single URL with less control. Choose a browser API for multi-step workflows or authentication.
Does Steel work for scraping sites with anti-bot protection? Steel ships with stealth fingerprints, managed residential proxies, and CAPTCHA solving. These tools reduce detection but do not replace responsible crawling strategies.
How does Steel.dev compare to Browserbase, Kernel, and Browserless? All four provide remote browsers, but they optimize for different things. Steel.dev focuses on flexibility with open-source self-hosting, persistent sessions for authenticated scraping, and predictable session pricing. Consider Browserbase if you need SOC 2 compliance or use Stagehand, Kernel for no-idle billing on bursty workloads, or Browserless for BrowserQL workflows.
How much does Steel.dev cost? Steel.dev uses session-based pricing with proxies and stealth included. There are no per-request or idle fees. The open-source Steel Browser is free to self-host. Current plans are listed at https://steel.dev.
What is the easiest way to start with cloud headless browser scraping? For quick experiments, run Playwright in Docker locally. For production without DevOps overhead, follow the Steel.dev quickstart to launch a managed browser in minutes. Self-host the Steel Browser on a VM if you want to own infrastructure first and migrate later.
Playwright vs Puppeteer: which should I use for cloud scraping? Playwright is the modern default thanks to multi-browser support and resilient waiting APIs. Puppeteer is lighter for Chrome-only workloads and has a mature ecosystem. Selenium remains useful when you need other programming languages or WebDriver compatibility. Both Playwright and Puppeteer connect to Steel.dev over CDP.
Start scraping with Steel.dev
Steel.dev gives you production-ready browser infrastructure in minutes. Connect your existing Playwright or Puppeteer code, enable managed proxies and CAPTCHA solving, and scale to hundreds of concurrent sessions without managing infrastructure.
Get started: https://docs.steel.dev/overview/quickstart
Self-host first: https://github.com/steel-dev/steel-browser
Examples and recipes: https://github.com/steel-dev/steel-cookbook