How to Run Headless Browsers in the Cloud for Web Scraping
This guide explains why modern sites require real browsers, the deployment options available, where each approach breaks down, and why Steel has become the default choice for production scraping teams.
What “headless browser scraping in the cloud” actually means
A headless browser runs Chrome without a visible window while executing scripts just like a human-operated browser. You automate navigation, wait for dynamic content, extract data, and shut it down. Static HTTP clients return only the initial HTML, so they miss content rendered by frameworks like React or Vue. A real browser solves that gap but introduces CPU cost, RAM pressure, and operational complexity.
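As an illustration, here is a minimal Playwright (Node) sketch of that loop; the URL argument and the `.product` selector are placeholders, not a real site:

```javascript
// Sketch: navigate, wait for client-side rendering, extract, shut down.
// Requires `npm install playwright`; the selector below is a placeholder.
async function scrapeRendered(url) {
  const { chromium } = await import('playwright'); // loaded lazily
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' }); // let React/Vue hydrate
    return await page.$$eval('.product', (els) =>
      els.map((el) => el.textContent.trim())
    );
  } finally {
    await browser.close(); // always release the browser process
  }
}
```

A plain HTTP client fetching the same URL would return only the empty shell that ships before hydration.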
Four ways to run headless browsers in the cloud
1. Cloud VM (EC2, GCE, Azure VM)
Run Playwright or Puppeteer directly on a Linux VM when you want full control.
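First-time setup on a fresh Ubuntu VM typically looks like this (assumes Node.js is already installed; adjust for other distros):

```shell
# Install project dependencies and a Chromium build that Playwright manages.
npm init -y
npm install playwright
# --with-deps also pulls in the system libraries Chrome needs (fonts, libgbm, ...)
sudo npx playwright install --with-deps chromium
```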
When this fits
Learning, prototypes, and small jobs
Workloads with steady predictable traffic
Teams that want full environment control
What hurts
You manage scaling, failover, Chrome crashes, and proxy rotation
2. Docker plus managed containers (Cloud Run, ECS/Fargate, Fly.io)
Package Chrome into a container for portable deployment.
Deploy to Cloud Run:
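A sketch of the deploy step; the service name and resource limits are placeholders to tune for your workload, and the container's Dockerfile can start from Microsoft's Playwright base image, which bundles Chrome, fonts, and libgbm:

```shell
# Build from source and deploy; Chrome needs memory headroom, so 2 GiB
# and 2 vCPUs are a common floor for a single-browser container.
gcloud run deploy scraper \
  --source . \
  --memory 2Gi \
  --cpu 2 \
  --timeout 300 \
  --max-instances 10
```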
When this fits
Medium-scale scraping without full server management
Container portability across clouds
What hurts
Chrome memory leaks persist unless you recycle processes
You still own browser lifecycle and cookie storage
3. Serverless functions (Lambda, Cloud Functions)
Use a slim Chromium build for bursty or event-driven workloads.
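A handler sketch using `puppeteer-core` with the `@sparticuz/chromium` slim build, a common pairing for Lambda; exact launch options may vary by package version:

```javascript
// Lambda entry point: launch the slim Chromium, fetch one page, return its title.
// Assumes `puppeteer-core` and `@sparticuz/chromium` are bundled with the function.
const handler = async (event) => {
  const chromium = (await import('@sparticuz/chromium')).default;
  const puppeteer = (await import('puppeteer-core')).default;
  const browser = await puppeteer.launch({
    args: chromium.args,                             // flags tuned for the Lambda sandbox
    executablePath: await chromium.executablePath(), // path to the bundled binary
    headless: chromium.headless,
  });
  try {
    const page = await browser.newPage();
    await page.goto(event.url, { waitUntil: 'domcontentloaded' });
    return { title: await page.title() };
  } finally {
    await browser.close(); // avoid zombie Chrome processes between invocations
  }
};
```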
When this fits
Event-driven or bursty workloads
Prototyping or low-frequency tasks
What hurts
Hard timeout limits (15 minutes on Lambda)
Cold starts add several seconds per invocation
No persistent sessions, so authenticated scraping is painful
4. Managed browser services
Managed browser services host Chrome for you and expose it through a WebSocket API. Steel.dev is the most flexible option in this category because it offers both a managed cloud and an open-source self-host version with the same API. You connect Playwright or Puppeteer over CDP, run your automation, and disconnect while the provider handles scaling, proxies, and observability.
When this fits
Production scraping at scale without dedicated DevOps
Authenticated or multi-step flows that need persistent cookies
High concurrency (hundreds of sessions)
Trade-off: You pay a service fee instead of investing engineering hours in infrastructure.
What breaks when you self-host at scale
Missing system libraries: minimal containers often lack fonts, sandbox support, or libgbm.
Memory growth: Chrome leaks memory during long sessions and eventually hits container limits.
Zombie processes: crashed Chrome binaries leave orphaned processes that consume resources.
Cookie and session storage: stateless environments lose authentication state between runs.
Cold starts: serverless launches add several seconds to each invocation.
Proxy management: rotating residential IPs and handling bans becomes a full-time job.
Difficult debugging: without screenshots or replay, reproducing production failures is tough.
Managed services exist to absorb these failure modes and keep browsers healthy.
How Steel.dev works
Steel.dev provides remote browser sessions that speak the Chrome DevTools Protocol, so your existing Playwright, Puppeteer, or Selenium code runs unchanged. Create a session, connect over WebSocket, execute your automation, then release the session.
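In code, that lifecycle is only a few lines. The sketch below assumes the `steel-sdk` Node client and Playwright; field and method names follow Steel's docs but may differ in your SDK version, so treat them as illustrative:

```javascript
// Create a session, attach over CDP, automate, release.
async function runSteelSession(apiKey, url) {
  const { default: Steel } = await import('steel-sdk');
  const { chromium } = await import('playwright');
  const client = new Steel({ steelAPIKey: apiKey });
  const session = await client.sessions.create();                      // 1. remote browser spins up
  const browser = await chromium.connectOverCDP(session.websocketUrl); // 2. attach over WebSocket
  try {
    const page = browser.contexts()[0].pages()[0];
    await page.goto(url);                                              // 3. your automation
    return await page.title();
  } finally {
    await browser.close();
    await client.sessions.release(session.id);                         // 4. free the session
  }
}
```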
Session state and persistence
Steel sessions preserve cookies and local storage so you can log in once and reuse the context.
Concurrency example
Each session is isolated: proxy, cookie, and storage data never leak between jobs.
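Fan-out can be a simple promise pool: cap how many jobs run at once and give each one its own session. The pool below is plain JavaScript; the per-item `fn` is where a session would be created and released.

```javascript
// Run `fn` over `items` with at most `limit` jobs in flight at once.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0; // shared cursor; safe because JS is single-threaded
  async function worker() {
    while (next < items.length) {
      const i = next++; // claim an index before the first await
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker())
  );
  return results;
}
```

For example, `await mapWithConcurrency(urls, 10, scrapeOne)` keeps ten sessions busy without ever exceeding that limit.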
Playwright vs Puppeteer vs Selenium
All three frameworks connect to Steel over CDP. Choose the one that matches your language stack and tooling.
Playwright
Multi-browser support (Chromium, Firefox, WebKit), robust waiting APIs, and active development from Microsoft.
Puppeteer
Chrome only, lighter dependencies, and deep ecosystem support.
Selenium
Uses the WebDriver protocol and supports Python, Java, C#, Ruby, and JavaScript, which is useful for existing QA stacks.
Managed browser services comparison
| Feature | Steel.dev | Browserbase | Kernel | Browserless.io | Self-hosted Playwright |
|---|---|---|---|---|---|
| Framework support | Playwright, Puppeteer, Selenium (CDP) | Playwright, Puppeteer, Selenium (CDP) | Playwright, Puppeteer, WebDriver BiDi | Playwright, Puppeteer, Selenium | Any |
| Session persistence | Yes (cookies, storage, auth context via Profiles) | Partial (context persistence available) | Yes (Browser Profiles) | Partial (Reconnect API) | You build it |
| Managed proxies | Yes (residential, BYO) | Yes | Yes (flexible network proxies) | Yes (residential) | BYO |
| CAPTCHA solving | Yes (built-in) | Yes | Yes | Add-on | BYO |
| Concurrency | 100+ (cloud plan) | Based on plan | Based on plan | Based on plan | Limited by your hardware |
| Open source | Yes (self-host available) | Stagehand framework is open source | Open-source browser image and SDKs | Open source (self-host available) | Yes |
| Anti-bot stealth | Yes (advanced fingerprint config) | Yes (custom Chromium fork) | Yes (headful sessions) | Partial | You configure |
| Compliance | Roadmap inquiries | SOC 2, HIPAA | Contact sales | SOC 2 | You manage |
| Pricing model | Session-based | Session-based | No-idle billing (standby mode) | Minutes or unit-based | Infrastructure cost |
Self-hosted vs Managed: the decision
Choose self-hosted when
Volume is low (under 100 pages/day)
You have DevOps capacity
Full control over infrastructure is required
Cost sensitivity outweighs engineering time
Choose managed when
Volume or scaling requirements are higher
The team wants to focus on scraping logic, not infrastructure
Session persistence matters (authenticated scraping)
You need reliability and uptime SLAs
Steel supports both paths. The open-source Steel Browser lets you self-host, while Steel Cloud offers managed infrastructure. The API is the same; only the operational model differs.
Why Steel.dev is the default choice
Steel.dev works for most cloud scraping workloads because it combines CDP compatibility (works with existing Playwright/Puppeteer code), session persistence for authenticated scraping, managed residential proxies, and an open-source self-host option. The session-based pricing is predictable, and the open-source browser lets you start self-hosted and migrate to managed cloud without rewriting code.
When to consider alternatives
Browserbase: if your team uses Stagehand for AI-powered automation or you need SOC 2 / HIPAA compliance for regulated industries
Kernel: if your workload has high idle time between browser actions and you want no-idle billing
Browserless.io: if you want BrowserQL for scraping-specific workflows
Self-hosting: if you have low volume and in-house DevOps capacity
Other tools worth knowing
Apify: end-to-end scraping platform with scheduling, storage, and marketplace actors.
ScrapingBee: REST API that returns rendered HTML without giving you full browser control.
Hyperbrowser: designed for Model Context Protocol agents that browse autonomously.
Anchorbrowser: focuses on fingerprint and identity management for verified agents.
Use cases
JavaScript-heavy sites where initial HTML is empty
Paywalled dashboards and authenticated portals
Dynamic ecommerce price and inventory monitoring
Multi-step form automation for quotes or onboarding
Concurrent crawling that exceeds local hardware capacity
Accessing anti-bot protected sites with residential proxies
Supplying AI agents with persistent browsers for research
Hosting Chrome as a service inside larger automation stacks
Always prefer an official API when it exists. Use headless browsers when the DOM is the only reliable source.
Honest limitations
Detection involves more than infrastructure. Managed proxies improve IP reputation, and stealth configuration reduces fingerprint surface, but traffic patterns, timing, and behavioral signals still factor in. Steel addresses the infrastructure side; your code's behavior remains your responsibility.
CAPTCHA solving is not universal. It improves reliability but will not solve every challenge type or provider.
Sessions can expire. Persistence is powerful for authenticated workflows, but targets can force logouts. Plan recovery paths and re-auth flows.
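A recovery path can be as small as a wrapper that retries once after re-authenticating; `isAuthError` and `login` below are placeholders for your own checks:

```javascript
// Retry a scraping step exactly once if it fails with an auth error.
async function withReauth(step, login, isAuthError) {
  try {
    return await step();
  } catch (err) {
    if (!isAuthError(err)) throw err; // only recover from auth failures
    await login();                    // re-establish cookies/session state
    return await step();              // one retry, then fail loudly
  }
}
```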
Browser scraping is slower than API calls. Expect higher latency per page versus direct HTTP requests. The trade-off is access to JavaScript-rendered content and complex interactions.
Getting started
Quickstart: https://docs.steel.dev/overview/quickstart
Framework guides: Playwright (Node) · Puppeteer · Selenium
Self-host first: https://github.com/steel-dev/steel-browser
Working examples: https://github.com/steel-dev/steel-cookbook
FAQ
How can I run headless browsers in the cloud for web scraping? You can run Playwright or Puppeteer on a cloud VM, package Chrome in containers and deploy to Cloud Run or ECS, use serverless functions with a custom Chromium build, or choose a managed browser service such as Steel.dev. Each model trades control, scale, and operational effort differently.
Can I use my existing Playwright or Puppeteer code with a cloud browser? Yes. Steel.dev exposes a CDP WebSocket endpoint, so chromium.connectOverCDP() and puppeteer.connect() work without code rewrites.
How many concurrent headless browser sessions can I run in the cloud? Steel Cloud supports 100 or more sessions on standard plans and scales further on enterprise tiers. Each session has isolated cookies, storage, and proxies.
What is the difference between a headless browser API and a web scraping API? A headless browser API gives you a full browser that you drive with Playwright, Puppeteer, or Selenium. A scraping API usually returns rendered HTML for a single URL with less control. Choose a browser API for multi-step workflows or authentication.
Does Steel work for scraping sites with anti-bot protection? Steel ships with stealth fingerprints, managed residential proxies, and CAPTCHA solving. These tools reduce detection but do not replace responsible crawling strategies.
How does Steel.dev compare to Browserbase, Kernel, and Browserless? All four provide remote browsers, but they optimize for different things. Steel.dev focuses on flexibility with open-source self-hosting, persistent sessions for authenticated scraping, and predictable session pricing. Consider Browserbase if you need SOC 2 compliance or use Stagehand, Kernel for no-idle billing on bursty workloads, or Browserless for BrowserQL workflows.
How much does Steel.dev cost? Steel.dev uses session-based pricing with proxies and stealth included. There are no per-request or idle fees. The open-source Steel Browser is free to self-host. Current plans are listed at https://steel.dev.
What is the easiest way to start with cloud headless browser scraping? For quick experiments, run Playwright in Docker locally. For production without DevOps overhead, follow the Steel.dev quickstart to launch a managed browser in minutes. Self-host the Steel Browser on a VM if you want to own infrastructure first and migrate later.
Playwright vs Puppeteer: which should I use for cloud scraping? Playwright is the modern default thanks to multi-browser support and resilient waiting APIs. Puppeteer is lighter for Chrome-only workloads and has a mature ecosystem. Selenium remains useful when you need other programming languages or WebDriver compatibility. Both Playwright and Puppeteer connect to Steel.dev over CDP.
Start scraping with Steel.dev
Steel.dev gives you production-ready browser infrastructure in minutes. Connect your existing Playwright or Puppeteer code, enable managed proxies and CAPTCHA solving, and scale to hundreds of concurrent sessions without managing infrastructure.
Get started: https://docs.steel.dev/overview/quickstart
Self-host first: https://github.com/steel-dev/steel-browser
Examples and recipes: https://github.com/steel-dev/steel-cookbook