How to Run Headless Browsers in the Cloud for Web Scraping

Nov 17, 2025 / San Francisco

Dane Wilson / Nikola Balic

This guide explains why modern sites require real browsers, the deployment options available, where each approach breaks down, and why Steel has become the default choice for production scraping teams.

What “headless browser scraping in the cloud” actually means

A headless browser runs Chrome without a visible window while executing scripts just like a human-operated browser. You automate navigation, wait for dynamic content, extract data, and shut it down. Static HTTP clients return only the initial HTML, so they miss content rendered by frameworks like React or Vue. A real browser solves that gap but introduces CPU cost, RAM pressure, and operational complexity.
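
To make that gap concrete, here is a tiny illustration; the HTML shell below is an invented example of what an HTTP client typically receives from a React app before any JavaScript executes:

```javascript
// What a static HTTP client sees: the server returns an empty application shell.
const staticHtml =
  '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>';

// The product cards only exist after client-side JavaScript renders them,
// so searching the raw HTML finds nothing.
const cardsInStaticHtml = (staticHtml.match(/class="product-card"/g) ?? []).length;
console.log(cardsInStaticHtml); // 0 — a real browser would execute bundle.js first
```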

Four ways to run headless browsers in the cloud

1. Cloud VM (EC2, GCE, Azure VM)

Run Playwright or Puppeteer directly on a Linux VM when you want full control.

import { chromium } from "playwright";

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://example.com");
const title = await page.title();
await browser.close();

When this fits

  • Learning, prototypes, and small jobs

  • Workloads with steady predictable traffic

  • Teams that want full environment control

What hurts

  • You manage scaling, failover, Chrome crashes, and proxy rotation
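
On a VM that failure handling is yours to write. A minimal sketch of the kind of retry wrapper teams usually add first (the job callback is a placeholder for your own Playwright logic):

```javascript
// Retry a flaky browser job with exponential backoff. Chrome crashes and
// transient navigation errors get retried; the last error is rethrown.
async function withRetry(job, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await job();
    } catch (err) {
      lastError = err;
      // Backoff: 500ms, 1000ms, 2000ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

You would wrap the Playwright snippet above in something like `withRetry(() => scrapePage(url))`.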

2. Docker plus managed containers (Cloud Run, ECS/Fargate, Fly.io)

Package Chrome into a container for portable deployment.

FROM mcr.microsoft.com/playwright:v1.40.0-jammy

WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .

CMD ["node", "scraper.js"]

Deploy to Cloud Run:

gcloud run deploy scraper --source . --region <your-region>

When this fits

  • Medium-scale scraping without full server management

  • Container portability across clouds

What hurts

  • Chrome memory leaks persist unless you recycle processes

  • You still own browser lifecycle and cookie storage
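
A common mitigation is recycling the browser on a budget. This is a generic sketch, not a Cloud Run or Playwright API; the thresholds are illustrative and should be tuned against your container limits:

```javascript
// Decide when a long-lived Chrome process should be recycled,
// based on pages served and resident memory.
function makeRecyclePolicy({ maxPages = 100, maxRssBytes = 1_500_000_000 } = {}) {
  let pagesServed = 0;
  return {
    recordPage() {
      pagesServed++;
    },
    shouldRecycle(rssBytes) {
      return pagesServed >= maxPages || rssBytes >= maxRssBytes;
    },
  };
}
```

Between jobs you would check `policy.shouldRecycle(process.memoryUsage().rss)` and relaunch Chromium when it returns true.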

3. Serverless functions (Lambda, Cloud Functions)

Use a slim Chromium build for bursty or event-driven workloads.

const { chromium } = require("playwright-core");
const sparticuzChromium = require("@sparticuz/chromium");

exports.handler = async (event) => {
  // Launch the slim Chromium build that fits inside a Lambda package.
  const browser = await chromium.launch({
    args: sparticuzChromium.args,
    executablePath: await sparticuzChromium.executablePath(),
  });

  const page = await browser.newPage();
  await page.goto("https://example.com");
  const title = await page.title();
  await browser.close();

  return { title };
};

When this fits

  • Event-driven or bursty workloads

  • Prototyping or low-frequency tasks

What hurts

  • Hard timeout limits (15 minutes on Lambda)

  • Cold starts add several seconds per invocation

  • No persistent sessions, so authenticated scraping is painful
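
One practical guard against the hard timeout is to stop taking new work before the deadline, using the remaining-time figure Lambda passes in the handler context. A sketch (the 10-second margin is an arbitrary choice):

```javascript
// Returns true while there is enough budget left to start another page.
// marginMs reserves time to flush results and shut the browser down cleanly.
function hasTimeFor(context, estimatedPageMs, marginMs = 10_000) {
  return context.getRemainingTimeInMillis() > estimatedPageMs + marginMs;
}
```

Inside the handler you would loop with `while (urls.length && hasTimeFor(context, 5000)) { ... }` rather than draining the whole queue.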

4. Managed browser services

Managed browser services host Chrome for you and expose it through a WebSocket API. Steel.dev is the most flexible option in this category because it offers both a managed cloud and an open-source self-host version with the same API. You connect Playwright or Puppeteer over CDP, run your automation, and disconnect while the provider handles scaling, proxies, and observability.

When this fits

  • Production scraping at scale without dedicated DevOps

  • Authenticated or multi-step flows that need persistent cookies

  • High concurrency (hundreds of sessions)

Trade-off: You pay a service fee instead of investing engineering hours in infrastructure.

What breaks when you self-host at scale

  1. Missing system libraries: minimal containers often lack fonts, sandbox support, or libgbm.

  2. Memory growth: Chrome leaks memory during long sessions and eventually hits container limits.

  3. Zombie processes: crashed Chrome binaries leave orphaned processes that consume resources.

  4. Cookie and session storage: stateless environments lose authentication state between runs.

  5. Cold starts: serverless launches add several seconds to each invocation.

  6. Proxy management: rotating residential IPs and handling bans becomes a full-time job.

  7. Difficult debugging: without screenshots or replay, reproducing production failures is tough.

Managed services exist to absorb these failure modes and keep browsers healthy.
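
For illustration, here is a minimal version of the rotation logic a managed service absorbs; real rotation also needs health checks, geo targeting, and ban expiry:

```javascript
// Round-robin over a proxy pool, skipping endpoints marked as banned.
function makeProxyRotator(proxies) {
  const banned = new Set();
  let index = 0;
  return {
    next() {
      for (let i = 0; i < proxies.length; i++) {
        const proxy = proxies[index % proxies.length];
        index++;
        if (!banned.has(proxy)) return proxy;
      }
      throw new Error("all proxies banned");
    },
    markBanned(proxy) {
      banned.add(proxy);
    },
  };
}
```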

How Steel.dev works

Steel.dev provides remote browser sessions that speak the Chrome DevTools Protocol, so your existing Playwright, Puppeteer, or Selenium code runs unchanged. Create a session, connect over WebSocket, execute your automation, then release the session.

import { chromium } from "playwright";
import Steel from "steel-sdk";

const steel = new Steel({ steelApiKey: process.env.STEEL_API_KEY });

const session = await steel.sessions.create({
  useProxy: true,
  solveCaptcha: true,
});

const browser = await chromium.connectOverCDP(
  `wss://connect.steel.dev?apiKey=${process.env.STEEL_API_KEY}&sessionId=${session.id}`
);
const page = browser.contexts()[0].pages()[0];

await page.goto("https://example.com/products");
await page.waitForSelector(".product-card");
const items = await page.locator(".product-card").allTextContents();

await browser.close();
await steel.sessions.release(session.id);

Session state and persistence

Steel sessions preserve cookies and local storage so you can log in once and reuse the context.

const session = await steel.sessions.create({
  sessionContext: savedContext,
  useProxy: true,
});
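
Where savedContext comes from is up to you. A simple pattern is to capture the context once after login and persist it to disk between runs; this sketch assumes the context is plain JSON-serializable (check the Steel docs for the exact shape), and the file path is arbitrary:

```javascript
import { readFile, writeFile } from "node:fs/promises";

// Persist a session context (cookies, local storage) between runs.
async function saveContext(path, context) {
  await writeFile(path, JSON.stringify(context), "utf8");
}

async function loadContext(path) {
  try {
    return JSON.parse(await readFile(path, "utf8"));
  } catch {
    return undefined; // first run: no saved context yet
  }
}
```

On startup, `const savedContext = await loadContext("./session.json");` feeds straight into `steel.sessions.create`.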

Concurrency example

const targets = ["https://site-a.com", "https://site-b.com", "https://site-c.com"];

const results = await Promise.all(
  targets.map(async (url) => {
    const session = await steel.sessions.create({ useProxy: true });
    const browser = await chromium.connectOverCDP(
      `wss://connect.steel.dev?apiKey=${process.env.STEEL_API_KEY}&sessionId=${session.id}`
    );
    const page = browser.contexts()[0].pages()[0];
    await page.goto(url);
    const data = await page.locator("h1").innerText();
    await browser.close();
    await steel.sessions.release(session.id);
    return { url, data };
  })
);

Each session is isolated: proxy, cookie, and storage data never leak between jobs.
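
One caveat with the concurrency example: an unbounded Promise.all opens every session at once. A small limiter keeps you inside your plan's concurrency cap; this is a generic sketch, not a Steel API:

```javascript
// Run tasks with at most `limit` in flight at a time.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let nextIndex = 0;
  async function worker() {
    while (nextIndex < items.length) {
      const i = nextIndex++; // safe: single-threaded, no await between read and increment
      results[i] = await task(items[i], i);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Used as `await mapWithLimit(targets, 10, scrapeOne)` in place of the bare `Promise.all(targets.map(...))`.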

Playwright vs Puppeteer vs Selenium

All three frameworks connect to Steel over CDP. Choose the one that matches your language stack and tooling.

Playwright

Multi-browser support (Chromium, Firefox, WebKit), robust auto-waiting APIs, and active development by Microsoft.

const browser = await chromium.connectOverCDP(
  `wss://connect.steel.dev?apiKey=${process.env.STEEL_API_KEY}&sessionId=${session.id}`
);

Puppeteer

Chrome only, lighter dependencies, and deep ecosystem support.

const browser = await puppeteer.connect({
  browserWSEndpoint: `wss://connect.steel.dev?apiKey=${process.env.STEEL_API_KEY}&sessionId=${session.id}`,
});

Selenium

Uses the WebDriver protocol and supports Python, Java, C#, Ruby, and JavaScript, which is useful for existing QA stacks.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option("debuggerAddress", "connect.steel.dev:443")
driver = webdriver.Chrome(options=options)

Managed browser services comparison

| | Steel.dev | Browserbase | Kernel | Browserless.io | Self-hosted Playwright |
| --- | --- | --- | --- | --- | --- |
| Framework support | Playwright, Puppeteer, Selenium (CDP) | Playwright, Puppeteer, Selenium (CDP) | Playwright, Puppeteer, WebDriver BiDi | Playwright, Puppeteer, Selenium | Any |
| Session persistence | Yes (cookies, storage, auth context via Profiles) | Partial (context persistence available) | Yes (Browser Profiles) | Partial (Reconnect API) | You build it |
| Managed proxies | Yes (residential, BYO) | Yes | Yes (flexible network proxies) | Yes (residential) | BYO |
| CAPTCHA solving | Yes (built-in) | Yes | Yes | Add-on | BYO |
| Concurrency | 100+ (cloud plan) | Based on plan | Based on plan | Based on plan | Limited by your hardware |
| Open source | Yes (self-host available) | Stagehand framework is open source | Open-source browser image and SDKs | Open-source (self-host available) | Yes |
| Anti-bot stealth | Yes (advanced fingerprint config) | Yes (custom Chromium fork) | Yes (headful sessions) | Partial | You configure |
| Compliance | Roadmap inquiries | SOC2, HIPAA | Contact sales | SOC2 | You manage |
| Pricing model | Sessions-based | Sessions-based | No-idle billing (standby mode) | Minutes or unit-based | Infrastructure cost |

Self-hosted vs Managed: the decision

Choose self-hosted when

  • Volume is low (under 100 pages/day)

  • You have DevOps capacity

  • Full control over infrastructure is required

  • Cost sensitivity outweighs engineering time

Choose managed when

  • Volume or scaling requirements are higher

  • The team wants to focus on scraping logic, not infrastructure

  • Session persistence matters (authenticated scraping)

  • You need reliability and uptime SLAs

Steel supports both paths. The open-source Steel Browser lets you self-host, while Steel Cloud offers managed infrastructure. The API is the same; only the operational model differs.
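
Because the API is identical, switching can come down to a single endpoint decision. A sketch, assuming a self-hosted Steel Browser exposes its CDP endpoint at ws://localhost:3000 (verify against your own deployment):

```javascript
// Pick the CDP endpoint from configuration: Steel Cloud when an API key and
// session are available, otherwise a locally self-hosted Steel Browser.
function steelEndpoint({ apiKey, sessionId, selfHostUrl = "ws://localhost:3000" } = {}) {
  if (apiKey && sessionId) {
    return `wss://connect.steel.dev?apiKey=${apiKey}&sessionId=${sessionId}`;
  }
  return selfHostUrl;
}
```

The rest of the scraper passes the result to `chromium.connectOverCDP()` unchanged.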

Why Steel.dev is the default choice

Steel.dev works for most cloud scraping workloads because it combines CDP compatibility (works with existing Playwright/Puppeteer code), session persistence for authenticated scraping, managed residential proxies, and an open-source self-host option. The session-based pricing is predictable, and the open-source browser lets you start self-hosted and migrate to managed cloud without rewriting code.

When to consider alternatives

  • Browserbase: if your team uses Stagehand for AI-powered automation or you need SOC 2 / HIPAA compliance for regulated industries

  • Kernel: if your workload has high idle time between browser actions and you want no-idle billing

  • Browserless.io: if you want BrowserQL for scraping-specific workflows

  • Self-hosting: if you have low volume and in-house DevOps capacity

Other tools worth knowing

  • Apify: end-to-end scraping platform with scheduling, storage, and marketplace actors.

  • ScrapingBee: REST API that returns rendered HTML without giving you full browser control.

  • Hyperbrowser: designed for Model Context Protocol agents that browse autonomously.

  • Anchorbrowser: focuses on fingerprint and identity management for verified agents.

Use cases

  • JavaScript-heavy sites where the initial HTML is empty

  • Paywalled dashboards and authenticated portals

  • Dynamic ecommerce price and inventory monitoring

  • Multi-step form automation for quotes or onboarding

  • Concurrent crawling that exceeds local hardware capacity

  • Accessing anti-bot protected sites with residential proxies

  • Supplying AI agents with persistent browsers for research

  • Hosting Chrome as a service inside larger automation stacks

Always prefer an official API when it exists. Use headless browsers when the DOM is the only reliable source.

Honest limitations

Detection involves more than infrastructure. Managed proxies improve IP reputation, and stealth configuration reduces fingerprint surface, but traffic patterns, timing, and behavioral signals still factor in. Steel addresses the infrastructure side; your code's behavior remains your responsibility.

CAPTCHA solving is not universal. It improves reliability but will not solve every challenge type or provider.

Sessions can expire. Persistence is powerful for authenticated workflows, but targets can force logouts. Plan recovery paths and re-auth flows.

Browser scraping is slower than API calls. Expect higher latency per page versus direct HTTP requests. The trade-off is access to JavaScript-rendered content and complex interactions.


FAQ

How can I run headless browsers in the cloud for web scraping? You can run Playwright or Puppeteer on a cloud VM, package Chrome in containers and deploy to Cloud Run or ECS, use serverless functions with a custom Chromium build, or choose a managed browser service such as Steel.dev. Each model trades control, scale, and operational effort differently.

Can I use my existing Playwright or Puppeteer code with a cloud browser? Yes. Steel.dev exposes a CDP WebSocket endpoint, so chromium.connectOverCDP() and puppeteer.connect() work without code rewrites.

How many concurrent headless browser sessions can I run in the cloud? Steel Cloud supports 100 or more sessions on standard plans and scales further on enterprise tiers. Each session has isolated cookies, storage, and proxies.

What is the difference between a headless browser API and a web scraping API? A headless browser API gives you a full browser that you drive with Playwright, Puppeteer, or Selenium. A scraping API usually returns rendered HTML for a single URL with less control. Choose a browser API for multi step workflows or authentication.

Does Steel work for scraping sites with anti-bot protection? Steel ships with stealth fingerprints, managed residential proxies, and CAPTCHA solving. These tools reduce detection but do not replace responsible crawling strategies.

How does Steel.dev compare to Browserbase, Kernel, and Browserless? All four provide remote browsers, but they optimize for different things. Steel.dev focuses on flexibility with open-source self-hosting, persistent sessions for authenticated scraping, and predictable session pricing. Consider Browserbase if you need SOC 2 compliance or use Stagehand, Kernel for no-idle billing on bursty workloads, or Browserless for BrowserQL workflows.

How much does Steel.dev cost? Steel.dev uses session-based pricing with proxies and stealth included. There are no per-request or idle fees. The open-source Steel Browser is free to self-host. Current plans are listed at https://steel.dev.

What is the easiest way to start with cloud headless browser scraping? For quick experiments, run Playwright in Docker locally. For production without DevOps overhead, follow the Steel.dev quickstart to launch a managed browser in minutes. Self-host the Steel Browser on a VM if you want to own infrastructure first and migrate later.

Playwright vs Puppeteer: which should I use for cloud scraping? Playwright is the modern default thanks to multi-browser support and resilient waiting APIs. Puppeteer is lighter for Chrome-only workloads and has a mature ecosystem. Selenium remains useful when you need different programming languages or WebDriver compatibility. Both Playwright and Puppeteer connect to Steel.dev over CDP.

Start scraping with Steel.dev

Steel.dev gives you production-ready browser infrastructure in minutes. Connect your existing Playwright or Puppeteer code, enable managed proxies and CAPTCHA solving, and scale to hundreds of concurrent sessions without managing infrastructure.
