
Building a Stealth Scraping Infrastructure That Works

Empirium Team · 12 min read

Most scraping tutorials show you how to extract data from a single page. Real operations need to extract data from millions of pages, consistently, without getting blocked, for months or years.

The difference between a scraping script and a scraping infrastructure is the difference between a campfire and a power plant. This guide covers the architecture, the detection avoidance, the error handling, and the operational discipline that makes large-scale scraping sustainable.

The Architecture of Production Scraping

Production scraping infrastructure has five distinct layers. Most operators build the first two and wonder why everything breaks at scale.

Layer 1: Queue Management

Every URL to scrape enters a queue. The queue manages priority, deduplication, retry scheduling, and rate limiting per domain.

URL Source → Queue (Redis/RabbitMQ) → Worker Pool → Proxy Pool → Target
                ↓                         ↓            ↓
            Scheduler              Error Handler   Health Checker
                ↓                         ↓
            Retry Queue              Alert System

Redis works for queues up to 10M items. Beyond that, consider RabbitMQ or Kafka for durability and back-pressure management. The queue must support the following (a sketch appears after the list):

  • Priority levels. High-value targets get scraped first. Retry items get lower priority than fresh items.
  • Domain-based rate limiting. No more than N requests per minute to any single domain, enforced at the queue level.
  • Deduplication. The same URL shouldn't be scraped twice within a configurable window.
  • Dead letter queue. URLs that fail after max retries move to a dead letter queue for manual review.
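
A minimal sketch of this queue contract on Redis, assuming a local instance and the redis-py client; the key names, priority scheme, and 24-hour dedup window are illustrative choices, not a standard:

import redis

r = redis.Redis()

def enqueue(url, priority=100):
    # Deduplication: skip URLs already seen within the window (24h TTL here).
    if not r.set(f"seen:{url}", 1, nx=True, ex=86400):
        return False
    # Sorted set as priority queue: lower score = scraped sooner.
    r.zadd("queue:urls", {url: priority})
    return True

def dequeue():
    # Pop the highest-priority URL (lowest score), or None if the queue is empty.
    popped = r.zpopmin("queue:urls", 1)
    return popped[0][0].decode() if popped else None

def retry(url, attempt, max_retries=3):
    if attempt >= max_retries:
        r.rpush("queue:dead_letter", url)                # park for manual review
    else:
        r.zadd("queue:urls", {url: 100 + attempt * 50})  # retries sink in priority

Per-domain rate limiting would sit in front of dequeue, holding back URLs whose domain has exhausted its per-minute budget.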

Layer 2: Worker Pool

Workers pull URLs from the queue, route them through the proxy pool, make the request, parse the response, and push extracted data to the data pipeline.

Workers should be stateless and horizontally scalable. Run them in containers (Docker) orchestrated by Kubernetes or a simpler alternative (Docker Compose for small operations, Nomad for medium).

Each worker runs in one of two modes (a worker-loop sketch follows):

HTTP mode — direct HTTP requests using a library like curl-impersonate or a TLS-configured HTTP client. Faster, cheaper, and appropriate for targets that don't require JavaScript rendering. Handles 50-200 requests/second per worker.

Browser mode — headless Chrome/Firefox rendering JavaScript-heavy pages. Slower, more resource-intensive (each browser instance uses 200-500MB RAM), but necessary for SPAs and sites that require JS execution. Handles 2-10 pages/second per worker.
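
A minimal HTTP-mode worker loop tying the layers together. The requests library is real; dequeue and retry come from the queue sketch above, while proxy_pool and pipeline are hypothetical stand-ins for Layers 3 and 4:

import time
import requests

ATTEMPTS = {}  # url -> retry count; production keeps this in the queue itself

def worker_loop(proxy_pool, pipeline):
    # Stateless apart from the in-flight request: pull, fetch, hand off.
    while True:
        url = dequeue()                       # queue sketch from Layer 1
        if url is None:
            time.sleep(1)                     # queue drained; idle briefly
            continue
        proxy = proxy_pool.lease(url)         # hypothetical: returns "http://host:port"
        try:
            resp = requests.get(url, timeout=30,
                                proxies={"http": proxy, "https": proxy})
            resp.raise_for_status()
            pipeline.push(url, resp.text)     # hypothetical hand-off to Layer 4
        except requests.RequestException:
            attempt = ATTEMPTS.get(url, 0) + 1
            ATTEMPTS[url] = attempt
            retry(url, attempt)               # re-queue or dead-letter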

Layer 3: Proxy Pool

The proxy layer is covered in detail in our IP rotation guide and proxy comparison. For scraping specifically (a pool sketch follows the list):

  • Pool size minimum: 5x your concurrent worker count. If you run 50 workers, maintain 250+ proxies.
  • Per-domain proxy affinity. Assign proxy subsets to specific domains to avoid cross-contamination of IP reputation.
  • Automatic blacklist detection. When a proxy starts returning CAPTCHAs or blocks on a specific domain, remove it from that domain's pool immediately.
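
A sketch of per-domain affinity with automatic blacklisting; the subset assignment and ban bookkeeping are illustrative:

import random
from urllib.parse import urlparse

class ProxyPool:
    def __init__(self, proxies, subset_size=25):
        self.proxies = proxies
        self.subset_size = subset_size
        self.banned = {}   # domain -> proxies banned for that domain only

    def _subset(self, domain):
        # Deterministic per-domain slice isolates IP reputation. Note that
        # Python's hash() is salted per process; use a stable hash in production.
        start = hash(domain) % max(1, len(self.proxies) - self.subset_size)
        return self.proxies[start:start + self.subset_size]

    def lease(self, url):
        domain = urlparse(url).netloc
        candidates = [p for p in self._subset(domain)
                      if p not in self.banned.get(domain, set())]
        if not candidates:
            raise RuntimeError(f"all proxies exhausted for {domain}")
        return random.choice(candidates)

    def report_block(self, url, proxy):
        # CAPTCHA or 403 on this domain: pull the proxy from this domain's pool.
        domain = urlparse(url).netloc
        self.banned.setdefault(domain, set()).add(proxy)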

Layer 4: Data Pipeline

Raw scraped data goes through: parsing → validation → normalization → deduplication → storage.

Parsing extracts structured data from HTML/JSON. Use CSS selectors or XPath for HTML; JSONPath for APIs. Define parsers as configuration (not code) so they can be updated without redeployment when target site structures change.
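
One way to express parsers as configuration, sketched with BeautifulSoup; the selectors are placeholders and would live in JSON or YAML so a site change means a config update, not a deploy:

from bs4 import BeautifulSoup

# Loaded from a config file in production; selectors here are placeholders.
PARSER_CONFIG = {
    "name": "h1.product-title",
    "price": "span.price",
    "rating": "div.rating > span",
}

def parse(html, config=PARSER_CONFIG):
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in config.items():
        node = soup.select_one(selector)
        record[field] = node.get_text(strip=True) if node else None
    return record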

Validation ensures extracted data meets expected schemas. A price field should be numeric. A product name should be non-empty. A URL should be valid. Invalid data gets flagged for parser review.
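
Schema enforcement is one job pydantic handles well; a sketch whose field names mirror the parser config above:

from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    name: str = Field(min_length=1)   # product name must be non-empty
    price: float                      # "12.99" coerces; "N/A" fails validation
    rating: float | None = None       # optional field

def validate(record):
    try:
        return Product(**record), None
    except ValidationError as err:
        return None, str(err)         # flag for parser review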

Storage: PostgreSQL for structured data under 100M rows. ClickHouse or BigQuery for larger analytical datasets. S3-compatible storage for raw HTML archival (useful for debugging parser issues).

Layer 5: Monitoring

The monitoring layer tracks: request success rates, block rates per domain, proxy health, data quality metrics, throughput, and cost per successful extraction.

Alert thresholds (a check sketch follows the list):

  • Block rate exceeds 20% on any domain → reduce rate, rotate proxies
  • Success rate drops below 70% → parser may be broken (site structure change)
  • Proxy health drops below 80% → provider issue, switch pools
  • Data quality score drops → field extraction broken, review parsers
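
These thresholds as a single check, a sketch assuming per-domain metrics are gathered elsewhere in the monitoring layer:

def check_alerts(m):
    # m: per-domain metrics dict, e.g. {"block_rate": 0.25, ...} (hypothetical)
    alerts = []
    if m["block_rate"] > 0.20:
        alerts.append("block rate high: reduce rate, rotate proxies")
    if m["success_rate"] < 0.70:
        alerts.append("success rate low: parser may be broken")
    if m["proxy_health"] < 0.80:
        alerts.append("proxy health low: provider issue, switch pools")
    return alerts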

Request Pattern Design

Detection systems look for patterns that distinguish bots from humans. The goal is making your request patterns statistically indistinguishable from organic traffic.

Rate Limiting

Don't send requests at a constant rate. Real users don't browse at exactly 1 page every 3 seconds. Implement variable delays:

import random
import time

def human_delay(base_seconds=2.0):
    # Gaussian jitter with a standard deviation of 30% of the base delay.
    jitter = random.gauss(0, base_seconds * 0.3)
    # Floor at 0.5s so the delay never collapses to zero or goes negative.
    delay = max(0.5, base_seconds + jitter)
    time.sleep(delay)

Per-domain limits: 10-30 requests/minute for aggressive scraping, 2-5 requests/minute for stealth operations. Vary by target — Amazon tolerates less than a blog.
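
Per-domain enforcement can sit on top of the same jitter; a single-process sketch using in-memory timestamps (across multiple workers, this belongs in the queue layer):

import time
from collections import defaultdict

LAST_REQUEST = defaultdict(float)   # domain -> monotonic time of last request

def wait_for_slot(domain, per_minute=10):
    # Enforce a minimum gap of 60/per_minute seconds per domain, then jitter.
    min_gap = 60.0 / per_minute
    elapsed = time.monotonic() - LAST_REQUEST[domain]
    if elapsed < min_gap:
        time.sleep(min_gap - elapsed)
    human_delay(base_seconds=0.5)   # jittered think-time from the snippet above
    LAST_REQUEST[domain] = time.monotonic()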

Session Management

Real users maintain sessions. They don't send isolated requests — they browse in sequences. Model this (a session sketch follows the list):

  1. Start with a homepage or search page visit
  2. Navigate through 3-7 pages in a session
  3. Maintain cookies and referrer headers across the session
  4. End sessions naturally (idle timeout, not abrupt disconnect)
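
A session sketch with requests; the user agent and the pick_next_url callable are placeholders, and a real run would draw navigation targets from links in the parsed pages:

import random
import requests

def browse_session(start_url, pick_next_url):
    s = requests.Session()                    # cookies persist across requests
    s.headers.update({"User-Agent": "Mozilla/5.0 ..."})  # full profile in practice
    url, referer = start_url, None
    for _ in range(random.randint(3, 7)):     # 3-7 pages per session
        headers = {"Referer": referer} if referer else {}
        resp = s.get(url, headers=headers, timeout=30)
        human_delay()                         # jittered think-time between pages
        referer, url = url, pick_next_url(resp.text)
        if url is None:
            break                             # end naturally when no next link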

Referrer Chains

Every request should have a plausible referrer. A product page request without a prior category page visit is suspicious. Build referrer chains that match natural navigation:

google.com/search?q=... → target.com/search?q=... → target.com/product/123

Header Completeness

Real browsers send 15-25 headers per request. Bots send 3-5. Match your headers to a real browser profile:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Ch-Ua: "Chromium";v="120", "Google Chrome";v="120"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"

Missing Sec-Fetch-* headers immediately identify non-browser clients. Missing Sec-Ch-Ua headers flag non-Chrome clients.

Error Handling and Recovery

At scale, errors aren't exceptions — they're constant background noise. The infrastructure must handle them without human intervention.

CAPTCHA Strategy

When a CAPTCHA appears:

  1. Backoff. The CAPTCHA means detection is triggered. Continuing from the same proxy intensifies detection.
  2. Rotate proxy. Switch to a different IP for the next attempt.
  3. Solve if necessary. CAPTCHA solving services (2Captcha, CapSolver, AntiCaptcha) cost $1-3 per 1000 CAPTCHAs. Only use them when the data value justifies the cost.
  4. Track CAPTCHA rates. If CAPTCHA rates exceed 10%, your request patterns need adjustment — solving CAPTCHAs is treating symptoms, not the disease.

Block Recovery

When requests return 403, 429, or connection resets (a handler sketch follows the list):

  • 429 (Rate Limited): Respect the Retry-After header. If no header, exponential backoff with 30-second base.
  • 403 (Forbidden): Proxy is likely blacklisted for this domain. Rotate to a new proxy. Don't retry on the same proxy.
  • Connection reset: Could be network issue or active blocking. Retry once on same proxy, then rotate.
  • Soft blocks: Pages that load but show different content (no prices, "please log in" prompts, degraded data). These are harder to detect programmatically — validate extracted data against expected schemas.
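
The status-code rules as a dispatch sketch; report_block matches the proxy-pool sketch above, and connection resets surface as exceptions handled in the worker loop rather than here:

import time

def handle_response(url, resp, proxy, proxy_pool, attempt):
    if resp.status_code == 429:
        # Respect Retry-After when present (numeric form assumed; it can also
        # be an HTTP date); otherwise exponential backoff on a 30-second base.
        wait = int(resp.headers.get("Retry-After", 30 * 2 ** attempt))
        time.sleep(wait)
        return "retry_same_proxy"
    if resp.status_code == 403:
        proxy_pool.report_block(url, proxy)   # likely blacklisted for this domain
        return "retry_new_proxy"
    return "ok"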

Cascading Failure Prevention

One blocked domain shouldn't cascade to block everything. Implement circuit breakers per domain (sketched after the list):

  • If a domain blocks 50% of requests in 5 minutes → pause that domain for 30 minutes
  • If a domain blocks 80% of requests → pause for 2 hours and rotate entire proxy subset
  • If all proxies fail on a domain → alert for manual investigation (site may have changed detection, structure, or gone down)
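
A per-domain circuit breaker sketch over a five-minute sliding window; the thresholds come straight from the list above, and the 20-sample floor is an added assumption to avoid tripping on noise:

import time
from collections import defaultdict, deque

WINDOW = 300                                   # 5-minute sliding window, seconds
RESULTS = defaultdict(deque)                   # domain -> (timestamp, blocked) pairs
PAUSED_UNTIL = defaultdict(float)              # domain -> epoch time scraping resumes

def record(domain, blocked):
    q = RESULTS[domain]
    now = time.time()
    q.append((now, blocked))
    while q and q[0][0] < now - WINDOW:        # drop results outside the window
        q.popleft()
    if len(q) < 20:
        return                                 # too few samples to judge
    rate = sum(b for _, b in q) / len(q)
    if rate >= 0.8:
        PAUSED_UNTIL[domain] = now + 2 * 3600  # pause 2h; rotate the proxy subset
    elif rate >= 0.5:
        PAUSED_UNTIL[domain] = now + 30 * 60   # pause 30 minutes

def is_open(domain):
    return time.time() < PAUSED_UNTIL[domain]  # True while the breaker is tripped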

Data Quality at Scale

Scraped data is only useful if it's accurate. At scale, data quality degrades silently.

Schema enforcement. Define expected output schemas and validate every extraction. A product without a price, a listing without a title, or a review without a rating should be flagged.

Structural change detection. Target sites change their HTML structure without warning. Implement automated detection: if the extraction success rate for a specific field drops below 90%, the parser is likely broken. Alert immediately — stale data is worse than no data.
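
A sketch of per-field success tracking; the 90% threshold matches the text, the sample floor is an assumption, and in production these counters live in the metrics store rather than process memory:

from collections import Counter

SEEN = Counter()      # field -> records processed
FILLED = Counter()    # field -> records where extraction succeeded

def track(record, fields=("name", "price", "rating")):
    for f in fields:
        SEEN[f] += 1
        if record.get(f) is not None:
            FILLED[f] += 1

def broken_fields(min_samples=100, threshold=0.90):
    # Fields whose extraction rate fell below the threshold: parser likely broken.
    return [f for f in SEEN
            if SEEN[f] >= min_samples and FILLED[f] / SEEN[f] < threshold]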

Deduplication. Scraping the same item multiple times across runs is inevitable. Deduplicate on natural keys (URL, product ID, listing ID) and merge changes over time.

Legal and Ethical Boundaries

Web scraping operates in a legal landscape that varies by jurisdiction and evolves through case law.

In the US: The hiQ Labs v. LinkedIn (2022) decision established that scraping publicly accessible data does not violate the CFAA. However, this ruling specifically applies to public data — scraping behind login walls or circumventing access controls changes the legal analysis.

In the EU: GDPR applies when scraped data includes personal information. Names, email addresses, phone numbers, and photos are personal data. Scraping and processing them requires a legal basis — legitimate interest is the most common, but it requires documentation and balance testing.

robots.txt is not legally binding in most jurisdictions, but ignoring it can be used as evidence of bad faith in legal proceedings. Respect it where practical; document business justifications when you don't.

Terms of service violations are contract breaches, not criminal acts (see our ethics framework). However, they give the platform grounds for legal action if they choose to pursue it.

FAQ

Headless browsers or HTTP clients for scraping? HTTP clients for 80% of targets. They're 10-50x faster, use 90% fewer resources, and handle most static and server-rendered content. Use headless browsers only when JavaScript rendering is required — SPAs, infinite scroll pages, or content loaded via AJAX. Test each target with an HTTP client first; only escalate to browser rendering when extraction fails.

How do I handle JavaScript-heavy sites? Playwright in headed mode with a virtual display (Xvfb on Linux) combines JavaScript rendering with lower detection than headless mode. For maximum stealth, see headless browser detection. Alternatively, reverse-engineer the site's API calls using browser DevTools Network tab and call the APIs directly — this is faster and stealthier than rendering pages.

What does scraping actually cost per request? Ranges from $0.001 (HTTP mode, datacenter proxy) to $0.05 (browser mode, residential proxy with CAPTCHA solving). At 1M pages/day: $1,000-50,000/month depending on target sophistication. The biggest variable is proxy cost — optimize for the minimum proxy quality that avoids detection.

Is there an API alternative to scraping? Sometimes. Many sites offer APIs (official or unofficial) that provide structured data more reliably than HTML scraping. Check for: official public APIs, undocumented APIs visible in browser DevTools, and data provider services that aggregate scraped data as a service. APIs are faster, more stable, and less likely to trigger legal issues.

Written by Empirium Team
