
How to Build a Scraping Pipeline

A scraping pipeline is not a loop that sends requests through a proxy. It is a system with components that handle IP rotation, session management, retry logic, monitoring, and data extraction independently — each component replaceable when it becomes the constraint.

In practice

  • Proxy layer: handles IP assignment, rotation policy, and geo-targeting ✔
  • Request layer: handles TLS fingerprint, headers, and behavioral timing ✔
  • Session layer: manages sticky/rotating configuration per workflow type ✔
  • Retry layer: classifies responses and retries only retriable failures ✔
  • Monitoring layer: tracks success rate over time — not just at start ✔

A pipeline built as a monolith fails as a monolith. A pipeline built as components fails at the component that's the current constraint — which is diagnosable and replaceable.
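As a minimal sketch of that boundary, the pipeline below wires five independently replaceable layers together. Every class and method name here is illustrative, not any specific library's API:

```python
# Minimal sketch of the five-layer split; all names are illustrative.
import random

class ProxyLayer:
    def endpoint(self, session_id=None):
        # Real version: build the gateway URL from pool/geo/rotation config.
        return f"http://gateway.example:8000?session={session_id or 'rotate'}"

class RequestLayer:
    def fetch(self, url, proxy):
        # Real version: send through `proxy` with the configured TLS
        # fingerprint, headers, and timing. Stubbed for the sketch.
        return {"status": random.choice([200, 200, 200, 429]), "body": "..."}

class SessionLayer:
    def session_for(self, workflow):
        # Sticky session for authenticated workflows, rotation otherwise.
        return f"auth-{random.randint(0, 10**6)}" if workflow == "auth" else None

class RetryLayer:
    def execute(self, attempt, max_retries=3):
        for _ in range(max_retries):
            response = attempt()
            if response["status"] == 200:
                return response
        return response                      # cap reached; caller flags it

class MonitorLayer:
    def record(self, url, status):
        print(f"{url} -> {status}")          # real version: time-series sink

class Pipeline:
    """Each layer is injected, so replacing one never touches the others."""
    def __init__(self, proxy, requester, sessions, retrier, monitor):
        self.proxy, self.requester = proxy, requester
        self.sessions, self.retrier, self.monitor = sessions, retrier, monitor

    def run(self, url, workflow="stateless"):
        session_id = self.sessions.session_for(workflow)      # session layer
        attempt = lambda: self.requester.fetch(               # request layer
            url, self.proxy.endpoint(session_id))             # proxy layer
        response = self.retrier.execute(attempt)              # retry layer
        self.monitor.record(url, response["status"])          # monitoring layer
        return response

pipeline = Pipeline(ProxyLayer(), RequestLayer(), SessionLayer(),
                    RetryLayer(), MonitorLayer())
pipeline.run("https://example.com/item/1")
```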

Overview

Every scraping setup is a pipeline whether it's architected as one or not. The request leaves the client, routes through a proxy, arrives at the target, and returns a response. Each stage can be the point of failure: the proxy IP can be blocked, the request can fail TLS fingerprinting, the session can break from IP rotation, the retry logic can amplify bad requests, or the monitoring can fail to detect degradation until it's critical. Building each stage as an explicit, observable component makes the failure point identifiable.

Operators who build a single scraping script that handles proxy, request, session, retry, and output in one function discover they can't isolate failure causes — and can't replace individual components when one becomes the constraint.

How to think about it

The proxy layer is responsible for IP assignment, rotation policy, and geographic targeting. Its output to the pipeline is a gateway endpoint and session credentials. The rest of the pipeline should not contain proxy-specific logic — if the proxy configuration needs to change (different pool, different rotation policy, different provider), the change happens in the proxy layer without touching request handling, session management, or retry logic.

Proxy layer configuration parameters the pipeline should expose as explicit settings: pool type (residential/datacenter/ISP), rotation mode (per-request/sticky), session duration for sticky mode, geographic targeting parameters, and concurrency limit. These are tunable independently of the rest of the pipeline. When block rate increases, the proxy layer configuration is the first variable to adjust — not the request structure or the retry logic.
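One way to keep those settings explicit and independently tunable is a single configuration object owned by the proxy layer. A minimal sketch, with illustrative field names and defaults:

```python
# Hypothetical proxy-layer configuration; fields and defaults are illustrative.
from dataclasses import dataclass
from typing import Literal

@dataclass
class ProxyConfig:
    pool_type: Literal["residential", "datacenter", "isp"] = "residential"
    rotation_mode: Literal["per_request", "sticky"] = "per_request"
    sticky_duration_s: int = 600       # used only when rotation_mode == "sticky"
    country: str | None = None         # geo-targeting, e.g. "US"
    city: str | None = None
    max_concurrency: int = 50          # ceiling on simultaneous connections

# Tuning the proxy layer means editing this object, nothing else:
config = ProxyConfig(rotation_mode="sticky", sticky_duration_s=300, country="DE")
```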

Proxy layer health monitoring: track connection success rate (successful proxy connection vs connection failures), and separately track target response success rate (HTTP 200 with expected content vs blocks, CAPTCHAs, and errors). Separating these two metrics identifies whether a failure is at the proxy connection level or at the target response level. A high proxy connection success rate with low target response success rate means the proxy is working; the target is blocking. A low proxy connection success rate means the proxy itself has a configuration or infrastructure problem.
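A sketch of that two-level separation, with hypothetical counters and illustrative thresholds in diagnose():

```python
# Two separate health metrics; a sketch, not a monitoring library's API.
class ProxyHealth:
    def __init__(self):
        self.conn_attempts = self.conn_ok = 0    # proxy connection level
        self.responses = self.responses_ok = 0   # target response level

    def record_connection(self, connected: bool):
        self.conn_attempts += 1
        self.conn_ok += connected

    def record_response(self, status: int, looks_blocked: bool):
        self.responses += 1
        self.responses_ok += (status == 200 and not looks_blocked)

    def diagnose(self) -> str:
        conn_rate = self.conn_ok / max(self.conn_attempts, 1)
        resp_rate = self.responses_ok / max(self.responses, 1)
        if conn_rate < 0.90:
            return "proxy infrastructure or configuration problem"
        if resp_rate < 0.80:
            return "proxy is working; target is blocking"
        return "healthy"
```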

How it works

The request layer is responsible for constructing requests that match the target's expectations beyond the IP: TLS fingerprint configuration, request headers (User-Agent, Accept-Language, Accept-Encoding, Referer), request timing (inter-request jitter, rate limiting compliance), and resource loading behavior (whether to fetch associated resources alongside the primary target). This layer is independent of the proxy layer — the same request configuration applies regardless of which proxy IP delivers it. Changes to this layer don't require changes to proxy configuration.
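A minimal request-layer sketch, assuming the curl_cffi package for browser-matching TLS fingerprints (any HTTP client with equivalent fingerprint control works); the header values and jitter bounds are illustrative:

```python
# Request-layer sketch. Assumes curl_cffi for TLS fingerprint impersonation.
import random
import time

from curl_cffi import requests  # pip install curl_cffi

HEADERS = {
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://example.com/",
}

def fetch(url: str, proxy: str):
    time.sleep(random.uniform(1.0, 4.0))        # inter-request jitter
    return requests.get(
        url,
        headers=HEADERS,
        proxies={"http": proxy, "https": proxy},
        impersonate="chrome",                   # TLS fingerprint + matching User-Agent
        timeout=30,
    )
```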

The retry layer classifies responses into retriable and non-retriable failures. Retriable failures: connection timeouts, proxy connection errors, target-side 503s (temporary server unavailability), and CAPTCHA responses, which may clear when re-attempted on a fresh IP after a delay. Non-retriable failures: 404s (target URL doesn't exist), authentication failures (credentials invalid, not IP-related), and permanent bans (account-level or subnet-level blocks that won't clear on retry). Retrying non-retriable failures amplifies load against blocked endpoints without improving yield — and in some cases accelerates IP reputation degradation.
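That classification can live in one pure function. A sketch, with a deliberately naive CAPTCHA check and 429s treated as retriable after backoff (an assumption to tune per target):

```python
# Classification sketch: the retry layer's single decision point.
def is_retriable(status: int | None, body: str, error: Exception | None = None) -> bool:
    if error is not None:
        return True                    # connection timeout / proxy connection error
    if status == 503:
        return True                    # temporary server unavailability
    if status == 429:
        return True                    # rate limited: retriable after backoff (assumption)
    if "captcha" in body.lower():
        return True                    # may clear on a fresh IP after a delay
    return False                       # 404s, auth failures, permanent bans: flag or drop
```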

Retry strategy for retriable failures: exponential backoff with jitter between retries, new IP assignment on each retry (new session ID for sticky mode or per-request rotation for rotating mode), and a maximum retry count before the item is flagged for manual review or dropped. Retrying indefinitely on a target that is applying a sustained block creates a loop that consumes bandwidth and accelerates pool contamination without producing data.
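Putting those three controls together: a retry loop sketch that reuses is_retriable from the previous example and assumes a hypothetical fetch helper returning (status, body, error); the backoff bounds and retry cap are illustrative.

```python
import random
import time

def fetch_with_retries(url, fetch, new_session_id, max_retries=4):
    """Backoff with jitter, fresh IP per attempt, hard retry cap."""
    for attempt in range(max_retries):
        session_id = new_session_id()            # fresh sticky session => fresh IP
        status, body, error = fetch(url, session_id)
        if status == 200 and error is None:
            return body
        if not is_retriable(status, body, error):
            return None                          # flag for manual review or drop
        time.sleep(2 ** attempt + random.uniform(0, 1))   # 1-2s, 2-3s, 4-5s, 8-9s
    return None                                  # cap reached: stop the loop here
```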

Where it breaks

Pipelines that mix stateless and stateful requests in the same queue against the same target require session management logic that assigns rotation vs sticky configuration per request type — not globally. A pipeline that uses per-request rotation globally fails authenticated requests. A pipeline that uses sticky sessions globally concentrates stateless requests on fewer IPs and reintroduces rate limiting where rotation would have prevented it. The session layer must route request types to the correct configuration.
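A sketch of that routing, with illustrative workflow categories; returning None signals per-request rotation:

```python
# Session-layer routing sketch: rotation vs sticky is decided per request
# type, never globally. Workflow categories are illustrative.
STICKY_WORKFLOWS = {"login", "checkout", "account_scrape"}    # stateful
ROTATING_WORKFLOWS = {"search", "listing", "product_page"}    # stateless

def session_for(workflow: str, workflow_run_id: str) -> str | None:
    if workflow in STICKY_WORKFLOWS:
        return f"{workflow}-{workflow_run_id}"    # sticky: one IP across the flow
    return None                                   # None => per-request rotation
```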

Session isolation for concurrent authenticated operations: each authenticated session must use a separate session ID — and therefore a separate sticky IP — to prevent session cross-contamination. When concurrent sessions share a sticky IP on a target that binds session cookies to IP, one session's authentication state can interfere with another's. Session IDs should be generated per workflow execution, not reused across executions.
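Generating the ID per execution is a one-liner; uuid4 is one reasonable choice:

```python
import uuid

def start_workflow(workflow: str) -> str:
    # A new ID per execution means a new sticky IP, so concurrent
    # authenticated sessions never share an exit IP or leak state.
    return f"{workflow}-{uuid.uuid4().hex}"
```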

Session warm-up for targets that build trust through session history: some targets apply lower challenge rates to sessions that have established a browsing history — pages viewed, time on site, navigation sequences. A pipeline that goes directly to high-value extraction endpoints without warm-up activity triggers challenges that a session with prior browsing history would not. Implementing configurable warm-up request sequences before extraction requests reduces challenge rate on sensitive targets.
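A configurable warm-up sketch; the page sequence and dwell times are illustrative and target-specific:

```python
import random
import time

# Illustrative warm-up sequence: homepage, a category, a nearby page.
WARMUP_SEQUENCE = [
    "https://example.com/",
    "https://example.com/category/widgets",
    "https://example.com/category/widgets?page=2",
]

def warm_up(fetch, session_id, min_dwell=2.0, max_dwell=6.0):
    """Build session history on the sticky IP before extraction requests."""
    for url in WARMUP_SEQUENCE:
        fetch(url, session_id)
        time.sleep(random.uniform(min_dwell, max_dwell))   # simulated time-on-page
```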

In context

Success rate time series: sample success rate at regular intervals (every 5–15 minutes) throughout the scraping run. A stable success rate indicates the pipeline is operating within the pool's capacity and the target's detection thresholds. A declining success rate indicates pool degradation. The alert threshold should trigger at the point where the decline, if it continues at the observed rate, will produce unacceptable yield before the run completes — not after the run is already below threshold.
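That projection rule can be a simple linear extrapolation over recent samples. A sketch, assuming evenly spaced samples and an illustrative yield floor:

```python
# Trend-based alert sketch: fire when the *projected* end-of-run success
# rate falls below the acceptable floor, not when the current rate does.
def should_alert(samples: list[float], intervals_remaining: int, floor: float = 0.85) -> bool:
    if len(samples) < 2:
        return False
    recent = samples[-6:]                                  # last few samples
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)   # change per interval
    projected = recent[-1] + slope * intervals_remaining
    return projected < floor

# Sampled every 10 minutes with 2 hours (12 intervals) left in the run:
should_alert([0.97, 0.96, 0.94, 0.93, 0.91], intervals_remaining=12)
# slope = -0.015/interval, projected = 0.91 - 0.18 = 0.73 < 0.85 -> alert now
```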

Block response classification: categorize non-200 responses by type — CAPTCHA, 403 (likely ASN or IP reputation), 429 (rate limiting), 503 (temporary). Different block types indicate different causes. A shift from 403s to 429s without configuration change indicates the target's rate limiting threshold has been reached — not that IP reputation has degraded. Different response types require different interventions.
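A sketch of that categorization; the CAPTCHA check is again a naive body substring:

```python
from collections import Counter

def classify(status: int, body: str) -> str:
    if "captcha" in body.lower():
        return "captcha"
    labels = {403: "reputation_block", 429: "rate_limited", 503: "temporary"}
    return labels.get(status, "ok" if status == 200 else "other")

block_counts = Counter()
# For each response: block_counts[classify(status, body)] += 1. A shift in
# this distribution (e.g. 403s giving way to 429s) names the cause before
# yield collapses.
```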

Track proxy layer metrics separately from target response metrics: proxy connection success rate, average connection latency, exit IP distribution (to verify rotation is working and pool depth is sufficient), and geo distribution if targeting is configured. These metrics distinguish proxy infrastructure problems from target detection problems — which require different fixes.
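One way to keep the two metric families apart is two separate records; a sketch of the proxy-side one, where exit IP distribution doubles as the rotation check:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ProxyMetrics:                                # infrastructure level only
    conn_attempts: int = 0
    conn_failures: int = 0
    latencies_ms: list[float] = field(default_factory=list)
    exit_ips: Counter = field(default_factory=Counter)   # rotation / pool depth
    geo: Counter = field(default_factory=Counter)        # if targeting is configured

    def rotation_looks_healthy(self, min_unique: int = 100) -> bool:
        # Few unique exit IPs across many requests means rotation is broken
        # or the pool is shallow; the threshold is illustrative.
        return len(self.exit_ips) >= min_unique
```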

Choose your path

Each component should be replaceable independently: change the proxy provider without touching the request layer; change the TLS stack without touching the proxy layer; adjust retry logic without affecting session management. This architecture makes the response to each failure mode surgical — replace the failing component, not the whole pipeline.

  • Define proxy layer, request layer, session layer, retry layer, and monitoring layer as separate components
  • Track success rate continuously — alert on decline, not on absolute threshold breach
  • Classify block responses by type — each type has a different cause and fix
  • Assign session IDs per workflow execution, not globally
  • Configure retry on retriable failures only — retry loops on non-retriable failures accelerate degradation
  • Proxy setup for scaling — pipeline architecture for high-volume operations
  • Proxy setup failure modes — which component each failure pattern points to
  • Proxy providers for scraping — evaluated by pipeline integration requirements