Diagnose why you are getting 403
Before changing anything, find out which signal is triggering the block -- the fix for a header problem is useless against an IP problem. Read the response body and headers first, because they usually name the cause.
- Read the body. A short plain "Forbidden" or "Access Denied" page is typically a basic WAF (web application firewall, simple edge pattern-matching). A branded challenge page with a ray ID points to a dedicated anti-bot detection service.
- Read the headers. Look for cf-ray (Cloudflare), x-amzn-waf-action (AWS WAF), server: AkamaiGHost (Akamai), or an x-datadome header or datadome cookie (DataDome). These tell you what you are up against.
- Isolate the variable. Run the exact same URL three ways: with plain requests, with full browser headers, and through a proxy in the target's main country. If headers fix it, the cause was your header set. If only the proxy fixes it, the cause was IP reputation or geo. If nothing fixes it, the most likely cause is your TLS fingerprint -- a signal that plain Python clients cannot change.
The 403 is only the symptom. The real decision was made one layer earlier, based on the cheapest signal the server could check, so identify that signal before you start editing code.
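The checks above can be sketched as two small pure functions. This is an illustrative sketch, not a definitive detector: the function names `identify_blocker` and `isolate_cause`, the returned labels, and the 200-character "short body" threshold are all assumptions made for the example, and the DataDome check assumes the vendor's markers contain the string `datadome`. Note that a `cf-ray` header only shows the site sits behind Cloudflare; it does not prove Cloudflare issued the block.

```python
def identify_blocker(status_code, headers, body=""):
    """Guess which vendor or layer sits in front of the site (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalise before lookups.
    h = {k.lower(): str(v) for k, v in headers.items()}
    cookies = h.get("set-cookie", "").lower()

    # Assumption: DataDome markers contain "datadome" (cookie or x-datadome* header).
    if "datadome" in cookies or any(k.startswith("x-datadome") for k in h):
        return "datadome"
    if "x-amzn-waf-action" in h:
        return "aws-waf"
    if "akamaighost" in h.get("server", "").lower():
        return "akamai"
    if "cf-ray" in h:
        return "cloudflare"
    # A short, unbranded 403 body suggests a basic WAF (threshold is arbitrary).
    if status_code == 403 and len(body.strip()) < 200:
        return "basic-waf"
    return "unknown"


def isolate_cause(plain_ok, headers_ok, proxy_ok):
    """Map the three-way test (plain, browser headers, proxy) to a likely cause."""
    if plain_ok:
        return "not blocked"
    if headers_ok:
        return "header set"
    if proxy_ok:
        return "IP reputation or geo"
    return "likely TLS fingerprint"
```

With a `requests` response object `r`, a call might look like `identify_blocker(r.status_code, r.headers, r.text)`. Feed `isolate_cause` the outcome of each of the three runs (True meaning the request was not blocked).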
Fix it step by step (headers, sessions, TLS)
Work through these layers in order. Most 403s on permitted sites are solved by the first two; the rest need a real browser fingerprint.
- Send a real User-Agent. The default python-requests/2.x string is the single most common cause of a 403 that works fine in a browser. Copy a current Chrome or Firefox User-Agent and use it verbatim.
- Send the full browser header set. A request with only a User-Agent still looks nothing like a browser. Add Accept, Accept-Language, Accept-Encoding, and a plausible Referer. Chromium-based browsers also send Client Hints (sec-ch-ua) and Fetch Metadata headers (sec-fetch-*); matching those helps on stricter sites.
- Persist cookies with a session. Many sites set a cookie on first load and 403 any follow-up request that arrives without it. A requests.Session() (or curl_cffi.requests.Session()) carries those cookies forward automatically.
- Match the TLS handshake. If 403s survive perfect headers and cookies, the server is probably fingerprinting your TLS handshake (JA3) and HTTP/2 settings -- the handshake reveals "Python" before the server reads a single header. Plain requests cannot change this. Switch to curl_cffi with impersonate="chrome", which replays a real browser's TLS handshake and HTTP/2 settings. Use impersonate="chrome124" to pin a specific version, or "chrome" to track the latest.
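A minimal sketch of the four layers above, assuming `requests` and `curl_cffi` are installed. The User-Agent string is a placeholder to replace with a current one, and `build_headers`, `fetch_with_session`, and `fetch_impersonated` are hypothetical helper names. The third-party imports sit inside the functions so the header helper can be used without them.

```python
# Placeholder values: copy a current User-Agent from your own browser.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Advertise only encodings your client can decode.
    "Accept-Encoding": "gzip, deflate",
}


def build_headers(referer=None):
    """Return a fresh copy of the browser-like header set, optionally with a Referer."""
    headers = dict(BROWSER_HEADERS)
    if referer:
        headers["Referer"] = referer
    return headers


def fetch_with_session(landing_url, target_url):
    """Layers 1-3: real User-Agent, full headers, cookies persisted by a session."""
    import requests  # third-party

    with requests.Session() as session:
        session.headers.update(build_headers())
        session.get(landing_url, timeout=15)  # first load collects cookies
        return session.get(
            target_url, headers={"Referer": landing_url}, timeout=15
        )


def fetch_impersonated(url, referer=None):
    """Layer 4: let curl_cffi replay a browser's TLS and HTTP/2 settings."""
    from curl_cffi import requests as cffi_requests  # third-party

    # Impersonation supplies matching browser headers itself, so only the
    # Referer is added here; a mismatched User-Agent would undo the disguise.
    extra = {"Referer": referer} if referer else None
    return cffi_requests.get(url, impersonate="chrome", headers=extra, timeout=15)
```

Trying `fetch_with_session` first and falling back to `fetch_impersonated` only when the 403 persists mirrors the order of the list above.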
When the fix is the IP: proxies and pacing
If headers, cookies, and TLS impersonation all check out but you still see 403, the block is about where the request comes from and how fast the requests arrive.
- IP reputation. Datacenter IP ranges (cloud servers, hosting providers) are widely flagged because real users do not browse from them. If the block is tied to datacenter IP ranges, testing the same request from a different network -- for example a residential proxy -- helps confirm whether IP reputation, rather than your client, was the cause. Pass proxies to curl_cffi as proxies={"https": "http://user:pass@host:port"}.
- Geo restrictions. Some sites only serve specific countries. If the body mentions your region, choose a proxy located in the target's main market.
- Rate and pattern. A burst of identical requests from one IP reads as automation. Add a delay between requests, randomize it slightly, and keep per-IP request volume low and well-spaced so your traffic stays within a reasonable, polite rate for the site. Slowing down is often the cheapest fix of all.
- Stop retrying blindly. Repeated 403s from the same identity reinforce the block. On a 403, change a signal (IP, headers, or fingerprint) before the next attempt; never loop on the same request.
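The pacing and "change a signal before retrying" advice can be sketched as follows. This is an illustration under stated assumptions: `jittered_delay`, `fetch_with_rotation`, and `cffi_fetch` are hypothetical names, an "identity" is just a dict describing the impersonation target and optional proxies, and the 2-second base delay is an arbitrary starting point rather than a recommended value for any particular site.

```python
import random
import time


def jittered_delay(base=2.0, jitter=0.5, rng=random):
    """Return `base` seconds varied randomly by up to +/- `jitter` (a fraction)."""
    return base * (1 + rng.uniform(-jitter, jitter))


def fetch_with_rotation(fetch, url, identities, sleep=time.sleep):
    """Try each identity once, pausing between attempts.

    `fetch(url, identity)` must return an HTTP status code. An identity is
    never reused after a 403, so the same request is never looped.
    Returns (final_status, list_of_identities_tried).
    """
    tried = []
    for index, identity in enumerate(identities):
        if index > 0:
            sleep(jittered_delay())  # pause before every retry
        status = fetch(url, identity)
        tried.append(identity)
        if status != 403:
            return status, tried
    return 403, tried


def cffi_fetch(url, identity):
    """Example `fetch` built on curl_cffi (third-party, imported lazily)."""
    from curl_cffi import requests as cffi_requests

    response = cffi_requests.get(
        url,
        impersonate=identity.get("impersonate", "chrome"),
        proxies=identity.get("proxies"),  # e.g. {"https": "http://user:pass@host:port"}
        timeout=15,
    )
    return response.status_code
```

A call such as `fetch_with_rotation(cffi_fetch, url, [{"impersonate": "chrome"}, {"impersonate": "chrome", "proxies": {...}}])` gives up once every identity has been tried instead of hammering the site.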
Stitching real browser headers, TLS impersonation, a residential proxy pool, retries, and pacing together by hand is a lot of moving parts. A managed web-data API such as Scrappey rolls the browser fingerprint, proxy rotation, and retry logic into a single request, which is one way to consolidate all of the layers above when you would rather not maintain them yourself.
