A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Requests. Cloudflare changes their techniques periodically, so I will update this repo frequently.
This can be useful if you wish to scrape or crawl a website protected with Cloudflare. Cloudflare's anti-bot page currently just checks if the client supports Javascript, though they may add additional techniques in the future.
Due to Cloudflare continually changing and hardening their protection page, cloudscraper requires a JavaScript Engine/interpreter to solve Javascript challenges. This allows the script to easily impersonate a regular web browser without explicitly deobfuscating and parsing Cloudflare's Javascript.
For reference, this is the default message Cloudflare uses for these sorts of pages:
Checking your browser before accessing website.com.
This process is automatic. Your browser will redirect to your requested content shortly.
Please allow up to 5 seconds...
Any script using cloudscraper will sleep for ~5 seconds for the first visit to any site with Cloudflare anti-bots enabled, though no delay will occur after the first request.
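If you ever need to tell whether a raw response body is this interstitial rather than the real page (for example when debugging), a rough heuristic is to look for the default phrases quoted above. This is only a sketch: `looks_like_iuam_page` is a hypothetical helper, not part of cloudscraper, and it assumes the site has not customized its challenge page.

```python
import re

# Phrases taken from the default interstitial text quoted above.
# Assumption: the site still uses Cloudflare's default wording.
_IUAM_MARKERS = (
    "Checking your browser before accessing",
    "Please allow up to 5 seconds",
)


def looks_like_iuam_page(html: str) -> bool:
    """Return True if the body contains every default IUAM phrase."""
    # Collapse whitespace so line breaks in the markup do not hide a phrase.
    text = re.sub(r"\s+", " ", html)
    return all(marker in text for marker in _IUAM_MARKERS)


sample = """
<h1>Checking your browser before accessing website.com.</h1>
<p>This process is automatic.</p>
<p>Please allow up to
5 seconds...</p>
"""
print(looks_like_iuam_page(sample))
print(looks_like_iuam_page("<html>Hello</html>"))
```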
If you feel like showing your love and/or appreciation for this project, then how about shouting me a coffee or beer :)
Simply run pip install cloudscraper. The PyPI package is at https://pypi.python.org/pypi/cloudscraper/
Alternatively, clone this repository and run python setup.py install.
- Python 3.x
- Requests >= 2.9.2
- requests_toolbelt >= 0.9.1
python setup.py install will install the Python dependencies automatically. The JavaScript interpreters and/or engines you decide to use are the only things you need to install yourself, excluding js2py, which is part of the requirements as the default.
We support the following Javascript interpreters/engines.
- ChakraCore: Library binaries can also be located here.
- js2py: >=0.67
- native: Self made native python solver (Default)
- Node.js
- V8: We use Sony's v8eval() python module.
The simplest way to use cloudscraper is by calling create_scraper().
import cloudscraper

scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper()  # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
That's it...
Any requests made from this session object to websites protected by Cloudflare anti-bot will be handled automatically. Websites not using Cloudflare will be treated normally. You don't need to configure or call anything further, and you can effectively treat all websites as if they're not protected with anything.
You use cloudscraper exactly the same way you use Requests. CloudScraper works identically to a Requests Session object; instead of calling requests.get() or requests.post(), you call scraper.get() or scraper.post().
Consult Requests' documentation for more information.
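Because the scraper behaves like a Requests Session, code written against the Session interface should accept either one. The sketch below illustrates that idea with a hypothetical `fetch_title` helper (not part of cloudscraper or Requests) that only relies on `session.get()`, `response.raise_for_status()` and `response.text`.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_title(session, url):
    """Fetch url with any Session-like object and return the page title."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    parser = _TitleParser()
    parser.feed(response.text)
    return parser.title
```

With cloudscraper installed, this would be called as `fetch_title(cloudscraper.create_scraper(), "http://somesite.com")`; passing a plain `requests.Session()` works the same way for sites that are not behind the challenge page.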
Options
Disable Cloudflare V1
Description
Set this option if you do not want cloudscraper to even attempt solving Cloudflare v1 (deprecated) challenges.
Parameters
Parameter | Value | Default |
---|---|---|
disableCloudflareV1 | (boolean) | False |
Example
scraper = cloudscraper.create_scraper(disableCloudflareV1=True)
Brotli
Description
Brotli decompression support has been added, and it is enabled by default.
Parameters
Parameter | Value | Default |
---|---|---|
allow_brotli | (boolean) | True |
Example
scraper = cloudscraper.create_scraper(allow_brotli=False)
Browser / User-Agent Filtering
Description
Control how and which User-Agent is "randomly" selected.
Parameters
Can be passed as an argument to create_scraper(), get_tokens(), get_cookie_string().
Parameter | Value | Default |
---|---|---|
browser | (string) chrome or firefox | None |
Or
Parameter | Value | Default |
---|---|---|
browser | (dict) | |
browser dict Parameters
Parameter | Value | Default |
---|---|---|
browser | (string) chrome or firefox | None |
mobile | (boolean) | True |
desktop | (boolean) | True |
platform | (string) 'linux', 'windows', 'darwin', 'android', 'ios' | None |
custom | (string) | None |
Example
scraper = cloudscraper.create_scraper(browser='chrome')
or
# will give you only mobile chrome User-Agents on Android
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'android',
        'desktop': False
    }
)

# will give you only desktop firefox User-Agents on Windows
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'firefox',
        'platform': 'windows',
        'mobile': False
    }
)

# Custom will also try to find the user-agent string in the browsers.json.
# If a match is found, it will use the headers and cipherSuite from that "browser",
# otherwise a generic set of headers and cipherSuite will be used.
scraper = cloudscraper.create_scraper(
    browser={
        'custom': 'ScraperBot/1.0',
    }
)
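Since the `browser` dict accepts only a handful of values, it can help to validate it before handing it to `create_scraper()`. The following is a minimal sketch: `make_browser_option` is a hypothetical helper, the allowed values are copied from the table above, and the rule that `mobile` and `desktop` cannot both be off is an assumption rather than documented behaviour.

```python
# Allowed values copied from the parameter table above.
ALLOWED_BROWSERS = {"chrome", "firefox"}
ALLOWED_PLATFORMS = {"linux", "windows", "darwin", "android", "ios"}


def make_browser_option(browser=None, platform=None, mobile=True,
                        desktop=True, custom=None):
    """Hypothetical helper: build a dict for the `browser` argument."""
    if browser is not None and browser not in ALLOWED_BROWSERS:
        raise ValueError(f"browser must be one of {sorted(ALLOWED_BROWSERS)}")
    if platform is not None and platform not in ALLOWED_PLATFORMS:
        raise ValueError(f"platform must be one of {sorted(ALLOWED_PLATFORMS)}")
    # Assumption: with both flags off there would be no User-Agent left to pick.
    if not mobile and not desktop:
        raise ValueError("mobile and desktop cannot both be False")

    option = {"mobile": mobile, "desktop": desktop}
    # Leave out optional keys the caller did not set.
    if browser is not None:
        option["browser"] = browser
    if platform is not None:
        option["platform"] = platform
    if custom is not None:
        option["custom"] = custom
    return option


android_chrome = make_browser_option(browser="chrome", platform="android", desktop=False)
print(android_chrome)
```

The resulting dict would then be passed as `cloudscraper.create_scraper(browser=android_chrome)`.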
Debug
Description
Prints out header and content information of the request for debugging.
Parameters
Can be set as an attribute via your cloudscraper object or passed as an argument to create_scraper(), get_tokens(), get_cookie_string().
Parameter | Value | Default |
---|---|---|
debug | (boolean) | False |
Example
scraper = cloudscraper.create_scraper(debug=True)
Delays
Description
The Cloudflare IUAM challenge requires the browser to wait ~5 seconds before submitting the challenge answer. Set this option if you would like to override that delay.
Parameters
Can be set as an attribute via your cloudscraper object or passed as an argument to create_scraper(), get_tokens(), get_cookie_string().
Parameter | Value | Default |
---|---|---|
delay | (float) | extracted from IUAM page |
Example
scraper = cloudscraper.create_scraper(delay=10)
Existing session
Description
If you already have an existing Requests session, you can pass it to the function create_scraper() to continue using that session.
Parameters
Parameter | Value | Default |
---|---|---|
sess | (requests.session) | None |
Example
import requests
import cloudscraper
session = requests.session()
scraper = cloudscraper.create_scraper(sess=session)
Note
Unfortunately, not all of a Requests session's attributes are easily transferable, so if you run into problems with this, you should replace your initial session initialization call
From:
sess = requests.session()
To:
sess = cloudscraper.create_scraper()
JavaScript Engines and Interpreters
Description
cloudscraper currently supports the following JavaScript Engines/Interpreters
- ChakraCore
- js2py
- native: Self made native python solver (Default)
- Node.js
- V8
Parameters
Can be set as an attribute via your cloudscraper object or passed as an argument to create_scraper(), get_tokens(), get_cookie_string().
Parameter | Value | Default |
---|---|---|
interpreter | (string) | native |
Example
scraper = cloudscraper.create_scraper(interpreter='nodejs')
3rd Party Captcha Solvers
Description
cloudscraper currently supports the following 3rd party Captcha solvers, should you require them.
- 2captcha
- anticaptcha
- CapSolver
- CapMonster Cloud
- deathbycaptcha
- 9kw
- return_response
Note
I am working on adding more 3rd party solvers. If you wish to have a service added that is not currently supported, please raise a support ticket on GitHub.
Required Parameters
Can be set as an attribute via your cloudscraper object or passed as an argument to create_scraper(), get_tokens(), get_cookie_string().
Parameter | Value | Default |
---|---|---|
captcha | (dict) | None |
2captcha
Required captcha Parameters
Parameter | Value | Required | Default |
---|---|---|---|
provider | (string) 2captcha | yes | |
api_key | (string) | yes | |
no_proxy | (boolean) | no | False |
Note
If proxies are set, you can disable sending the proxies to 2captcha by setting no_proxy to True.
Example
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': '2captcha',
        'api_key': 'your_2captcha_api_key'
    }
)
anticaptcha
Required captcha Parameters
Parameter | Value | Required | Default |
---|---|---|---|
provider | (string) anticaptcha | yes | |
api_key | (string) | yes | |
no_proxy | (boolean) | no | False |
Note
If proxies are set, you can disable sending the proxies to anticaptcha by setting no_proxy to True.
Example
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': 'anticaptcha',
        'api_key': 'your_anticaptcha_api_key'
    }
)
CapSolver
Required captcha Parameters
Parameter | Value | Required | Default |
---|---|---|---|
provider | (string) capsolver | yes | |
api_key | (string) | yes | |
Example
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': 'capsolver',
        'api_key': 'your_capsolver_api_key'
    }
)
CapMonster Cloud
Required captcha Parameters
Parameter | Value | Required | Default |
---|---|---|---|
provider | (string) capmonster | yes | |
clientKey | (string) | yes | |
no_proxy | (boolean) | no | False |
Note
If proxies are set, you can disable sending the proxies to CapMonster by setting no_proxy to True.
Example
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': 'capmonster',
        'clientKey': 'your_capmonster_clientKey'
    }
)
deathbycaptcha
Required captcha Parameters
Parameter | Value | Required | Default |
---|---|---|---|
provider | (string) deathbycaptcha | yes | |
username | (string) | yes | |
password | (string) | yes | |
Example
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': 'deathbycaptcha',
        'username': 'your_deathbycaptcha_username',
        'password': 'your_deathbycaptcha_password',
    }
)
9kw
Required captcha Parameters
Parameter | Value | Required | Default |
---|---|---|---|
provider | (string) 9kw | yes | |
api_key | (string) | yes | |
maxtimeout | (int) | no | 180 |
Example
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': '9kw',
        'api_key': 'your_9kw_api_key',
        'maxtimeout': 300
    }
)
return_response
Use this if you want the requests response payload without solving the Captcha.
Required captcha Parameters
Parameter | Value | Required | Default |
---|---|---|---|
provider | (string) return_response | yes | |
Example
scraper = cloudscraper.create_scraper(
    captcha={'provider': 'return_response'}
)
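Each provider above requires a slightly different set of keys in the `captcha` dict, so a small pre-flight check can catch a missing credential before any request is made. This is a sketch only: `check_captcha_config` is a hypothetical helper, and the required keys are taken from the tables above (the CapSolver entry uses the provider name from its example, `capsolver`).

```python
# Required keys per provider, as listed in the tables above.
REQUIRED_KEYS = {
    "2captcha": {"api_key"},
    "anticaptcha": {"api_key"},
    "capsolver": {"api_key"},  # name taken from the CapSolver example
    "capmonster": {"clientKey"},
    "deathbycaptcha": {"username", "password"},
    "9kw": {"api_key"},
    "return_response": set(),
}


def check_captcha_config(captcha):
    """Hypothetical helper: raise ValueError if a required key is missing."""
    provider = captcha.get("provider")
    if provider not in REQUIRED_KEYS:
        raise ValueError(f"unknown captcha provider: {provider!r}")
    # A key counts as missing when it is absent or empty.
    missing = sorted(key for key in REQUIRED_KEYS[provider] if not captcha.get(key))
    if missing:
        raise ValueError(f"{provider} is missing required keys: {missing}")
    return captcha


config = check_captcha_config({"provider": "9kw", "api_key": "your_9kw_api_key"})
print(config["provider"])
```

The validated dict would then be passed as `cloudscraper.create_scraper(captcha=config)`.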
Integration
It's easy to integrate cloudscraper with other applications and tools. Cloudflare uses two cookies as tokens: one to verify you made it past their challenge page and one to track your session. To bypass the challenge page, simply include both of these cookies (with the appropriate user-agent) in all HTTP requests you make.
To retrieve just the cookies (as a dictionary), use cloudscraper.get_tokens(). To retrieve them as a full Cookie HTTP header, use cloudscraper.get_cookie_string().
get_tokens and get_cookie_string both accept Requests' usual keyword arguments (like get_tokens(url, proxies={"http": "socks5://localhost:9050"})).
Please read Requests' documentation on request arguments for more information.
User-Agent Handling
The two integration functions return a tuple of (cookie, user_agent_string).
You must use the same user-agent string for obtaining tokens and for making requests with those tokens, otherwise Cloudflare will flag you as a bot.
That means you have to pass the returned user_agent_string to whatever script, tool, or service you are passing the tokens to (e.g. curl, or a specialized scraping tool), and it must use that passed user-agent when it makes HTTP requests.
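As an illustration of handing the tokens and user-agent to another HTTP client, the sketch below builds a standard-library `urllib.request.Request` that carries both. `build_request` is a hypothetical helper, and the token values are the sample values used elsewhere in this README; in practice they would come from `cloudscraper.get_tokens()`.

```python
import urllib.request


def build_request(url, tokens, user_agent):
    """Hypothetical helper: attach the cookies and the matching User-Agent."""
    # Join the cookie dict into a single Cookie header value.
    cookie_header = "; ".join(f"{name}={value}" for name, value in tokens.items())
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie_header, "User-Agent": user_agent},
    )


# Sample values; real ones would be returned by cloudscraper.get_tokens(url).
tokens = {
    "cf_clearance": "c8f913c707b818b47aa328d81cab57c349b1eee5-1426733163-3600",
    "__cfduid": "dd8ec03dfdbcb8c2ea63e920f1335c1001426733158",
}
request = build_request("http://somesite.com", tokens, "Some/User-Agent String")
print(request.get_header("Cookie"))
```

The request object is only constructed here, not sent; sending it with `urllib.request.urlopen(request)` would need to happen from the same IP address that solved the challenge.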
Integration examples
Remember, you must always use the same user-agent when retrieving or using these cookies. These functions all return a tuple of (cookie_dict, user_agent_string).
Retrieving a cookie dict through a proxy
get_tokens is a convenience function for returning a Python dict containing Cloudflare's session cookies. For demonstration, we will configure this request to use a proxy. (Please note that if you request Cloudflare clearance tokens through a proxy, you must always use the same proxy when those tokens are passed to the server. Cloudflare requires that the challenge-solving IP and the visitor IP stay the same.)
If you do not wish to use a proxy, just don't pass the proxies keyword argument. These convenience functions support all of Requests' normal keyword arguments, like params, data, and headers.
import cloudscraper

proxies = {"http": "http://localhost:8080", "https": "http://localhost:8080"}
tokens, user_agent = cloudscraper.get_tokens("http://somesite.com", proxies=proxies)
print(tokens)
# => { 'cf_clearance': 'c8f913c707b818b47aa328d81cab57c349b1eee5-1426733163-3600', '__cfduid': 'dd8ec03dfdbcb8c2ea63e920f1335c1001426733158' }
Retrieving a cookie string
get_cookie_string is a convenience function for returning the tokens as a string for use as a Cookie HTTP header value.
This is useful when crafting an HTTP request manually, or working with an external application or library that passes on raw cookie headers.
import cloudscraper

cookie_value, user_agent = cloudscraper.get_cookie_string('http://somesite.com')
print('GET / HTTP/1.1\nCookie: {}\nUser-Agent: {}\n'.format(cookie_value, user_agent))
# GET / HTTP/1.1
# Cookie: cf_clearance=c8f913c707b818b47aa328d81cab57c349b1eee5-1426733163-3600; __cfduid=dd8ec03dfdbcb8c2ea63e920f1335c1001426733158
# User-Agent: Some/User-Agent String
curl example
Here is an example of integrating cloudscraper with curl. As you can see, all you have to do is pass the cookies and user-agent to curl.
import subprocess
import cloudscraper

# With get_tokens() cookie dict:
# tokens, user_agent = cloudscraper.get_tokens("http://somesite.com")
# cookie_arg = 'cf_clearance={}; __cfduid={}'.format(tokens['cf_clearance'], tokens['__cfduid'])

# With get_cookie_string() cookie header; recommended for curl and similar external applications:
cookie_arg, user_agent = cloudscraper.get_cookie_string('http://somesite.com')

# With a custom user-agent string you can optionally provide:
# ua = "Scraping Bot"
# cookie_arg, user_agent = cloudscraper.get_cookie_string("http://somesite.com", user_agent=ua)

result = subprocess.check_output(
    [
        'curl',
        '--cookie',
        cookie_arg,
        '-A',
        user_agent,
        'http://somesite.com'
    ]
)
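Trimmed down version. Prints page contents of any site protected with Cloudflare, via curl.
Warning: shell=True can be dangerous to use with subprocess in real code.
import shlex
import subprocess
import cloudscraper

url = "http://somesite.com"
cookie_arg, user_agent = cloudscraper.get_cookie_string(url)
# Quote each value: the cookie contains "; " and the user-agent usually contains
# spaces, both of which the shell would otherwise split on.
cmd = "curl --cookie {cookie_arg} -A {user_agent} {url}"
print(
    subprocess.check_output(
        cmd.format(
            cookie_arg=shlex.quote(cookie_arg),
            user_agent=shlex.quote(user_agent),
            url=shlex.quote(url)
        ),
        shell=True
    )
)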
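One way to sidestep the shell entirely is to keep the command as an argument list and only render it as a string for logging. The sketch below assumes Python 3.8+ (for `shlex.join`); `build_curl_argv` is a hypothetical helper and the cookie and user-agent values are placeholders.

```python
import shlex


def build_curl_argv(url, cookie_arg, user_agent):
    """Hypothetical helper: curl arguments as a list, safe to pass without a shell."""
    return ["curl", "--cookie", cookie_arg, "-A", user_agent, url]


# Placeholder values; real ones would come from cloudscraper.get_cookie_string(url).
argv = build_curl_argv(
    "http://somesite.com",
    "cf_clearance=abc; __cfduid=def",
    "Some/User-Agent String",
)

# shlex.join quotes each argument, so the string is safe to paste into a shell.
print(shlex.join(argv))
```

The list itself can be passed straight to `subprocess.check_output(argv)`, which avoids `shell=True` and the quoting problem altogether.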
Cryptography
Description
Control the TLS settings used for communication between the client and the server.
Parameters
Can be passed as an argument to create_scraper().
Parameter | Value | Default |
---|---|---|
cipherSuite | (string) | None |
ecdhCurve | (string) | prime256v1 |
server_hostname | (string) | None |
Example
# Some servers require the use of a more complex ecdh curve than the default "prime256v1"
# It may solve handshake failures
scraper = cloudscraper.create_scraper(ecdhCurve='secp384r1')

# Manipulate server_hostname
scraper = cloudscraper.create_scraper(server_hostname='www.somesite.com')
scraper.get(
    'https://backend.hosting.com/',
    headers={'Host': 'www.somesite.com'}
)
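For readers wondering what a cipher suite and an ECDH curve look like at the TLS layer, here is a sketch using only Python's standard `ssl` module. It is an assumption, not something this README states, that cloudscraper applies `cipherSuite` and `ecdhCurve` to an `ssl.SSLContext` in a similar way; `make_tls_context` is a hypothetical helper.

```python
import ssl


def make_tls_context(cipher_suite=None, ecdh_curve="prime256v1"):
    """Hypothetical helper: a client TLS context with the given settings."""
    context = ssl.create_default_context()
    if cipher_suite is not None:
        # OpenSSL cipher list string, e.g. "ECDHE+AESGCM".
        context.set_ciphers(cipher_suite)
    # Curve used for ECDH key exchange.
    context.set_ecdh_curve(ecdh_curve)
    return context


context = make_tls_context(cipher_suite="ECDHE+AESGCM", ecdh_curve="secp384r1")
print(len(context.get_ciphers()) > 0)
```

A context like this only affects the handshake; the `server_hostname` option shown above is a separate setting and is not covered by this sketch.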