Commit d850238c authored by Mikhail Korobov

add AUTOTHROTTLE_TARGET_CONCURRENCY option and expand AutoThrottle docs

Parent 63317531
@@ -12,33 +12,56 @@ Design goals

1. be nicer to sites instead of using the default download delay of zero
2. automatically adjust Scrapy to the optimum crawling speed, so the user
   doesn't have to tune the download delays to find the optimum one.
   The user only needs to specify the maximum concurrent requests
   to allow, and the extension does the rest.

How it works
============

The AutoThrottle extension adjusts download delays dynamically to make the
spider send :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` concurrent requests
on average to each remote website.

It uses download latency to compute the delays. The main idea is the
following: if a server needs ``latency`` seconds to respond, a client
should send a request every ``latency/N`` seconds to have ``N`` requests
processed in parallel.
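
For example (illustrative numbers): if a server takes ``latency = 0.6``
seconds to respond and the target is ``N = 3`` requests in parallel, the
client should send a request every ``0.6 / 3 = 0.2`` seconds.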

Instead of adjusting the delays one can just set a small fixed
download delay and impose hard limits on concurrency using the
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
:setting:`CONCURRENT_REQUESTS_PER_IP` options. This provides a similar
effect, but there are some important differences:

* because the download delay is small, there will be occasional bursts
  of requests;
* non-200 (error) responses are often returned faster than regular
  responses, so with a small download delay and a hard concurrency limit
  the crawler will be sending requests to the server faster when the
  server starts to return errors. But this is the opposite of what a
  crawler should do: in case of errors it makes more sense to slow down,
  as these errors may be caused by the high request rate.

AutoThrottle doesn't have these issues.
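
For comparison, the fixed-delay alternative described above would look
roughly like this in a project's ``settings.py`` (values are illustrative,
not recommendations)::

    AUTOTHROTTLE_ENABLED = False        # no adaptive throttling
    DOWNLOAD_DELAY = 0.25               # small fixed delay -> occasional bursts
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # hard per-domain concurrency cap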

.. _autothrottle-algorithm:

Throttling algorithm
====================

The AutoThrottle algorithm adjusts download delays based on the following
rules (a code sketch follows the list):

1. spiders always start with a download delay of
   :setting:`AUTOTHROTTLE_START_DELAY`;
2. when a response is received, the target download delay is calculated as
   ``latency / N``, where ``latency`` is the latency of the response
   and ``N`` is :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY`;
3. the download delay for the next requests is set to the average of the
   previous download delay and the target download delay;
4. latencies of non-200 responses are not allowed to decrease the delay;
5. the download delay can't become less than :setting:`DOWNLOAD_DELAY` or
   greater than :setting:`AUTOTHROTTLE_MAX_DELAY`.
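
A minimal sketch of rules 2-5 in plain Python (an illustration of the
update rule, not the actual Scrapy implementation; all names are made up
for the example)::

    def next_delay(prev_delay, latency, status,
                   target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
        """Compute the next download delay from the latest response."""
        target_delay = latency / target_concurrency     # rule 2
        new_delay = (prev_delay + target_delay) / 2.0   # rule 3
        if status != 200 and new_delay < prev_delay:    # rule 4
            new_delay = prev_delay
        # rule 5: clamp to the [min_delay, max_delay] range
        return min(max(min_delay, new_delay), max_delay)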

.. note:: The AutoThrottle extension honours the standard Scrapy settings for
   concurrency and delay. This means that it will respect

@@ -46,6 +69,17 @@ This adjusts download delays based on the following rules:

   the :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
   :setting:`CONCURRENT_REQUESTS_PER_IP` options and
   never set a download delay lower than :setting:`DOWNLOAD_DELAY`.

.. _download-latency:

In Scrapy, the download latency is measured as the time elapsed between
establishing the TCP connection and receiving the HTTP headers.

Note that these latencies are very hard to measure accurately in a cooperative
multitasking environment because Scrapy may be busy processing a spider
callback, for example, and unable to attend to downloads. However, these
latencies should still give a reasonable estimate of how busy Scrapy (and
ultimately, the server) is, and this extension builds on that premise.

Settings
========

@@ -88,6 +122,34 @@ Default: ``60.0``

The maximum download delay (in seconds) to be set in case of high latencies.

.. setting:: AUTOTHROTTLE_TARGET_CONCURRENCY

AUTOTHROTTLE_TARGET_CONCURRENCY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Default: ``1.0``

Average number of requests Scrapy should be sending in parallel to remote
websites.

By default, AutoThrottle adjusts the delay to send a single
concurrent request to each of the remote websites. Set this option to
a higher value (e.g. ``2.0``) to increase the throughput and the load on
remote servers. A lower ``AUTOTHROTTLE_TARGET_CONCURRENCY`` value
(e.g. ``0.5``) makes the crawler more conservative and polite.

Note that the :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
and :setting:`CONCURRENT_REQUESTS_PER_IP` options are still respected
when the AutoThrottle extension is enabled. This means that if
``AUTOTHROTTLE_TARGET_CONCURRENCY`` is set to a value higher than
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
:setting:`CONCURRENT_REQUESTS_PER_IP`, the crawler won't reach this number
of concurrent requests.

At any given time Scrapy can be sending more or fewer concurrent
requests than ``AUTOTHROTTLE_TARGET_CONCURRENCY``; it is a suggested
value the crawler tries to approach, not a hard limit.
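
For example, to aim for two concurrent requests per remote site, a
project's ``settings.py`` might contain (illustrative values)::

    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
    # hard caps still apply; keep them at or above the target
    CONCURRENT_REQUESTS_PER_DOMAIN = 8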

.. setting:: AUTOTHROTTLE_DEBUG

AUTOTHROTTLE_DEBUG
~~~~~~~~~~~~~~~~~~

@@ -187,7 +187,6 @@ Default: ``16``

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed by the Scrapy downloader.

.. setting:: CONCURRENT_REQUESTS_PER_DOMAIN

CONCURRENT_REQUESTS_PER_DOMAIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -198,6 +197,10 @@ Default: ``8``

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed to any single domain.

See also: :ref:`topics-autothrottle` and its
:setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` option.

.. setting:: CONCURRENT_REQUESTS_PER_IP

CONCURRENT_REQUESTS_PER_IP
~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -211,9 +214,9 @@ performed to any single IP. If non-zero, the

used instead. In other words, concurrency limits will be applied per IP, not
per domain.

This setting also affects :setting:`DOWNLOAD_DELAY` and
:ref:`topics-autothrottle`: if :setting:`CONCURRENT_REQUESTS_PER_IP`
is non-zero, download delay is enforced per IP, not per domain.
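
For example (illustrative values), to throttle per IP rather than per
domain::

    # non-zero: concurrency limits and delays apply per IP;
    # CONCURRENT_REQUESTS_PER_DOMAIN is ignored while this is non-zero
    CONCURRENT_REQUESTS_PER_IP = 4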

.. setting:: DEFAULT_ITEM_CLASS

@@ -14,6 +14,7 @@ class AutoThrottle(object):

            raise NotConfigured
        self.debug = crawler.settings.getbool("AUTOTHROTTLE_DEBUG")
        self.target_concurrency = crawler.settings.getfloat("AUTOTHROTTLE_TARGET_CONCURRENCY")
        crawler.signals.connect(self._spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(self._response_downloaded, signal=signals.response_downloaded)

@@ -67,12 +68,17 @@ class AutoThrottle(object):

    def _adjust_delay(self, slot, latency, response):
        """Define delay adjustment policy"""
        # If a server needs `latency` seconds to respond then
        # we should send a request each `latency/N` seconds
        # to have N requests processed in parallel
        target_delay = latency / self.target_concurrency

        # Adjust the delay to make it closer to target_delay
        new_delay = (slot.delay + target_delay) / 2.0

        # If target delay is bigger than old delay, then use it instead of mean.
        # It works better with problematic sites.
        new_delay = max(target_delay, new_delay)

        # Make sure self.mindelay <= new_delay <= self.maxdelay
        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
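
        # Worked example (hypothetical numbers): with target_concurrency=2.0,
        # slot.delay=1.0 and latency=0.5, target_delay is 0.25 and the mean
        # is 0.625; target_delay is smaller, so the mean is kept and then
        # clamped to the [self.mindelay, self.maxdelay] range.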

@@ -24,6 +24,7 @@ AUTOTHROTTLE_ENABLED = False

AUTOTHROTTLE_DEBUG = False
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

BOT_NAME = 'scrapybot'