Commit d850238c authored by Mikhail Korobov

add AUTOTHROTTLE_TARGET_CONCURRENCY option and expand AutoThrottle docs

Parent 63317531
@@ -12,33 +12,56 @@ Design goals

1. be nicer to sites instead of using the default download delay of zero
2. automatically adjust Scrapy to the optimum crawling speed, so the user
   doesn't have to tune the download delays to find the optimum one.
   The user only needs to specify the maximum concurrent requests
   to allow, and the extension does the rest.

How it works
============

The AutoThrottle extension adjusts download delays dynamically to make the
spider send :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` concurrent requests
on average to each remote website.

It uses download latency to compute the delays. The main idea is the
following: if a server needs ``latency`` seconds to respond, a client
should send a request every ``latency/N`` seconds to have ``N`` requests
processed in parallel.
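
For example (illustrative numbers): if a server takes ``latency = 0.6``
seconds to respond and the target is ``N = 3`` requests in parallel, the
client should send a request every ``0.6 / 3 = 0.2`` seconds.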

Instead of adjusting the delays one can just set a small fixed
download delay and impose hard limits on concurrency using the
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
:setting:`CONCURRENT_REQUESTS_PER_IP` options. This provides a similar
effect, but there are some important differences:

* because the download delay is small, there will be occasional bursts
  of requests;
* non-200 (error) responses are often returned faster than regular
  responses, so with a small download delay and a hard concurrency limit
  the crawler will be sending requests to the server faster when the
  server starts to return errors. But this is the opposite of what a
  crawler should do: in case of errors it makes more sense to slow down,
  as these errors may be caused by the high request rate.

AutoThrottle doesn't have these issues.
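
For comparison, the fixed-delay alternative described above would look
roughly like this in a project's ``settings.py`` (values are illustrative,
not recommendations)::

    AUTOTHROTTLE_ENABLED = False        # no adaptive throttling
    DOWNLOAD_DELAY = 0.25               # small fixed delay -> occasional bursts
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # hard per-domain concurrency cap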

.. _autothrottle-algorithm:

Throttling algorithm
====================

The AutoThrottle algorithm adjusts download delays based on the following
rules (a code sketch follows the list):

1. spiders always start with a download delay of
   :setting:`AUTOTHROTTLE_START_DELAY`;
2. when a response is received, the target download delay is calculated as
   ``latency / N``, where ``latency`` is the latency of the response
   and ``N`` is :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY`;
3. the download delay for the next requests is set to the average of the
   previous download delay and the target download delay;
4. latencies of non-200 responses are not allowed to decrease the delay;
5. the download delay can't become less than :setting:`DOWNLOAD_DELAY` or
   greater than :setting:`AUTOTHROTTLE_MAX_DELAY`.
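
A minimal sketch of rules 2-5 in plain Python (an illustration of the
update rule, not the actual Scrapy implementation; all names are made up
for the example)::

    def next_delay(prev_delay, latency, status,
                   target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
        """Compute the next download delay from the latest response."""
        target_delay = latency / target_concurrency     # rule 2
        new_delay = (prev_delay + target_delay) / 2.0   # rule 3
        if status != 200 and new_delay < prev_delay:    # rule 4
            new_delay = prev_delay
        # rule 5: clamp to the [min_delay, max_delay] range
        return min(max(min_delay, new_delay), max_delay)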

.. note:: The AutoThrottle extension honours the standard Scrapy settings for
   concurrency and delay. This means that it will respect

@@ -46,6 +69,17 @@ This adjusts download delays based on the following rules:

   the :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
   :setting:`CONCURRENT_REQUESTS_PER_IP` options and
   never set a download delay lower than :setting:`DOWNLOAD_DELAY`.

.. _download-latency:

In Scrapy, the download latency is measured as the time elapsed between
establishing the TCP connection and receiving the HTTP headers.

Note that these latencies are very hard to measure accurately in a cooperative
multitasking environment because Scrapy may be busy processing a spider
callback, for example, and unable to attend to downloads. However, these
latencies should still give a reasonable estimate of how busy Scrapy (and
ultimately, the server) is, and this extension builds on that premise.

Settings
========

@@ -88,6 +122,34 @@ Default: ``60.0``

The maximum download delay (in seconds) to be set in case of high latencies.

.. setting:: AUTOTHROTTLE_TARGET_CONCURRENCY

AUTOTHROTTLE_TARGET_CONCURRENCY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Default: ``1.0``

Average number of requests Scrapy should be sending in parallel to remote
websites.

By default, AutoThrottle adjusts the delay to send a single
concurrent request to each of the remote websites. Set this option to
a higher value (e.g. ``2.0``) to increase the throughput and the load on
remote servers. A lower ``AUTOTHROTTLE_TARGET_CONCURRENCY`` value
(e.g. ``0.5``) makes the crawler more conservative and polite.

Note that the :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
and :setting:`CONCURRENT_REQUESTS_PER_IP` options are still respected
when the AutoThrottle extension is enabled. This means that if
``AUTOTHROTTLE_TARGET_CONCURRENCY`` is set to a value higher than
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
:setting:`CONCURRENT_REQUESTS_PER_IP`, the crawler won't reach this number
of concurrent requests.

At any given time Scrapy can be sending more or fewer concurrent
requests than ``AUTOTHROTTLE_TARGET_CONCURRENCY``; it is a suggested
value the crawler tries to approach, not a hard limit.
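
For example, to aim for two concurrent requests per remote site, a
project's ``settings.py`` might contain (illustrative values)::

    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
    # hard caps still apply; keep them at or above the target
    CONCURRENT_REQUESTS_PER_DOMAIN = 8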

.. setting:: AUTOTHROTTLE_DEBUG

AUTOTHROTTLE_DEBUG
~~~~~~~~~~~~~~~~~~

@@ -187,7 +187,6 @@ Default: ``16``

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed by the Scrapy downloader.

.. setting:: CONCURRENT_REQUESTS_PER_DOMAIN

CONCURRENT_REQUESTS_PER_DOMAIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -198,6 +197,10 @@ Default: ``8``

The maximum number of concurrent (i.e. simultaneous) requests that will be
performed to any single domain.

See also: :ref:`topics-autothrottle` and its
:setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` option.

.. setting:: CONCURRENT_REQUESTS_PER_IP

CONCURRENT_REQUESTS_PER_IP
~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -211,9 +214,9 @@ performed to any single IP. If non-zero, the

used instead. In other words, concurrency limits will be applied per IP, not
per domain.

This setting also affects :setting:`DOWNLOAD_DELAY` and
:ref:`topics-autothrottle`: if :setting:`CONCURRENT_REQUESTS_PER_IP`
is non-zero, download delay is enforced per IP, not per domain.
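
For example (illustrative values), to throttle per IP rather than per
domain::

    # non-zero: concurrency limits and delays apply per IP;
    # CONCURRENT_REQUESTS_PER_DOMAIN is ignored while this is non-zero
    CONCURRENT_REQUESTS_PER_IP = 4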

.. setting:: DEFAULT_ITEM_CLASS

@@ -14,6 +14,7 @@ class AutoThrottle(object):

            raise NotConfigured
        self.debug = crawler.settings.getbool("AUTOTHROTTLE_DEBUG")
        self.target_concurrency = crawler.settings.getfloat("AUTOTHROTTLE_TARGET_CONCURRENCY")
        crawler.signals.connect(self._spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(self._response_downloaded, signal=signals.response_downloaded)

@@ -67,12 +68,17 @@ class AutoThrottle(object):

    def _adjust_delay(self, slot, latency, response):
        """Define delay adjustment policy"""
        # If a server needs `latency` seconds to respond then
        # we should send a request each `latency/N` seconds
        # to have N requests processed in parallel
        target_delay = latency / self.target_concurrency

        # Adjust the delay to make it closer to target_delay
        new_delay = (slot.delay + target_delay) / 2.0

        # If target delay is bigger than old delay, then use it instead of mean.
        # It works better with problematic sites.
        new_delay = max(target_delay, new_delay)

        # Make sure self.mindelay <= new_delay <= self.maxdelay
        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
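
        # Worked example (hypothetical numbers): with target_concurrency=2.0,
        # slot.delay=1.0 and latency=0.5, target_delay is 0.25 and the mean
        # is 0.625; target_delay is smaller, so the mean is kept and then
        # clamped to the [self.mindelay, self.maxdelay] range.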

@@ -24,6 +24,7 @@ AUTOTHROTTLE_ENABLED = False

AUTOTHROTTLE_DEBUG = False
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

BOT_NAME = 'scrapybot'