
Commit 98707e1

Merge pull request scrapy#1324 from scrapy/autothrottle

Autothrottle enhancements

2 parents 7c98855 + d850238

File tree: 4 files changed (+117, −33 lines)

docs/topics/autothrottle.rst

Lines changed: 81 additions & 19 deletions

@@ -12,39 +12,73 @@ Design goals
 
 1. be nicer to sites instead of using default download delay of zero
 2. automatically adjust scrapy to the optimum crawling speed, so the user
-   doesn't have to tune the download delays and concurrent requests to find the
-   optimum one. The user only needs to specify the maximum concurrent requests
+   doesn't have to tune the download delays to find the optimum one.
+   The user only needs to specify the maximum concurrent requests
    it allows, and the extension does the rest.
 
+.. _autothrottle-algorithm:
+
 How it works
 ============
 
-In Scrapy, the download latency is measured as the time elapsed between
-establishing the TCP connection and receiving the HTTP headers.
+The AutoThrottle extension adjusts download delays dynamically so that the
+spider sends :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` concurrent requests
+on average to each remote website.
 
-Note that these latencies are very hard to measure accurately in a cooperative
-multitasking environment because Scrapy may be busy processing a spider
-callback, for example, and unable to attend downloads. However, these latencies
-should still give a reasonable estimate of how busy Scrapy (and ultimately, the
-server) is, and this extension builds on that premise.
+It uses download latency to compute the delays. The main idea is the
+following: if a server needs ``latency`` seconds to respond, a client
+should send a request each ``latency/N`` seconds to have ``N`` requests
+processed in parallel.
 
-.. _autothrottle-algorithm:
+Instead of adjusting the delays, one can just set a small fixed
+download delay and impose hard limits on concurrency using the
+:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
+:setting:`CONCURRENT_REQUESTS_PER_IP` options. That will provide a similar
+effect, but there are some important differences:
+
+* because the download delay is small, there will be occasional bursts
+  of requests;
+* non-200 (error) responses can often be returned faster than regular
+  responses, so with a small download delay and a hard concurrency limit
+  the crawler will send requests to the server faster when the server
+  starts to return errors. But this is the opposite of what a crawler
+  should do - in case of errors it makes more sense to slow down: the
+  errors may be caused by the high request rate.
+
+AutoThrottle doesn't have these issues.
 
 Throttling algorithm
 ====================
 
-This adjusts download delays and concurrency based on the following rules:
+The AutoThrottle algorithm adjusts download delays based on the following rules:
 
-1. spiders always start with one concurrent request and a download delay of
-   :setting:`AUTOTHROTTLE_START_DELAY`
-2. when a response is received, the download delay is adjusted to the
-   average of previous download delay and the latency of the response.
+1. spiders always start with a download delay of
+   :setting:`AUTOTHROTTLE_START_DELAY`;
+2. when a response is received, the target download delay is calculated as
+   ``latency / N``, where ``latency`` is the latency of the response
+   and ``N`` is :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY`;
+3. the download delay for the next requests is set to the average of the
+   previous download delay and the target download delay;
+4. latencies of non-200 responses are not allowed to decrease the delay;
+5. the download delay can't become less than :setting:`DOWNLOAD_DELAY` or
+   greater than :setting:`AUTOTHROTTLE_MAX_DELAY`.
 
 .. note:: The AutoThrottle extension honours the standard Scrapy settings for
-   concurrency and delay. This means that it will never set a download delay
-   lower than :setting:`DOWNLOAD_DELAY` or a concurrency higher than
-   :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
-   (or :setting:`CONCURRENT_REQUESTS_PER_IP`, depending on which one you use).
+   concurrency and delay. This means that it will respect the
+   :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
+   :setting:`CONCURRENT_REQUESTS_PER_IP` options and
+   never set a download delay lower than :setting:`DOWNLOAD_DELAY`.
+
+.. _download-latency:
+
+In Scrapy, the download latency is measured as the time elapsed between
+establishing the TCP connection and receiving the HTTP headers.
+
+Note that these latencies are very hard to measure accurately in a cooperative
+multitasking environment because Scrapy may be busy processing a spider
+callback, for example, and unable to attend downloads. However, these latencies
+should still give a reasonable estimate of how busy Scrapy (and ultimately, the
+server) is, and this extension builds on that premise.
 
 Settings
 ========
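To make the throttling rules in the hunk above concrete, here is a minimal standalone sketch of the delay-adjustment math (illustrative only; the constant names mirror the settings, but this is not the extension's actual code and the sample latencies are made up):

# Minimal sketch of the five throttling rules, not the extension code.

START_DELAY = 5.0          # AUTOTHROTTLE_START_DELAY
MIN_DELAY = 0.0            # DOWNLOAD_DELAY
MAX_DELAY = 60.0           # AUTOTHROTTLE_MAX_DELAY
TARGET_CONCURRENCY = 1.0   # AUTOTHROTTLE_TARGET_CONCURRENCY

def next_delay(delay, latency, status=200):
    # Rule 2: the delay that would keep N requests in flight.
    target = latency / TARGET_CONCURRENCY
    # Rule 3: move halfway toward the target; if the target is larger
    # than the mean, jump straight to it (reacts faster to slow sites).
    new_delay = max(target, (delay + target) / 2.0)
    # Rule 5: clamp between the minimum and maximum delays.
    new_delay = min(max(MIN_DELAY, new_delay), MAX_DELAY)
    # Rule 4: non-200 responses may only increase the delay.
    if status != 200 and new_delay <= delay:
        return delay
    return new_delay

delay = START_DELAY                       # Rule 1: start from the fixed delay
for latency in (0.4, 0.4, 0.4, 0.4):      # a site answering in 0.4 s
    delay = next_delay(delay, latency)
    print(round(delay, 3))                # 2.7, 1.55, 0.975, 0.688

Each response halves the gap to ``latency / N``, so the delay converges toward 0.4 s, i.e. roughly one request in flight at a time.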
@@ -88,6 +122,34 @@ Default: ``60.0``
 
 The maximum download delay (in seconds) to be set in case of high latencies.
 
+.. setting:: AUTOTHROTTLE_TARGET_CONCURRENCY
+
+AUTOTHROTTLE_TARGET_CONCURRENCY
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Default: ``1.0``
+
+Average number of requests Scrapy should be sending in parallel to remote
+websites.
+
+By default, AutoThrottle adjusts the delay to send a single
+concurrent request to each of the remote websites. Set this option to
+a higher value (e.g. ``2.0``) to increase the throughput and the load on
+remote servers. A lower ``AUTOTHROTTLE_TARGET_CONCURRENCY`` value
+(e.g. ``0.5``) makes the crawler more conservative and polite.
+
+Note that the :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
+and :setting:`CONCURRENT_REQUESTS_PER_IP` options are still respected
+when the AutoThrottle extension is enabled. This means that if
+``AUTOTHROTTLE_TARGET_CONCURRENCY`` is set to a value higher than
+:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
+:setting:`CONCURRENT_REQUESTS_PER_IP`, the crawler won't reach this number
+of concurrent requests.
+
+At any given time point Scrapy can be sending more or fewer concurrent
+requests than ``AUTOTHROTTLE_TARGET_CONCURRENCY``; it is a suggested
+value the crawler tries to approach, not a hard limit.
+
 .. setting:: AUTOTHROTTLE_DEBUG
 
 AUTOTHROTTLE_DEBUG
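For reference, enabling the extension from a project's settings.py might look like the sketch below; the setting names come from this patch, but the values are illustrative, not recommendations:

# settings.py (illustrative values)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound for the adjusted delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # aim for ~2 parallel requests per site
AUTOTHROTTLE_DEBUG = True               # log throttling stats per response

# Hard caps still apply: the target concurrency can never exceed this limit.
CONCURRENT_REQUESTS_PER_DOMAIN = 8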

docs/topics/settings.rst

Lines changed: 7 additions & 4 deletions

@@ -187,7 +187,6 @@ Default: ``16``
 The maximum number of concurrent (i.e. simultaneous) requests that will be
 performed by the Scrapy downloader.
 
-
 .. setting:: CONCURRENT_REQUESTS_PER_DOMAIN
 
 CONCURRENT_REQUESTS_PER_DOMAIN

@@ -198,6 +197,10 @@ Default: ``8``
 The maximum number of concurrent (i.e. simultaneous) requests that will be
 performed to any single domain.
 
+See also: :ref:`topics-autothrottle` and its
+:setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` option.
+
 .. setting:: CONCURRENT_REQUESTS_PER_IP
 
 CONCURRENT_REQUESTS_PER_IP

@@ -211,9 +214,9 @@ performed to any single IP. If non-zero, the
 used instead. In other words, concurrency limits will be applied per IP, not
 per domain.
 
-This setting also affects :setting:`DOWNLOAD_DELAY`:
-if :setting:`CONCURRENT_REQUESTS_PER_IP` is non-zero, download delay is
-enforced per IP, not per domain.
+This setting also affects :setting:`DOWNLOAD_DELAY` and
+:ref:`topics-autothrottle`: if :setting:`CONCURRENT_REQUESTS_PER_IP`
+is non-zero, download delay is enforced per IP, not per domain.
 
 
 .. setting:: DEFAULT_ITEM_CLASS
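A short sketch of the per-IP behaviour described in the last hunk (hypothetical settings.py values):

# With a non-zero per-IP limit, concurrency caps (and, per the paragraph
# above, the download delay) are enforced per IP address rather than per
# domain, and the per-domain limit is ignored.
CONCURRENT_REQUESTS_PER_IP = 4
AUTOTHROTTLE_ENABLED = True   # AutoThrottle then adjusts delays per IP as well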

scrapy/extensions/throttle.py

Lines changed: 23 additions & 10 deletions

@@ -14,6 +14,7 @@ def __init__(self, crawler):
             raise NotConfigured
 
         self.debug = crawler.settings.getbool("AUTOTHROTTLE_DEBUG")
+        self.target_concurrency = crawler.settings.getfloat("AUTOTHROTTLE_TARGET_CONCURRENCY")
         crawler.signals.connect(self._spider_opened, signal=signals.spider_opened)
         crawler.signals.connect(self._response_downloaded, signal=signals.response_downloaded)

@@ -28,15 +29,13 @@ def _spider_opened(self, spider):
 
     def _min_delay(self, spider):
        s = self.crawler.settings
-        return getattr(spider, 'download_delay', 0.0) or \
-            s.getfloat('AUTOTHROTTLE_MIN_DOWNLOAD_DELAY') or \
-            s.getfloat('DOWNLOAD_DELAY')
+        return getattr(spider, 'download_delay', s.getfloat('DOWNLOAD_DELAY'))
 
     def _max_delay(self, spider):
-        return self.crawler.settings.getfloat('AUTOTHROTTLE_MAX_DELAY', 60.0)
+        return self.crawler.settings.getfloat('AUTOTHROTTLE_MAX_DELAY')
 
     def _start_delay(self, spider):
-        return max(self.mindelay, self.crawler.settings.getfloat('AUTOTHROTTLE_START_DELAY', 5.0))
+        return max(self.mindelay, self.crawler.settings.getfloat('AUTOTHROTTLE_START_DELAY'))
 
     def _response_downloaded(self, response, request, spider):
         key, slot = self._get_slot(request, spider)

@@ -68,13 +67,27 @@ def _get_slot(self, request, spider):
 
     def _adjust_delay(self, slot, latency, response):
         """Define delay adjustment policy"""
-        # If latency is bigger than old delay, then use latency instead of mean.
-        # It works better with problematic sites
-        new_delay = min(max(self.mindelay, latency, (slot.delay + latency) / 2.0), self.maxdelay)
+
+        # If a server needs `latency` seconds to respond then
+        # we should send a request each `latency/N` seconds
+        # to have N requests processed in parallel
+        target_delay = latency / self.target_concurrency
+
+        # Adjust the delay to make it closer to target_delay
+        new_delay = (slot.delay + target_delay) / 2.0
+
+        # If target delay is bigger than old delay, then use it instead of
+        # the mean. It works better with problematic sites.
+        new_delay = max(target_delay, new_delay)
+
+        # Make sure self.mindelay <= new_delay <= self.maxdelay
+        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
 
         # Don't adjust delay if response status != 200 and new delay is smaller
         # than old one, as error pages (and redirections) are usually small and
         # so tend to reduce latency, thus provoking a positive feedback loop by
         # reducing the delay instead of increasing it.
-        if response.status == 200 or new_delay > slot.delay:
-            slot.delay = new_delay
+        if response.status != 200 and new_delay <= slot.delay:
+            return
+
+        slot.delay = new_delay
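To see why the non-200 guard at the end matters, here is a hedged sketch that replays the same policy against a stand-in slot object (FakeSlot is hypothetical; Scrapy's real downloader slot carries more state):

# Replaying _adjust_delay's logic against a fast error response.
class FakeSlot:
    def __init__(self, delay):
        self.delay = delay

min_delay, max_delay, target_concurrency = 0.0, 60.0, 1.0

def adjust(slot, latency, status):
    target_delay = latency / target_concurrency
    new_delay = max(target_delay, (slot.delay + target_delay) / 2.0)
    new_delay = min(max(min_delay, new_delay), max_delay)
    if status != 200 and new_delay <= slot.delay:
        return                    # fast error pages never speed us up
    slot.delay = new_delay

slot = FakeSlot(delay=2.0)
adjust(slot, latency=0.1, status=503)   # fast error: guard keeps the delay
print(slot.delay)                       # 2.0
adjust(slot, latency=0.1, status=200)   # fast success: delay may drop
print(slot.delay)                       # 1.05

Without the guard, a burst of quick 5xx pages would drag the delay down and make the crawler hammer an already struggling server; with it, only successful responses (or slower errors) can move the delay downward.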

scrapy/settings/default_settings.py

Lines changed: 6 additions & 0 deletions

@@ -20,6 +20,12 @@
 
 AJAXCRAWL_ENABLED = False
 
+AUTOTHROTTLE_ENABLED = False
+AUTOTHROTTLE_DEBUG = False
+AUTOTHROTTLE_MAX_DELAY = 60.0
+AUTOTHROTTLE_START_DELAY = 5.0
+AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
+
 BOT_NAME = 'scrapybot'
 
 CLOSESPIDER_TIMEOUT = 0
