
Commit 98707e1

Merge pull request scrapy#1324 from scrapy/autothrottle

Autothrottle enhancements

2 parents 7c98855 + d850238

File tree: 4 files changed (+117, −33 lines)

docs/topics/autothrottle.rst

Lines changed: 81 additions & 19 deletions

@@ -12,39 +12,73 @@ Design goals
 
 1. be nicer to sites instead of using default download delay of zero
 2. automatically adjust scrapy to the optimum crawling speed, so the user
-   doesn't have to tune the download delays and concurrent requests to find the
-   optimum one. The user only needs to specify the maximum concurrent requests
+   doesn't have to tune the download delays to find the optimum one.
+   The user only needs to specify the maximum concurrent requests
    it allows, and the extension does the rest.
 
+.. _autothrottle-algorithm:
+
 How it works
 ============
 
-In Scrapy, the download latency is measured as the time elapsed between
-establishing the TCP connection and receiving the HTTP headers.
+The AutoThrottle extension adjusts download delays dynamically so that the
+spider sends :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` concurrent requests
+on average to each remote website.
 
-Note that these latencies are very hard to measure accurately in a cooperative
-multitasking environment because Scrapy may be busy processing a spider
-callback, for example, and unable to attend downloads. However, these latencies
-should still give a reasonable estimate of how busy Scrapy (and ultimately, the
-server) is, and this extension builds on that premise.
+It uses download latency to compute the delays. The main idea is the
+following: if a server needs ``latency`` seconds to respond, a client
+should send a request each ``latency/N`` seconds to have ``N`` requests
+processed in parallel.
 
-.. _autothrottle-algorithm:
+Instead of adjusting the delays, one can just set a small fixed
+download delay and impose hard limits on concurrency using the
+:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
+:setting:`CONCURRENT_REQUESTS_PER_IP` options. That will provide a similar
+effect, but there are some important differences:
+
+* because the download delay is small, there will be occasional bursts
+  of requests;
+* non-200 (error) responses can often be returned faster than regular
+  responses, so with a small download delay and a hard concurrency limit
+  the crawler will send requests to the server faster when the server
+  starts to return errors. But this is the opposite of what a crawler
+  should do - in case of errors it makes more sense to slow down: the
+  errors may be caused by the high request rate.
+
+AutoThrottle doesn't have these issues.
 
 Throttling algorithm
 ====================
 
-This adjusts download delays and concurrency based on the following rules:
+The AutoThrottle algorithm adjusts download delays based on the following rules:
 
-1. spiders always start with one concurrent request and a download delay of
-   :setting:`AUTOTHROTTLE_START_DELAY`
-2. when a response is received, the download delay is adjusted to the
-   average of previous download delay and the latency of the response.
+1. spiders always start with a download delay of
+   :setting:`AUTOTHROTTLE_START_DELAY`;
+2. when a response is received, the target download delay is calculated as
+   ``latency / N``, where ``latency`` is the latency of the response
+   and ``N`` is :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY`;
+3. the download delay for the next requests is set to the average of the
+   previous download delay and the target download delay;
+4. latencies of non-200 responses are not allowed to decrease the delay;
+5. the download delay can't become less than :setting:`DOWNLOAD_DELAY` or
+   greater than :setting:`AUTOTHROTTLE_MAX_DELAY`.
 
 .. note:: The AutoThrottle extension honours the standard Scrapy settings for
-   concurrency and delay. This means that it will never set a download delay
-   lower than :setting:`DOWNLOAD_DELAY` or a concurrency higher than
-   :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
-   (or :setting:`CONCURRENT_REQUESTS_PER_IP`, depending on which one you use).
+   concurrency and delay. This means that it will respect the
+   :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
+   :setting:`CONCURRENT_REQUESTS_PER_IP` options and
+   never set a download delay lower than :setting:`DOWNLOAD_DELAY`.
+
+.. _download-latency:
+
+In Scrapy, the download latency is measured as the time elapsed between
+establishing the TCP connection and receiving the HTTP headers.
+
+Note that these latencies are very hard to measure accurately in a cooperative
+multitasking environment because Scrapy may be busy processing a spider
+callback, for example, and unable to attend downloads. However, these latencies
+should still give a reasonable estimate of how busy Scrapy (and ultimately, the
+server) is, and this extension builds on that premise.
 
 Settings
 ========
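To make the throttling rules in the hunk above concrete, here is a minimal standalone sketch of the delay-adjustment math (illustrative only; the constant names mirror the settings, but this is not the extension's actual code and the sample latencies are made up):

# Minimal sketch of the five throttling rules, not the extension code.

START_DELAY = 5.0          # AUTOTHROTTLE_START_DELAY
MIN_DELAY = 0.0            # DOWNLOAD_DELAY
MAX_DELAY = 60.0           # AUTOTHROTTLE_MAX_DELAY
TARGET_CONCURRENCY = 1.0   # AUTOTHROTTLE_TARGET_CONCURRENCY

def next_delay(delay, latency, status=200):
    # Rule 2: the delay that would keep N requests in flight.
    target = latency / TARGET_CONCURRENCY
    # Rule 3: move halfway toward the target; if the target is larger
    # than the mean, jump straight to it (reacts faster to slow sites).
    new_delay = max(target, (delay + target) / 2.0)
    # Rule 5: clamp between the minimum and maximum delays.
    new_delay = min(max(MIN_DELAY, new_delay), MAX_DELAY)
    # Rule 4: non-200 responses may only increase the delay.
    if status != 200 and new_delay <= delay:
        return delay
    return new_delay

delay = START_DELAY                       # Rule 1: start from the fixed delay
for latency in (0.4, 0.4, 0.4, 0.4):      # a site answering in 0.4 s
    delay = next_delay(delay, latency)
    print(round(delay, 3))                # 2.7, 1.55, 0.975, 0.688

Each response halves the gap to ``latency / N``, so the delay converges toward 0.4 s, i.e. roughly one request in flight at a time.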
@@ -88,6 +122,34 @@ Default: ``60.0``
 
 The maximum download delay (in seconds) to be set in case of high latencies.
 
+.. setting:: AUTOTHROTTLE_TARGET_CONCURRENCY
+
+AUTOTHROTTLE_TARGET_CONCURRENCY
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Default: ``1.0``
+
+Average number of requests Scrapy should be sending in parallel to remote
+websites.
+
+By default, AutoThrottle adjusts the delay to send a single
+concurrent request to each of the remote websites. Set this option to
+a higher value (e.g. ``2.0``) to increase the throughput and the load on
+remote servers. A lower ``AUTOTHROTTLE_TARGET_CONCURRENCY`` value
+(e.g. ``0.5``) makes the crawler more conservative and polite.
+
+Note that the :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
+and :setting:`CONCURRENT_REQUESTS_PER_IP` options are still respected
+when the AutoThrottle extension is enabled. This means that if
+``AUTOTHROTTLE_TARGET_CONCURRENCY`` is set to a value higher than
+:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
+:setting:`CONCURRENT_REQUESTS_PER_IP`, the crawler won't reach this number
+of concurrent requests.
+
+At any given time point Scrapy can be sending more or fewer concurrent
+requests than ``AUTOTHROTTLE_TARGET_CONCURRENCY``; it is a suggested
+value the crawler tries to approach, not a hard limit.
+
 .. setting:: AUTOTHROTTLE_DEBUG
 
 AUTOTHROTTLE_DEBUG
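For reference, enabling the extension from a project's settings.py might look like the sketch below; the setting names come from this patch, but the values are illustrative, not recommendations:

# settings.py (illustrative values)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound for the adjusted delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # aim for ~2 parallel requests per site
AUTOTHROTTLE_DEBUG = True               # log throttling stats per response

# Hard caps still apply: the target concurrency can never exceed this limit.
CONCURRENT_REQUESTS_PER_DOMAIN = 8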

docs/topics/settings.rst

Lines changed: 7 additions & 4 deletions

@@ -187,7 +187,6 @@ Default: ``16``
 The maximum number of concurrent (i.e. simultaneous) requests that will be
 performed by the Scrapy downloader.
 
-
 .. setting:: CONCURRENT_REQUESTS_PER_DOMAIN
 
 CONCURRENT_REQUESTS_PER_DOMAIN

@@ -198,6 +197,10 @@ Default: ``8``
 The maximum number of concurrent (i.e. simultaneous) requests that will be
 performed to any single domain.
 
+See also: :ref:`topics-autothrottle` and its
+:setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` option.
+
 .. setting:: CONCURRENT_REQUESTS_PER_IP
 
 CONCURRENT_REQUESTS_PER_IP

@@ -211,9 +214,9 @@ performed to any single IP. If non-zero, the
 used instead. In other words, concurrency limits will be applied per IP, not
 per domain.
 
-This setting also affects :setting:`DOWNLOAD_DELAY`:
-if :setting:`CONCURRENT_REQUESTS_PER_IP` is non-zero, download delay is
-enforced per IP, not per domain.
+This setting also affects :setting:`DOWNLOAD_DELAY` and
+:ref:`topics-autothrottle`: if :setting:`CONCURRENT_REQUESTS_PER_IP`
+is non-zero, download delay is enforced per IP, not per domain.
 
 
 .. setting:: DEFAULT_ITEM_CLASS
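A short sketch of the per-IP behaviour described in the last hunk (hypothetical settings.py values):

# With a non-zero per-IP limit, concurrency caps (and, per the paragraph
# above, the download delay) are enforced per IP address rather than per
# domain, and the per-domain limit is ignored.
CONCURRENT_REQUESTS_PER_IP = 4
AUTOTHROTTLE_ENABLED = True   # AutoThrottle then adjusts delays per IP as well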

scrapy/extensions/throttle.py

Lines changed: 23 additions & 10 deletions

@@ -14,6 +14,7 @@ def __init__(self, crawler):
             raise NotConfigured
 
         self.debug = crawler.settings.getbool("AUTOTHROTTLE_DEBUG")
+        self.target_concurrency = crawler.settings.getfloat("AUTOTHROTTLE_TARGET_CONCURRENCY")
         crawler.signals.connect(self._spider_opened, signal=signals.spider_opened)
         crawler.signals.connect(self._response_downloaded, signal=signals.response_downloaded)

@@ -28,15 +29,13 @@ def _spider_opened(self, spider):
 
     def _min_delay(self, spider):
        s = self.crawler.settings
-        return getattr(spider, 'download_delay', 0.0) or \
-            s.getfloat('AUTOTHROTTLE_MIN_DOWNLOAD_DELAY') or \
-            s.getfloat('DOWNLOAD_DELAY')
+        return getattr(spider, 'download_delay', s.getfloat('DOWNLOAD_DELAY'))
 
     def _max_delay(self, spider):
-        return self.crawler.settings.getfloat('AUTOTHROTTLE_MAX_DELAY', 60.0)
+        return self.crawler.settings.getfloat('AUTOTHROTTLE_MAX_DELAY')
 
     def _start_delay(self, spider):
-        return max(self.mindelay, self.crawler.settings.getfloat('AUTOTHROTTLE_START_DELAY', 5.0))
+        return max(self.mindelay, self.crawler.settings.getfloat('AUTOTHROTTLE_START_DELAY'))
 
     def _response_downloaded(self, response, request, spider):
         key, slot = self._get_slot(request, spider)

@@ -68,13 +67,27 @@ def _get_slot(self, request, spider):
 
     def _adjust_delay(self, slot, latency, response):
         """Define delay adjustment policy"""
-        # If latency is bigger than old delay, then use latency instead of mean.
-        # It works better with problematic sites
-        new_delay = min(max(self.mindelay, latency, (slot.delay + latency) / 2.0), self.maxdelay)
+
+        # If a server needs `latency` seconds to respond then
+        # we should send a request each `latency/N` seconds
+        # to have N requests processed in parallel
+        target_delay = latency / self.target_concurrency
+
+        # Adjust the delay to make it closer to target_delay
+        new_delay = (slot.delay + target_delay) / 2.0
+
+        # If target delay is bigger than old delay, then use it instead of
+        # the mean. It works better with problematic sites.
+        new_delay = max(target_delay, new_delay)
+
+        # Make sure self.mindelay <= new_delay <= self.maxdelay
+        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
 
         # Don't adjust delay if response status != 200 and new delay is smaller
         # than old one, as error pages (and redirections) are usually small and
         # so tend to reduce latency, thus provoking a positive feedback loop by
         # reducing the delay instead of increasing it.
-        if response.status == 200 or new_delay > slot.delay:
-            slot.delay = new_delay
+        if response.status != 200 and new_delay <= slot.delay:
+            return
+
+        slot.delay = new_delay
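To see why the non-200 guard at the end matters, here is a hedged sketch that replays the same policy against a stand-in slot object (FakeSlot is hypothetical; Scrapy's real downloader slot carries more state):

# Replaying _adjust_delay's logic against a fast error response.
class FakeSlot:
    def __init__(self, delay):
        self.delay = delay

min_delay, max_delay, target_concurrency = 0.0, 60.0, 1.0

def adjust(slot, latency, status):
    target_delay = latency / target_concurrency
    new_delay = max(target_delay, (slot.delay + target_delay) / 2.0)
    new_delay = min(max(min_delay, new_delay), max_delay)
    if status != 200 and new_delay <= slot.delay:
        return                    # fast error pages never speed us up
    slot.delay = new_delay

slot = FakeSlot(delay=2.0)
adjust(slot, latency=0.1, status=503)   # fast error: guard keeps the delay
print(slot.delay)                       # 2.0
adjust(slot, latency=0.1, status=200)   # fast success: delay may drop
print(slot.delay)                       # 1.05

Without the guard, a burst of quick 5xx pages would drag the delay down and make the crawler hammer an already struggling server; with it, only successful responses (or slower errors) can move the delay downward.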

scrapy/settings/default_settings.py

Lines changed: 6 additions & 0 deletions

@@ -20,6 +20,12 @@
 
 AJAXCRAWL_ENABLED = False
 
+AUTOTHROTTLE_ENABLED = False
+AUTOTHROTTLE_DEBUG = False
+AUTOTHROTTLE_MAX_DELAY = 60.0
+AUTOTHROTTLE_START_DELAY = 5.0
+AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
+
 BOT_NAME = 'scrapybot'
 
 CLOSESPIDER_TIMEOUT = 0
