@@ -12,39 +12,73 @@ Design goals
1. be nicer to sites instead of using the default download delay of zero
2. automatically adjust scrapy to the optimum crawling speed, so the user
-   doesn't have to tune the download delays and concurrent requests to find the
-   optimum one. The user only needs to specify the maximum concurrent requests
+   doesn't have to tune the download delays to find the optimum one.
+   The user only needs to specify the maximum concurrent requests
   it allows, and the extension does the rest.
+.. _autothrottle-algorithm:
+
How it works
============

-In Scrapy, the download latency is measured as the time elapsed between
-establishing the TCP connection and receiving the HTTP headers.
+The AutoThrottle extension adjusts download delays dynamically so that the
+spider sends :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` concurrent requests
+on average to each remote website.

-Note that these latencies are very hard to measure accurately in a cooperative
-multitasking environment because Scrapy may be busy processing a spider
-callback, for example, and unable to attend downloads. However, these latencies
-should still give a reasonable estimate of how busy Scrapy (and ultimately, the
-server) is, and this extension builds on that premise.
+It uses download latency to compute the delays. The main idea is the
+following: if a server needs ``latency`` seconds to respond, a client
+should send a request every ``latency/N`` seconds to have ``N`` requests
+processed in parallel.
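The ``latency/N`` idea above can be illustrated with a short sketch (the function name is ours, for illustration only; this is not Scrapy's actual implementation):

```python
def send_interval(latency: float, target_concurrency: float) -> float:
    """Delay between consecutive requests needed so that roughly
    `target_concurrency` requests are being processed by the server
    in parallel, given that each response takes `latency` seconds."""
    return latency / target_concurrency

# A server responding in 0.6 s, with a target of 2 concurrent requests:
# send a request every 0.3 s, so ~2 requests are always in flight.
assert send_interval(0.6, 2.0) == 0.3
```

With a target concurrency of 1 the interval equals the latency itself: requests are sent back-to-back, one at a time.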
-.. _autothrottle-algorithm:
+Instead of adjusting the delays, one can just set a small fixed
+download delay and impose hard limits on concurrency using the
+:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
+:setting:`CONCURRENT_REQUESTS_PER_IP` options. This provides a similar
+effect, but there are some important differences:
+
+* because the download delay is small, there will be occasional bursts
+  of requests;
+* non-200 (error) responses are often returned faster than regular
+  responses, so with a small download delay and a hard concurrency limit
+  the crawler will be sending requests to the server faster when the
+  server starts to return errors. But this is the opposite of what a
+  crawler should do - in case of errors it makes more sense to slow down:
+  these errors may be caused by the high request rate.
+
+AutoThrottle doesn't have these issues.
Throttling algorithm
====================

-This adjusts download delays and concurrency based on the following rules:
+The AutoThrottle algorithm adjusts download delays based on the following rules:

-1. spiders always start with one concurrent request and a download delay of
-   :setting:`AUTOTHROTTLE_START_DELAY`
-2. when a response is received, the download delay is adjusted to the
-   average of previous download delay and the latency of the response.
+1. spiders always start with a download delay of
+   :setting:`AUTOTHROTTLE_START_DELAY`;
+2. when a response is received, the target download delay is calculated as
+   ``latency / N``, where ``latency`` is the latency of the response
+   and ``N`` is :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY`;
+3. the download delay for the next requests is set to the average of the
+   previous download delay and the target download delay;
+4. latencies of non-200 responses are not allowed to decrease the delay;
+5. the download delay can't become less than :setting:`DOWNLOAD_DELAY` or
+   greater than :setting:`AUTOTHROTTLE_MAX_DELAY`.
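The rules above can be sketched as a single update function (a minimal illustration with hypothetical names, not Scrapy's actual AutoThrottle code; the parameters mirror the ``AUTOTHROTTLE_TARGET_CONCURRENCY``, ``DOWNLOAD_DELAY`` and ``AUTOTHROTTLE_MAX_DELAY`` settings):

```python
def next_delay(prev_delay: float, latency: float, status: int,
               target_concurrency: float = 1.0,
               min_delay: float = 0.0, max_delay: float = 60.0) -> float:
    # Rule 2: target delay spreads N concurrent requests over one latency.
    target = latency / target_concurrency
    # Rule 3: move halfway from the previous delay towards the target.
    new_delay = (prev_delay + target) / 2.0
    # Rule 4: a non-200 response may only keep or increase the delay.
    if status != 200 and new_delay < prev_delay:
        new_delay = prev_delay
    # Rule 5: clamp between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.
    return max(min_delay, min(new_delay, max_delay))

# Starting from a 5 s delay (rule 1), a fast 200 response with 1 s latency
# pulls the delay halfway down towards the 1 s target: (5 + 1) / 2 = 3.
assert next_delay(5.0, 1.0, 200) == 3.0
# The same latency on a 503 error is not allowed to decrease the delay.
assert next_delay(5.0, 1.0, 503) == 5.0
```

Because each step averages towards ``latency / N``, the delay converges to the target geometrically rather than jumping there in one step, which smooths out latency spikes.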
.. note:: The AutoThrottle extension honours the standard Scrapy settings for
-   concurrency and delay. This means that it will never set a download delay
-   lower than :setting:`DOWNLOAD_DELAY` or a concurrency higher than
-   :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
-   (or :setting:`CONCURRENT_REQUESTS_PER_IP`, depending on which one you use).
+   concurrency and delay. This means that it will respect the
+   :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
+   :setting:`CONCURRENT_REQUESTS_PER_IP` options and
+   never set a download delay lower than :setting:`DOWNLOAD_DELAY`.
+
+.. _download-latency:
+
+In Scrapy, the download latency is measured as the time elapsed between
+establishing the TCP connection and receiving the HTTP headers.
+
+Note that these latencies are very hard to measure accurately in a cooperative
+multitasking environment because Scrapy may be busy processing a spider
+callback, for example, and unable to attend downloads. However, these latencies
+should still give a reasonable estimate of how busy Scrapy (and ultimately, the
+server) is, and this extension builds on that premise.
Settings
========
@@ -88,6 +122,34 @@ Default: ``60.0``

The maximum download delay (in seconds) to be set in case of high latencies.

+.. setting:: AUTOTHROTTLE_TARGET_CONCURRENCY
+
+AUTOTHROTTLE_TARGET_CONCURRENCY
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Default: ``1.0``
+
+Average number of requests Scrapy should be sending in parallel to remote
+websites.
+
+By default, AutoThrottle adjusts the delay to send a single
+concurrent request to each of the remote websites. Set this option to
+a higher value (e.g. ``2.0``) to increase the throughput and the load on
+remote servers. A lower ``AUTOTHROTTLE_TARGET_CONCURRENCY`` value
+(e.g. ``0.5``) makes the crawler more conservative and polite.
+
+Note that the :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
+and :setting:`CONCURRENT_REQUESTS_PER_IP` options are still respected
+when the AutoThrottle extension is enabled. This means that if
+``AUTOTHROTTLE_TARGET_CONCURRENCY`` is set to a value higher than
+:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
+:setting:`CONCURRENT_REQUESTS_PER_IP`, the crawler won't reach this number
+of concurrent requests.
+
+At any given time Scrapy can be sending more or fewer concurrent
+requests than ``AUTOTHROTTLE_TARGET_CONCURRENCY``; it is a suggested
+value the crawler tries to approach, not a hard limit.
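A ``settings.py`` fragment putting these options together might look as follows (the values are illustrative, not recommendations):

```python
# Enable AutoThrottle and tune it for a somewhat higher throughput.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # avg parallel requests per website

# Hard concurrency caps are still respected; the target above cannot be
# reached if it exceeds these limits.
CONCURRENT_REQUESTS_PER_DOMAIN = 8
```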
+
.. setting:: AUTOTHROTTLE_DEBUG

AUTOTHROTTLE_DEBUG