
Commit de22b6f

Merge pull request scrapy#1721 from Digenis/offsitemw_subdomains

[MRG+1] tests+doc for subdomains in offsite middleware

2 parents: a7b8613 + 1cffa99

3 files changed: +10 −4 lines

docs/topics/spider-middleware.rst (3 additions, 0 deletions)

@@ -273,6 +273,9 @@ OffsiteMiddleware
 
 This middleware filters out every request whose host names aren't in the
 spider's :attr:`~scrapy.spiders.Spider.allowed_domains` attribute.
+All subdomains of any domain in the list are also allowed.
+E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
+but not ``www2.example.com`` nor ``example.com``.
 
 When your spider returns a request for a domain not belonging to those
 covered by the spider, this middleware will log a debug message similar to
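The subdomain rule added in this hunk can be sketched as a plain hostname check. The sketch below is only an illustration of the documented behaviour, not Scrapy's actual implementation; `host_is_allowed` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def host_is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one (mirrors the documented rule)."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)

allowed = ['www.example.org']
print(host_is_allowed('http://bob.www.example.org/page', allowed))  # True
print(host_is_allowed('http://www2.example.com/', allowed))         # False
print(host_is_allowed('http://example.com/', allowed))              # False
```

Note the suffix check requires the `'.'` separator, so `roguewww.example.org`-style lookalike hosts would not slip through.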

docs/topics/spiders.rst (1 addition, 1 deletion)

@@ -76,7 +76,7 @@ scrapy.Spider
 
 An optional list of strings containing domains that this spider is
 allowed to crawl. Requests for URLs not belonging to the domain names
-specified in this list won't be followed if
+specified in this list (or their subdomains) won't be followed if
 :class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` is enabled.
 
 .. attribute:: start_urls
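The `allowed_domains` semantics described in this hunk amount to dropping any URL whose host is not one of the listed domains or a subdomain of one. A minimal sketch, assuming plain URL strings rather than Scrapy `Request` objects; `filter_offsite` is a made-up name for illustration:

```python
from urllib.parse import urlparse

def filter_offsite(urls, allowed_domains):
    """Yield only URLs whose host matches an allowed domain or subdomain."""
    suffixes = tuple('.' + d for d in allowed_domains)
    for url in urls:
        host = urlparse(url).hostname or ''
        if host in allowed_domains or host.endswith(suffixes):
            yield url

urls = ['http://scrapy.org/1', 'http://sub.scrapy.org/1', 'http://scrapy2.org/']
print(list(filter_offsite(urls, ['scrapy.org'])))
# ['http://scrapy.org/1', 'http://sub.scrapy.org/1']
```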

tests/test_spidermiddleware_offsite.py (6 additions, 3 deletions)

@@ -16,21 +16,24 @@ def setUp(self):
         self.mw.spider_opened(self.spider)
 
     def _get_spiderargs(self):
-        return dict(name='foo', allowed_domains=['scrapytest.org', 'scrapy.org'])
+        return dict(name='foo', allowed_domains=['scrapytest.org', 'scrapy.org', 'scrapy.test.org'])
 
     def test_process_spider_output(self):
         res = Response('http://scrapytest.org')
 
         onsite_reqs = [Request('http://scrapytest.org/1'),
                        Request('http://scrapy.org/1'),
                        Request('http://sub.scrapy.org/1'),
-                       Request('http://offsite.tld/letmepass', dont_filter=True)]
+                       Request('http://offsite.tld/letmepass', dont_filter=True),
+                       Request('http://scrapy.test.org/')]
         offsite_reqs = [Request('http://scrapy2.org'),
                         Request('http://offsite.tld/'),
                         Request('http://offsite.tld/scrapytest.org'),
                         Request('http://offsite.tld/rogue.scrapytest.org'),
                         Request('http://rogue.scrapytest.org.haha.com'),
-                        Request('http://roguescrapytest.org')]
+                        Request('http://roguescrapytest.org'),
+                        Request('http://test.org/'),
+                        Request('http://notscrapy.test.org/')]
         reqs = onsite_reqs + offsite_reqs
 
         out = list(self.mw.process_spider_output(res, reqs, self.spider))
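The onsite/offsite split these tests exercise can be reproduced with a host regex: an optional dotted prefix followed by one of the allowed domains. This is a sketch consistent with the test expectations, not necessarily the exact regex Scrapy builds internally:

```python
import re
from urllib.parse import urlparse

# Same allowed_domains as the updated _get_spiderargs above.
allowed = ['scrapytest.org', 'scrapy.org', 'scrapy.test.org']
host_re = re.compile(r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed))

def is_onsite(url):
    """True when the URL's host is an allowed domain or a subdomain of one."""
    return bool(host_re.match(urlparse(url).hostname or ''))

# Onsite cases from the test (dont_filter requests aside):
assert is_onsite('http://sub.scrapy.org/1')
assert is_onsite('http://scrapy.test.org/')
# Offsite cases: lookalike hosts and bare parent domains are rejected.
assert not is_onsite('http://scrapy2.org')
assert not is_onsite('http://rogue.scrapytest.org.haha.com')
assert not is_onsite('http://roguescrapytest.org')
assert not is_onsite('http://test.org/')
assert not is_onsite('http://notscrapy.test.org/')
```

The `re.escape` call keeps the dots in the domain names literal, which is exactly what the new `test.org` / `notscrapy.test.org` offsite cases guard against.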
