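author: Hands-on Python: practical web scraping and data analysis tips
title: Python Web Scraping in Practice ⑪ | Getting Started with Scrapy: Building Your First Crawler Project
update: 2026-04-26
tags: Python, web scraping, Scrapy, framework, Spider, Item, CrawlSpider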
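Audience: developers who already know the basics of scraping with Python and want a professional framework to work more efficiently
Preface: Why Scrapy?
When you write crawlers with requests, every new site means rewriting the same pieces:
- Downloader (requests)
- Parser (BeautifulSoup)
- Data storage (CSV/Excel)
- Pagination logic (hand-written loops)
- Error handling (try/except)
Scrapy = the "scaffolding" of the crawler world. It automates the repetitive parts, so you only have to care about the data extraction logic.
If writing crawlers with requests is "reinventing the wheel", Scrapy hands you a ready-made "car".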
1. Installing Scrapy
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
Note for Windows users: if the installation fails, install the Microsoft Visual C++ Build Tools first, or install with conda instead:
conda install scrapy
Verify the installation:
scrapy version
# Scrapy 2.11.x
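The same check can be done from Python, which is handy inside a setup script. The sketch below is only an illustration: it uses the standard library's `importlib.metadata`, and the helper name `installed_version` is made up for this example rather than being part of Scrapy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the Scrapy version if it is installed in the current environment
print("scrapy:", installed_version("scrapy") or "not installed")
```

If this prints "not installed", the pip or conda step above most likely ran in a different environment from the one running the script.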
2. Creating Your First Scrapy Project
2.1 Creating the project from the command line
scrapy startproject douban_spider
The project structure is generated automatically:
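douban_spider/
├── scrapy.cfg            # deploy configuration file
└── douban_spider/        # the project's Python package
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # directory for spiders
        └── __init__.py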
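2.2 The directory structure in detail
| File/Directory | Purpose |
|---|---|
| scrapy.cfg | Deploy configuration file; it also marks the project root |
| items.py | Defines the data structure (field names; `Field()` does not enforce types) |
| middlewares.py | Middlewares (hooks that intercept requests and responses) |
| pipelines.py | Item pipelines (cleaning, storage) |
| settings.py | Project settings (delays, headers, enabled pipelines) |
| spiders/ | Holds the individual spiders |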
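3. Defining an Item (Data Structure)
3.1 Why define an Item?
An Item is like a table schema in a database: you declare the fields up front, and Scrapy passes the populated Items from the spider to the pipelines and exporters for you.
3.2 Writing items.py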
import scrapy


class DoubanMovieItem(scrapy.Item):
    """Douban movie item."""

    rank = scrapy.Field()    # ranking
    title = scrapy.Field()   # movie title
    rating = scrapy.Field()  # rating
    quote = scrapy.Field()   # one-line quote
    url = scrapy.Field()     # detail page link

    def __repr__(self):
        """Compact form for logs: show only the rank and the title."""
        return f"<Movie: {self.get('rank', 'N/A')}. {self.get('title', 'N/A')}>"
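A `scrapy.Field()` declares a field name but carries no type. If you want type annotations and default values, recent Scrapy versions also accept plain dataclass objects as items. The sketch below is a hypothetical dataclass counterpart of the Item above: the class name `MovieRecord` and the sample values are placeholders, and it runs without Scrapy installed.

```python
from dataclasses import asdict, dataclass


@dataclass
class MovieRecord:
    """Hypothetical dataclass counterpart of DoubanMovieItem."""

    rank: str = ""
    title: str = ""
    rating: float = 0.0
    quote: str = ""
    url: str = ""


# Placeholder values for illustration only
record = MovieRecord(rank="1", title="肖申克的救赎", rating=9.7)
print(asdict(record))
```

Fields that were never set keep their defaults, so downstream code does not have to guard against missing keys.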
4. Writing Your First Spider
4.1 Creating the spider in the spiders directory
# douban_spider/spiders/movie_spider.py
import scrapy

from douban_spider.items import DoubanMovieItem


class DoubanMovieSpider(scrapy.Spider):
    """Douban movie spider."""

    # Spider name (unique identifier, used by `scrapy crawl`)
    name = "douban_movie"
    # Domains the spider is allowed to crawl
    allowed_domains = ["movie.douban.com"]
    # List of start URLs
    start_urls = [
        "https://movie.douban.com/top250",
    ]

    def parse(self, response):
        """
        Parse callback: Scrapy calls it for every downloaded page.
        response is the page response object.
        """
        # Select every movie entry with XPath
        movie_items = response.xpath("//div[@class='item']")
        for movie in movie_items:
            item = DoubanMovieItem()
            # Ranking
            rank = movie.xpath(".//em[@class='']/text()").get()
            item["rank"] = rank.strip() if rank else ""
            # Movie title
            title = movie.xpath(".//span[@class='title'][1]/text()").get()
            item["title"] = title.strip() if title else ""
            # Rating
            rating = movie.xpath(".//span[@class='rating_num']/text()").get()
            item["rating"] = rating.strip() if rating else ""
            # One-line quote
            quote = movie.xpath(".//span[@class='inq']/text()").get()
            item["quote"] = quote.strip() if quote else ""
            # Detail page link
            url = movie.xpath(".//div[@class='hd']/a/@href").get()
            item["url"] = url or ""
            # Yield the item (Scrapy hands it to the pipelines)
            yield item

        # Extract the next-page link and keep crawling
        next_page = response.xpath("//span[@class='next']/a/@href").get()
        if next_page:
            # Build the absolute URL
            next_url = response.urljoin(next_page)
            self.logger.info("Found next page: %s", next_url)
            # Schedule the next page with the same callback
            yield scrapy.Request(next_url, callback=self.parse)
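The pagination step relies on `response.urljoin`, which resolves a relative link against the URL of the page being parsed, in the same way as the standard library's `urllib.parse.urljoin`. The sketch below reproduces that resolution without Scrapy. The relative href is an assumed example of what a "next page" link could look like, not something taken from the site.

```python
from urllib.parse import urljoin

# Assumed values for illustration: the page being parsed and a relative href
page_url = "https://movie.douban.com/top250"
relative_href = "?start=25&filter="

# A query-only link keeps the path of the current page and replaces the query
next_url = urljoin(page_url, relative_href)
print(next_url)

# An href that is already absolute is returned unchanged
absolute_href = "https://movie.douban.com/top250?start=50&filter="
print(urljoin(page_url, absolute_href))
```

This is why the spider can pass whatever the `href` attribute contains to `response.urljoin` without checking whether it is relative or absolute.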
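4.2 Running the spider
cd douban_spider
scrapy crawl douban_movie
5. Exporting Data
Scrapy ships with several export formats built in, so one command is enough:
# Export as JSON
scrapy crawl douban_movie -o movies.json
# Export as CSV
scrapy crawl douban_movie -o movies.csv
# Export as JSON Lines (one record per line, well suited to large datasets)
scrapy crawl douban_movie -o movies.jl
# Export as XML
scrapy crawl douban_movie -o movies.xml
# Note: -o appends to an existing file; use -O (capital letter) to overwrite it
Sample output: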
2026-04-26 10:00:00 [scrapy.core.engine] INFO: Spider opened
2026-04-26 10:00:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting ...
2026-04-26 10:00:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.douban.com/top250>
<Movie: 1. 肖申克的救赎>
2026-04-26 10:00:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://movie.douban.com/top250>
<Movie: 2. 霸王别姬>
...
2026-04-26 10:00:15 [scrapy.core.engine] INFO: Closing spider (finished)
2026-04-26 10:00:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_count': 10, 'item_scraped_count': 250}
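The JSON Lines format listed above suits large exports because each line is a complete JSON document, so a file can be processed one record at a time. A minimal sketch of reading such a file follows. The `io.StringIO` object stands in for an exported `movies.jl`, the two records are made-up samples, and `iter_json_lines` is a name invented for this example.

```python
import io
import json

# Stand-in for an exported movies.jl file; the two records are made-up samples
sample_jl = io.StringIO(
    '{"rank": "1", "title": "Movie A", "rating": "9.7"}\n'
    '{"rank": "2", "title": "Movie B", "rating": "9.6"}\n'
)


def iter_json_lines(lines):
    """Yield one dict per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


movies = list(iter_json_lines(sample_jl))
print(len(movies), movies[0]["title"])
```

With a real export, replace the `StringIO` object with `open("movies.jl", encoding="utf-8")`.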
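6. Item Pipeline
Pipelines handle data cleaning, validation and storage. Scrapy automatically passes every Item a spider yields through the enabled pipelines.
6.1 Enabling pipelines
Enable them in settings.py:
ITEM_PIPELINES = {
    "douban_spider.pipelines.DoubanSpiderPipeline": 300,
    "douban_spider.pipelines.MovieDataCleanPipeline": 400,
}
# Lower numbers run first (values are conventionally kept in the 0-1000 range)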
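To make the ordering rule concrete, the sketch below sorts an `ITEM_PIPELINES` mapping by its integer values, which gives the order in which `process_item` is called. It is plain Python, not Scrapy's own loading code, and the dictionary is deliberately written out of order. The third entry is the storage pipeline that this article defines later.

```python
# Deliberately listed out of order: only the integer values decide the sequence
ITEM_PIPELINES = {
    "douban_spider.pipelines.MovieStoragePipeline": 500,
    "douban_spider.pipelines.DoubanSpiderPipeline": 300,
    "douban_spider.pipelines.MovieDataCleanPipeline": 400,
}

# Sort by value: the smallest number runs first
execution_order = [
    path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])
]

for position, path in enumerate(execution_order, start=1):
    print(position, path.rsplit(".", 1)[-1])
```

Leaving gaps between the numbers (300, 400, 500) makes it easy to slot a new pipeline in between later.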
6.2 Writing the pipelines
# douban_spider/pipelines.py
import json

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DoubanSpiderPipeline:
    """Basic pipeline: print progress messages."""

    def open_spider(self, spider):
        """Called when the spider starts."""
        print(f"\n{'=' * 50}")
        print(f"Spider {spider.name} started")
        print(f"{'=' * 50}\n")

    def process_item(self, item, spider):
        """Process each Item."""
        adapter = ItemAdapter(item)
        print(f"  Scraped: {adapter.get('rank')}. {adapter.get('title')}")
        return item  # must be returned so the next pipeline receives it

    def close_spider(self, spider):
        """Called when the spider closes."""
        print(f"\n{'=' * 50}")
        print(f"Spider {spider.name} finished")
        print(f"{'=' * 50}\n")


class MovieDataCleanPipeline:
    """Data cleaning pipeline."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Clean the rank (strip whitespace)
        rank = adapter.get("rank")
        if rank:
            adapter["rank"] = rank.strip()
        # Clean the rating (convert to float; missing or invalid becomes 0.0)
        try:
            adapter["rating"] = float(str(adapter.get("rating") or "").strip())
        except ValueError:
            adapter["rating"] = 0.0
        # Clean the title (collapse repeated whitespace)
        title = adapter.get("title")
        if title:
            adapter["title"] = " ".join(title.split())
        # Filter out low ratings
        if adapter["rating"] < 9.0:
            # Raising DropItem discards the item; returning None would instead
            # hand None to the next pipeline
            raise DropItem(f"Rating too low, filtered out: {adapter.get('title')}")
        return item


class MovieStoragePipeline:
    """Storage pipeline: save the data to a file as a JSON array."""

    def open_spider(self, spider):
        self.file = open("movies_from_pipeline.json", "w", encoding="utf-8")
        self.file.write("[\n")
        self.first_item = True

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        line = json.dumps(dict(adapter), ensure_ascii=False)
        # Write the separator before every item except the first, so the
        # array has no trailing comma and the file stays valid JSON
        if not self.first_item:
            self.file.write(",\n")
        self.file.write(line)
        self.first_item = False
        return item

    def close_spider(self, spider):
        self.file.write("\n]")
        self.file.close()
        print("\nData saved to movies_from_pipeline.json")
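Cleaning and filtering logic is easier to trust once it has been exercised without running a crawl. The sketch below is a simplified, Scrapy-free model of a pipeline chain: each stage either returns the item for the next stage or raises an exception to discard it, which is how Scrapy's `DropItem` is meant to be used. The `DropItem` class here is a local stand-in, and `to_float_rating`, `drop_low_rating` and `run_chain` are names invented for this example.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (keeps the sketch Scrapy-free)."""


def to_float_rating(item):
    """Stage 1: convert the rating to a float (0.0 if missing or invalid)."""
    try:
        item["rating"] = float(str(item.get("rating") or "").strip())
    except ValueError:
        item["rating"] = 0.0
    return item


def drop_low_rating(item, threshold=9.0):
    """Stage 2: discard items below the threshold by raising DropItem."""
    if item["rating"] < threshold:
        raise DropItem(f"rating {item['rating']} is below {threshold}")
    return item


def run_chain(item, stages):
    """Pass the item through each stage in order; return None if a stage drops it."""
    for stage in stages:
        try:
            item = stage(item)
        except DropItem:
            return None
    return item


stages = [to_float_rating, drop_low_rating]
kept = run_chain({"title": "Movie A", "rating": " 9.5 "}, stages)
dropped = run_chain({"title": "Movie B", "rating": "8.1"}, stages)
print(kept, dropped)
```

Once an item is dropped, later stages never see it, so a storage stage placed last only ever receives cleaned items that passed the filter.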
7. Settings
7.1 Common settings in settings.py
# douban_spider/settings.py
# Project (bot) name
BOT_NAME = "douban_spider"
SPIDER_MODULES = ["douban_spider.spiders"]
NEWSPIDER_MODULE = "douban_spider.spiders"
# Obey robots.txt
ROBOTSTXT_OBEY = True
# Request headers (global)
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://movie.douban.com/",
}
# Delay between requests (seconds)
DOWNLOAD_DELAY = 1
# Concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # at most 1 request in flight per domain
# Whether to enable cookies
COOKIES_ENABLED = True
# Enabled pipelines
ITEM_PIPELINES = {
    "douban_spider.pipelines.DoubanSpiderPipeline": 300,
    "douban_spider.pipelines.MovieDataCleanPipeline": 400,
    "douban_spider.pipelines.MovieStoragePipeline": 500,
}
# Log level
LOG_LEVEL = "INFO"
# Automatic throttling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
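How the four AUTOTHROTTLE settings interact is easier to see with numbers. The sketch below is a simplified model that follows the adjustment rules as the Scrapy documentation describes them: the target delay is the latency divided by the target concurrency, the new delay is the average of the previous and target delays, and the result is clamped between `DOWNLOAD_DELAY` and `AUTOTHROTTLE_MAX_DELAY`. It is not the extension's actual code, which has further details and may produce different numbers. The function name `next_delay` and the latencies are invented for illustration.

```python
def next_delay(previous_delay, latency, *, target_concurrency=1.0,
               min_delay=1.0, max_delay=10.0, status=200):
    """Simplified model of one AutoThrottle adjustment step."""
    # A server answering in `latency` seconds should get one request
    # every latency / N seconds to keep N requests in flight
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay towards the target
    candidate = (previous_delay + target_delay) / 2.0
    # Stay within DOWNLOAD_DELAY (min) and AUTOTHROTTLE_MAX_DELAY (max)
    candidate = min(max(candidate, min_delay), max_delay)
    # Non-200 responses are not allowed to lower the delay
    if status != 200 and candidate < previous_delay:
        return previous_delay
    return candidate


delay = 1.0  # AUTOTHROTTLE_START_DELAY
for latency in (3.0, 3.0, 0.5):  # made-up response latencies in seconds
    delay = next_delay(delay, latency)
    print(f"latency={latency:.1f}s -> delay={delay:.2f}s")
```

In this model a slow server pushes the delay up gradually, and a fast one can bring it back down only as far as `DOWNLOAD_DELAY`.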
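8. CrawlSpider (Rule-Based Spider)
For list pages with a regular structure, CrawlSpider can follow pagination links automatically: you declare link rules instead of writing the next-page logic by hand, and you still provide a callback that parses each page.
# douban_spider/spiders/crawl_spider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from douban_spider.items import DoubanMovieItem


class DoubanCrawlSpider(CrawlSpider):
    """Rule-based spider: follows pagination automatically."""

    name = "douban_crawl"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    rules = (
        # Extract the links that match the pattern and follow them
        Rule(
            LinkExtractor(allow=r"start=\d+"),
            callback="parse_item",
            follow=True,  # keep following links found on the crawled pages
        ),
    )

    # Rule callbacks run only for followed links. Responses for start_urls go
    # to parse_start_url(), which yields nothing by default.
    # Do not name the callback "parse": CrawlSpider uses parse() internally.
    def parse_item(self, response):
        """Parse the data on each page."""
        for movie in response.xpath("//div[@class='item']"):
            item = DoubanMovieItem()
            item["rank"] = movie.xpath(".//em/text()").get()
            item["title"] = movie.xpath(".//span[@class='title'][1]/text()").get()
            item["rating"] = movie.xpath(".//span[@class='rating_num']/text()").get()
            item["quote"] = movie.xpath(".//span[@class='inq']/text()").get()
            item["url"] = movie.xpath(".//div[@class='hd']/a/@href").get()
            yield item
Run it:
scrapy crawl douban_crawl -o movies_crawl.json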
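Before running a rule-based spider, it helps to check which URLs the `allow` pattern would pick up. The sketch below tests the same regular expression with the standard `re` module, assuming the extractor applies it as a search anywhere in the absolute URL. The candidate URLs are hypothetical examples of links a list page might contain.

```python
import re

# Same pattern as the allow argument of the LinkExtractor in the spider
pagination = re.compile(r"start=\d+")

# Hypothetical URLs of the kinds a list page might link to
candidates = [
    "https://movie.douban.com/top250?start=25&filter=",
    "https://movie.douban.com/top250?start=225&filter=",
    "https://movie.douban.com/subject/123/",
    "https://movie.douban.com/top250",
]

# Keep only the URLs the pattern matches somewhere in the string
followed = [url for url in candidates if pagination.search(url)]
print(followed)
```

Only the two pagination URLs match, so detail pages and the bare list URL would not be followed by this rule.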
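9. Cheat Sheet
| Concept | Description |
|---|---|
| scrapy startproject | Create a Scrapy project |
| scrapy genspider | Create a spider file |
| scrapy crawl name | Run a spider |
| -o filename.json/csv | Export the data |
| name | Unique identifier of a spider |
| start_urls | List of start URLs |
| parse(response) | Parse callback; response is the response object |
| yield item | Yield an Item; Scrapy passes it to the pipelines |
| scrapy.Request(url, callback) | Issue a request manually |
| response.urljoin(url) | Build an absolute URL |
| Rule + LinkExtractor | CrawlSpider rules for automatic pagination |
| ITEM_PIPELINES | Pipeline configuration |
| DOWNLOAD_DELAY | Delay between requests |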
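10. Homework
Required:
- Install Scrapy and create your first project
- Use Scrapy to crawl the first 3 pages of the Douban Top 250 and export the data as JSON
- Write a pipeline that cleans the data
Optional:
- Rewrite the Douban spider with CrawlSpider
- Combine several pipelines to process the data
When you have finished the homework, post a screenshot of your run in the comments!
Scrapy = a production-grade crawling framework. On the road from "can write a crawler" to "writes crawlers well", Scrapy is a must.
Key points of this article:
- Installing Scrapy and creating a project
- Defining Items
- Writing and running a Spider
- Exporting data
- Item pipelines
- Settings
- Rule-based crawling with CrawlSpider
The next article covers Scrapy Items and Pipelines in depth, with a closer look at the data processing flow.
Bookmark and follow so you do not miss updates to this column!
Questions are welcome in the comments; let's discuss them together!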