Lesson 44: Python | Web Scraping Basics [Hands-On Web Page Parsing with Requests and BeautifulSoup]

Originally published 2026-06-29 21:01:03


This content is licensed under CC 4.0 BY-SA.

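📖 Introduction

In the internet era, vast amounts of data are scattered across countless web pages. Whether for market analysis, academic research, or training machine learning models, we often need to extract information from websites automatically. A web crawler (also called a spider) is a program that automates this: it imitates a browser by requesting pages, then parses the responses and extracts the data.

With its concise syntax and rich set of third-party libraries, Python is one of the most popular languages for writing crawlers. This lesson introduces two core tools, Requests (for sending HTTP requests) and BeautifulSoup (for parsing HTML/XML), so that you can write your own crawler from scratch.

💡 Typical use cases:

Market research: collect competitors' product prices and user reviews.
News aggregation: periodically fetch the headlines of several news sites to build your own reading platform.
Academic research: obtain public datasets (such as paper metadata or statistics).
Monitoring and alerting: watch a site for content changes (such as price changes or newly released courses).

⚠️ Important: a crawler must comply with laws and regulations and with the site's robots.txt, must not put load on the target site, and must not collect private or copyrighted data. This lesson is for learning and research only; do not use it for illegal purposes.

After this lesson you will be able to fetch page source with Requests, parse HTML with BeautifulSoup, extract the information you need from the page, and save the data to a structured file.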

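🎯 Learning Objectives

Goal	What you will learn	Interview / work value
1️⃣	Understand basic HTTP concepts (request methods, status codes, request headers)	Analyzing network requests
2️⃣	Install and use the Requests library (GET, POST, custom headers, timeouts)	Sending network requests
3️⃣	Install and use the BeautifulSoup library (find/find_all, CSS selectors)	Parsing HTML documents
4️⃣	Extract text, links, image URLs, and specific attributes from a page	The core of data scraping
5️⃣	Handle common anti-scraping measures (User-Agent, simulated request headers)	Dealing with basic anti-scraping
6️⃣	Save scraped data to a CSV file	Data persistence

🔥 Interview questions: "Describe the basic workflow of a web crawler." "How do you handle a rejected request?" "What is the difference between find and find_all in BeautifulSoup?" "How do you extract every element with a given class from HTML?"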

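📚 Theory in Depth

I. Web Crawler Overview

1.1 What is a crawler?

A web crawler is a program that fetches web content automatically. It imitates a browser by sending HTTP requests to a server, receives the responses, and extracts the required information from HTML, JSON, or other data.

1.2 The basic crawler workflow

Send the request: use the Requests library to send an HTTP request to the target URL.
Receive the response: the server returns HTML, JSON, or data in another format.
Parse the content: use a library such as BeautifulSoup to parse the HTML and extract the data.
Store the data: save the extracted data to CSV, JSON, a database, and so on.

1.3 Legality and ethics

robots.txt: a file in the site's root directory that states which paths crawlers may and may not access. You can view it at https://example.com/robots.txt.
Request rate: do not request too fast. Add a reasonable delay (for example time.sleep(1)) to avoid putting load on the server.
Obey the law: do not scrape private data, copyrighted content, or data that requires a login, and do not use scraped data for commercial profit.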
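
The robots.txt check described above can be automated with the standard library's `urllib.robotparser`. The following is a minimal sketch, not a complete politeness policy: the robots.txt content, the `my-crawler` user-agent name, and the `build_parser` helper are all hypothetical, and the rules are inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# "my-crawler" is a made-up user-agent name used only for this illustration.
allowed = parser.can_fetch("my-crawler", "https://example.com/articles/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
delay = parser.crawl_delay("my-crawler")

print(allowed, blocked, delay)
```

Against a real site you would typically call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parse()`, and sleep for at least the reported crawl delay between requests.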

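II. HTTP Basics

2.1 HTTP request methods

GET: requests a resource; parameters are appended to the URL (for example, a Baidu search).
POST: submits data to the server (for example, a login form or a file upload).

2.2 Common status codes

Status code	Meaning	How to handle it
200	OK, the request succeeded	Parse normally
301/302	Redirect	Requests follows it automatically (can be disabled)
403	Forbidden	You may need to add headers or change IP
404	Not found	Check the URL
500	Internal server error	Retry later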
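
One way to turn the table above into code is a small helper that maps a status code to a coarse action. This is only an illustrative policy: the `suggest_action` name and the action labels are made up for this sketch, and the extra redirect codes and 429 (Too Many Requests) are additions that do not appear in the table.

```python
from http import HTTPStatus


def suggest_action(status_code):
    """Map an HTTP status code to a coarse crawler action (illustrative policy only)."""
    if 200 <= status_code < 300:
        return "parse"
    if status_code in (301, 302, 303, 307, 308):
        return "follow-redirect"
    if status_code in (403, 404):
        return "skip"
    # 429 is not in the table above; it is included here as a common rate-limit signal.
    if status_code == 429 or 500 <= status_code < 600:
        return "retry-later"
    return "inspect"


for code in (200, 302, 403, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code.
    print(code, HTTPStatus(code).phrase, "->", suggest_action(code))
```

In a real crawler the input would be `response.status_code`; Requests already follows redirects for you, so the redirect branch mostly matters when automatic redirects are disabled.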

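2.3 Request headers

The server uses request headers to identify the client. Common fields:

User-Agent: identifies the browser type. Requests sends python-requests/2.x.x by default, which sites can easily recognize and block, so it is usually set to a common browser UA.
Referer: the page the request came from.
Cookie: carries the login state.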
III. The Requests Library in Detail

3.1 Installation

pip install requests

3.2 Sending a GET request

import requests

url = 'https://httpbin.org/get'
response = requests.get(url)
print(response.status_code)   # 200
print(response.text)          # response body as a string

3.3 GET requests with parameters

params = {'q': 'python', 'page': 1}
response = requests.get('https://httpbin.org/get', params=params)
print(response.url)   # the URL that was actually requested
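
To see how `params` ends up in the URL without sending anything over the network, you can build the request and prepare it by hand. This is a small sketch using Requests' `Request` and `prepare()`; the `prepared` variable name is just for illustration.

```python
import requests

params = {"q": "python", "page": 1}

# Build the request and prepare it without sending it.
# prepare() encodes the params into the query string.
prepared = requests.Request("GET", "https://httpbin.org/get", params=params).prepare()

print(prepared.url)  # https://httpbin.org/get?q=python&page=1
```

This is the same encoding that `requests.get(..., params=params)` applies before sending, which is why `response.url` shows the query string.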

3.4 Sending a POST request

data = {'username': 'admin', 'password': '123456'}
response = requests.post('https://httpbin.org/post', data=data)   # form-encoded body
# Send JSON data (the json= argument serializes the dict and sets Content-Type)
response = requests.post('https://httpbin.org/post', json={'key': 'value'})

3.5 Adding request headers

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)
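
To check which User-Agent would actually be sent, you can compare Requests' default with a header set on a session. This is a sketch that prepares a request without sending it; `BROWSER_UA` is simply the browser string used in this lesson, and setting it on a `Session` (rather than per request) is one possible design, not the only one.

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)

# The User-Agent that Requests sends when none is specified.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Headers set on a Session apply to every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": BROWSER_UA})

# Prepare (but do not send) a request to inspect the merged headers.
prepared = session.prepare_request(requests.Request("GET", "https://httpbin.org/get"))
print(prepared.headers["User-Agent"])
```

With the session configured this way, `session.get(url)` sends the browser User-Agent without passing `headers=` on every call.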

3.6 Timeouts and exception handling

try:
    response = requests.get(url, timeout=5)   # 5-second timeout
    response.raise_for_status()               # raises HTTPError for 4xx/5xx status codes
except requests.Timeout:
    print("Request timed out")
except requests.RequestException as e:
    print(f"Request error: {e}")
3.7 Using a proxy

proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080'
}
response = requests.get(url, proxies=proxies)

IV. The BeautifulSoup Library in Detail

4.1 Installation and parsers

pip install beautifulsoup4

BeautifulSoup supports several parsers. The most common are html.parser (built in) and lxml (faster; install it with pip install lxml).

4.2 Creating a BeautifulSoup object

from bs4 import BeautifulSoup

html_doc = "<html><head><title>Title</title></head>...</html>"
soup = BeautifulSoup(html_doc, 'html.parser')

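4.3 Finding elements

`find()` and `find_all()`

# Find the first <a> tag (returns None if there is none)
a_tag = soup.find('a')
# Find all <a> tags (returns a list, possibly empty)
all_a = soup.find_all('a')

# Find by attribute
div = soup.find('div', class_='content')   # use class_ because class is a Python keyword
div = soup.find('div', id='main')
# Use an attrs dictionary
div = soup.find('div', attrs={'data-id': '123'})

`select()` with CSS selectors

# Select elements whose class is "title"
items = soup.select('.title')
# Select the element whose id is "main" (select() always returns a list)
main = soup.select('#main')
# Select all <a> tags that have an href attribute
links = soup.select('a[href]')
# Combined selector: <p> elements inside <div class="content">
divs = soup.select('div.content p')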
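
The difference between these lookups is easiest to see on a small document. The HTML below is a made-up sample, so treat this as a sketch of the return types rather than a recipe for any particular site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
SAMPLE_HTML = """
<div id="main" class="content">
  <p class="title">First</p>
  <p class="title">Second</p>
  <a href="/a/1">Link one</a>
  <a>No href</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_p = soup.find("p", class_="title")      # a single Tag: the first match
all_p = soup.find_all("p", class_="title")    # a list of every match
missing = soup.find("table")                  # None when nothing matches
missing_all = soup.find_all("table")          # an empty list when nothing matches

main_div = soup.select_one("#main")           # first match of a CSS selector
links_with_href = soup.select("a[href]")      # only <a> tags that carry an href
nested = [p.get_text() for p in soup.select("div.content p")]

print(first_p.get_text(), len(all_p), missing, len(links_with_href), nested)
```

Because `find()` can return `None`, check the result before accessing `.text` or attributes; `find_all()` and `select()` are safe to loop over even when nothing matched.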

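4.4 Extracting data

# Get the text (tags are stripped automatically)
title = soup.title.text
# Get an attribute value (returns None if the attribute is missing)
link = a_tag.get('href')
# Get the full markup, tags included
html_str = str(a_tag)

4.5 Traversing elements

# Iterate over all direct children
for child in soup.div.children:
    print(child)
# Get the parent node (a_tag is the Tag found in 4.3)
parent = a_tag.parent
# Get the next sibling (this may be a whitespace text node rather than a tag)
next_sib = a_tag.next_sibling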
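
A detail that often surprises beginners is that `next_sibling` and `.children` also return the whitespace between tags. The sketch below uses a made-up list to show this, along with `find_next_sibling()` and an `isinstance` filter as two ways to get only tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical HTML: note the newlines between the tags.
html = "<ul>\n<li>One</li>\n<li>Two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")

# next_sibling returns the very next node, which here is the newline text.
raw_sibling = first_li.next_sibling

# find_next_sibling() skips text nodes and returns the next matching tag.
tag_sibling = first_li.find_next_sibling("li")

# .children yields text nodes too, so filter by type to keep only tags.
child_tags = [child for child in soup.ul.children if isinstance(child, Tag)]

print(repr(str(raw_sibling)), tag_sibling.get_text(), len(child_tags))
```

If your traversal code unexpectedly prints blank lines or fails with an attribute error on a string, a whitespace text node like this is a likely cause.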

💻 Hands-On Code Cases

Case 1: Scraping a page title (a simple start)

"""
simple_crawler.py
Fetch the Baidu home page with Requests and extract its title with BeautifulSoup
"""

import requests
from bs4 import BeautifulSoup

url = 'https://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
try:
    response = requests.get(url, headers=headers, timeout=5)
    response.raise_for_status()
    response.encoding = 'utf-8'  # Baidu's home page is UTF-8
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.string
    print(f"Page title: {title}")
except requests.RequestException as e:
    print(f"Request failed: {e}")

Case 2: Scraping a news list (from Sohu News)

"""
news_crawler.py
Scrape news titles and links from the Sohu News home page (the example URL is stable)
"""

import requests
from bs4 import BeautifulSoup

url = 'https://www.sohu.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
try:
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    resp.encoding = 'utf-8'
    soup = BeautifulSoup(resp.text, 'html.parser')

    # Find the news titles (adjust the selector to the actual page structure)
    # Assumption: news links are <a> tags whose href contains '/a/'
    news_items = soup.select('a[href*="/a/"]')  # example: Sohu news links often contain /a/
    for item in news_items[:10]:  # take the first 10
        title = item.get_text(strip=True)
        link = item.get('href')
        if title and link:
            print(f"{title}\n{link}\n")
except requests.RequestException as e:
    print(f"Request failed: {e}")

Case 3: Scraping the Douban Movie Top 250 (a classic example)

"""
douban_top250.py
Scrape the title, rating, and number of ratings for the Douban Movie Top 250
"""

import re
import time

import requests
from bs4 import BeautifulSoup

def fetch_movie_page(start):
    url = f'https://movie.douban.com/top250?start={start}&filter='
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()   # fail loudly instead of parsing an error page
    response.encoding = 'utf-8'
    return response.text

def parse_movies(html):
    soup = BeautifulSoup(html, 'html.parser')
    movie_list = []
    for item in soup.select('.item'):
        title = item.select_one('.title').text.strip()
        rating = item.select_one('.rating_num').text.strip()
        # Number of ratings: the text ends with "人评价", so keep only the digits
        people_text = item.select_one('.star span:last-child').text.strip()
        people = int(re.search(r'\d+', people_text).group())
        movie_list.append({
            'title': title,
            'rating': rating,
            'people': people
        })
    return movie_list

all_movies = []
for start in range(0, 250, 25):   # 10 pages, 25 movies per page
    print(f"Fetching start={start}...")
    html = fetch_movie_page(start)
    movies = parse_movies(html)
    all_movies.extend(movies)
    time.sleep(2)   # polite delay

print(f"Scraped {len(all_movies)} movies in total")
for m in all_movies[:5]:
    print(f"{m['title']} - rating {m['rating']} - {m['people']} ratings")

Case 4: Scraping paginated data (an anime ranking)

"""
pagination_crawler.py
Show how to page through results automatically (using a placeholder anime site as the example)
"""

import requests
from bs4 import BeautifulSoup
import time

def fetch_page(page):
    # example.com is a placeholder; replace it with the real site and selector
    url = f'https://www.example.com/anime?page={page}'
    headers = {'User-Agent': 'Mozilla/5.0'}
    resp = requests.get(url, headers=headers, timeout=10)
    return resp.text

def parse_page(html):
    soup = BeautifulSoup(html, 'html.parser')
    titles = soup.select('.anime-title')
    return [title.text.strip() for title in titles]

all_titles = []
page = 1
while True:
    print(f"Fetching page {page}...")
    html = fetch_page(page)
    titles = parse_page(html)
    if not titles:   # an empty page means there is nothing left
        break
    all_titles.extend(titles)
    page += 1
    time.sleep(1)

print(f"Scraped {len(all_titles)} entries in total")
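
The loop above stops only when a page comes back empty, so a site that keeps returning the same content for any page number would make it run forever. A possible safeguard, sketched here with hypothetical names (`crawl_pages`, `fake_fetch`, `FAKE_SITE`), is to separate the paging logic from the fetching and cap the number of pages. The fake fetch function stands in for a real network call so the example runs offline.

```python
from typing import Callable, Iterator, List


def crawl_pages(fetch_titles: Callable[[int], List[str]], max_pages: int = 50) -> Iterator[str]:
    """Yield titles page by page until a page is empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        titles = fetch_titles(page)
        if not titles:
            break
        yield from titles


# A fake two-page "site" used in place of real HTTP requests.
FAKE_SITE = {1: ["A", "B"], 2: ["C"]}


def fake_fetch(page: int) -> List[str]:
    return FAKE_SITE.get(page, [])


collected = list(crawl_pages(fake_fetch))
print(collected)  # ['A', 'B', 'C']

# A site that never returns an empty page is still cut off at max_pages.
capped = list(crawl_pages(lambda page: ["x"], max_pages=3))
print(capped)  # ['x', 'x', 'x']
```

In a real crawler, `fetch_titles` would fetch and parse one page (including the polite `time.sleep`), while `crawl_pages` stays unchanged and easy to test.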

Case 5: Saving data to a CSV file

"""
save_to_csv.py
Write the scraped data to a CSV file
"""

import csv
# Assumes all_movies is the list of dicts scraped in Case 3
# utf-8-sig adds a BOM so that Excel displays Chinese text correctly
with open('douban_top250.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'rating', 'people'])
    writer.writeheader()
    writer.writerows(all_movies)
print("Data saved to douban_top250.csv")
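
Since the snippet above depends on `all_movies` from an earlier script, here is a self-contained sketch of the same `csv.DictWriter` pattern that also reads the data back. The `sample_movies` records are made up, and an in-memory `io.StringIO` buffer replaces the file so nothing is written to disk.

```python
import csv
import io

# Hypothetical records with the same keys as the scraped movies.
sample_movies = [
    {"title": "Movie A", "rating": "9.0", "people": 1000},
    {"title": "Movie B", "rating": "8.5", "people": 250},
]

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "rating", "people"])
writer.writeheader()
writer.writerows(sample_movies)

# Read the rows back; every value comes back as a string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))

print(rows[0]["title"], rows[0]["people"], type(rows[0]["people"]).__name__)
```

Note that CSV stores no type information: `people` was written as an integer but is read back as the string `'1000'`, so convert it with `int()` when loading.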

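Case 6: Handling relative URLs and downloading images

"""
download_images.py
Extract image links from a page and download them
"""

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/gallery'
headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(resp.text, 'html.parser')
img_tags = soup.select('img')
os.makedirs('images', exist_ok=True)
for i, img in enumerate(img_tags):
    src = img.get('src')
    if not src:
        continue  # skip <img> tags without a src attribute
    # urljoin keeps absolute URLs as they are and resolves relative ones against the page URL
    img_url = urljoin(url, src)
    try:
        img_resp = requests.get(img_url, headers=headers, timeout=3)
        img_resp.raise_for_status()
        # For simplicity every file is saved as .jpg, whatever its real format
        with open(f'images/img_{i}.jpg', 'wb') as f:
            f.write(img_resp.content)
        print(f"Downloaded {img_url}")
    except (requests.RequestException, OSError) as e:
        print(f"Failed: {e}")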
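
Resolving relative image paths is the part of this case that most often goes wrong, so here is a small offline sketch of how `urllib.parse.urljoin` treats the common forms of `src`. The page URL, the sample `src` values, and the `resolve_image_urls` helper are all hypothetical; skipping `data:` URIs is an extra choice made for this sketch.

```python
from urllib.parse import urljoin

# Hypothetical page URL used as the base for resolution.
PAGE_URL = "https://example.com/gallery/index.html"


def resolve_image_urls(page_url, srcs):
    """Turn raw src values into absolute URLs, skipping missing or inline ones."""
    resolved = []
    for src in srcs:
        if not src or src.startswith("data:"):
            continue  # no src attribute, or an inline data URI with nothing to download
        resolved.append(urljoin(page_url, src))
    return resolved


raw_srcs = [
    "img/a.jpg",                    # relative to the page's directory
    "/static/b.png",                # relative to the site root
    "//cdn.example.com/c.gif",      # protocol-relative: inherits https
    "https://other.example/d.jpg",  # already absolute: returned unchanged
    None,                           # an <img> tag without src
]

for image_url in resolve_image_urls(PAGE_URL, raw_srcs):
    print(image_url)
```

This prints `https://example.com/gallery/img/a.jpg`, `https://example.com/static/b.png`, `https://cdn.example.com/c.gif`, and `https://other.example/d.jpg`, with the missing `src` skipped.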

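⚠️ Common Pitfalls

No.	Pitfall	Consequence	Solution
1	Forgetting to set User-Agent and sending the default python-requests value	The server may reject the request (403)	Set a common browser User-Agent
2	Not handling network exceptions	The program crashes	Use `try-except` to catch `RequestException`
3	Requesting too frequently	Your IP gets blocked	Add a delay with `time.sleep()`
4	Not specifying a BeautifulSoup parser	A different parser may be chosen on each machine, causing warnings or parsing differences	Specify `'html.parser'` or `'lxml'` explicitly
5	Selectors that are too rigid	The code breaks when the page structure changes	Prefer stable ids and classes; avoid relying on positional indexes
6	Encoding problems that produce garbled text	Chinese text displays incorrectly	Set `resp.encoding` according to the response headers or the meta tag
7	Forgetting about redirects	The page you receive may not be at the URL you requested	Requests follows redirects by default, so usually nothing needs to be done; inspect `resp.url` and `resp.history` to confirm
8	Ignoring robots.txt and request rate	Possible legal risk	Check robots.txt and set a reasonable delay
9	Reading attributes by direct indexing (`tag['href']`) instead of `get('href')`	KeyError when the attribute is missing	Use `.get()`, which returns None or a default you supply
10	Creating a new session on every loop iteration	Inefficient; connections cannot be reused	Create one `requests.Session()` and reuse it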