Source Code Share: Crawling Posts from the xx Website Forum

Latest recommended article published on 2026-06-14 13:44:50

This content is licensed under CC 4.0 BY-SA.

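This post presents a Python script that fetches forum posts in bulk with `urllib`. It iterates from a start ID to an end ID, requests each post's URL, and reads the page content. Pages containing one of several known error messages are skipped; for all others, regular expressions extract the post title, content, author ID, and forum ID, which can then be stored in a database.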
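The skip-or-extract step described above can be tried offline, without touching the network. The following is a minimal sketch under stated assumptions: `parse_post` and `SAMPLE_PAGE` are hypothetical names, and the sample markup is invented to fit the regular expressions rather than copied from the real site.

```python
import re

# Site messages that mark a page with no usable post (kept in the site's own language).
ERROR_MARKERS = ("帖子不存在", "错误,帖子", "该论坛不存在", "访问错误", "此帖审核中", "分版权限")


def parse_post(page):
    """Return the post fields as a dict, or None for an error page or a missing field."""
    if any(marker in page for marker in ERROR_MARKERS):
        return None
    title = re.search(r'<card id="main" title="(.+?)">', page)
    content = re.search(r'name="content" value="(.+?)" />', page)
    author = re.search(r"userid=(\d+)", page)
    forum = re.search(r"bid=(\d+)", page)
    if not (title and content and author and forum):
        return None  # at least one field could not be found
    return {
        "title": title.group(1),
        "content": content.group(1),
        "author_id": int(author.group(1)),
        "forum_id": int(forum.group(1)),
    }


# Hypothetical page, written by hand to match the patterns above.
SAMPLE_PAGE = (
    '<card id="main" title="Hello">'
    '<a href="user.php?userid=42">author</a>'
    '<a href="list.php?bid=7">forum</a>'
    '<postfield name="content" value="First post" />'
    "</card>"
)

print(parse_post(SAMPLE_PAGE))
print(parse_post("帖子不存在"))
```

Returning `None` for both error pages and incomplete pages keeps the calling loop simple: it only needs one check before storing a result.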

import re
import time
import urllib.request


def updatepostinfo(startid, endid):
    """Fetch every post from startid to endid (inclusive) and extract its fields."""
    # Pages containing any of these site messages carry no post and are skipped.
    errors = ("帖子不存在", "错误,帖子", "该论坛不存在", "访问错误", "此帖审核中", "分版权限")
    for num in range(int(startid), int(endid) + 1):
        time.sleep(2)  # pause between requests to avoid hammering the site
        print(num)
        # "网站" is the author's placeholder for the real host name.
        posturl = "http:网站/detailnew.php?id=" + str(num)
        postres = urllib.request.urlopen(posturl).read().decode()
        print(postres)
        if any(msg in postres for msg in errors):
            continue
        posttitle = re.findall(r'<card id="main" title="(.+?)">', postres)
        postcontent = re.findall(r'name="content" value="(.+?)" />', postres)
        postauthid = re.findall(r"userid=(\d+)", postres)
        postforumid = re.findall(r"bid=(\d+)", postres)
        if not (posttitle and postcontent and postauthid and postforumid):
            continue  # a field is missing, so indexing [0] would raise IndexError
        posttitle = posttitle[0]
        postcontent = postcontent[0]
        postauthid = postauthid[0]
        postforumid = postforumid[0]
        # Store the fields above in the corresponding database columns here.
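The script leaves the storage step as a comment and does not say which database is used. As one possible way to fill it in, here is a minimal sketch that assumes SQLite through the standard-library `sqlite3` module. The `save_post` helper, the `posts` table, and its column names are all hypothetical.

```python
import sqlite3


def save_post(conn, post_id, title, content, author_id, forum_id):
    """Insert or update one post row; the table layout is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, title TEXT, content TEXT, "
        "author_id INTEGER, forum_id INTEGER)"
    )
    # Parameter placeholders keep scraped text from being interpreted as SQL.
    conn.execute(
        "INSERT OR REPLACE INTO posts (id, title, content, author_id, forum_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (post_id, title, content, author_id, forum_id),
    )
    conn.commit()


# In-memory database for demonstration; pass a file path to persist the data.
conn = sqlite3.connect(":memory:")
save_post(conn, 1, "Hello", "First post", 42, 7)
save_post(conn, 1, "Hello again", "Edited", 42, 7)  # same id replaces the earlier row
rows = conn.execute("SELECT id, title, author_id FROM posts").fetchall()
print(rows)
```

Using the post ID as the primary key with `INSERT OR REPLACE` means a range of IDs can be crawled again without creating duplicate rows.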