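Preface
Many websites now render their pages in the browser with JavaScript, which means the data is not present in the raw HTML that a crawler downloads.
This post therefore uses Scrapy-Splash, which adds JavaScript rendering to Scrapy.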
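To make the problem concrete, here is a minimal sketch that uses only the standard library. The HTML string is a made-up stand-in for a JavaScript-rendered page (only the element id `txt_chp` is borrowed from the spider later in this post), and the `TextById` / `text_by_id` helpers are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML as a plain HTTP client would receive it:
# the target <div> stays empty until a browser runs the script.
RAW_HTML = """
<html><body>
<div id="txt_chp"></div>
<script>
  document.getElementById("txt_chp").innerText = "filled in by JavaScript";
</script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text inside the element with a given id.

    Sketch only: void tags such as <br> inside the target are not handled.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, target_id):
    parser = TextById(target_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# No JavaScript has run, so the element is still empty.
print(repr(text_by_id(RAW_HTML, "txt_chp")))
```

A rendering service such as Splash runs the script first, so the same lookup on the rendered HTML would find the text.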
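I. Installing Scrapy-Splash
First, one thing to understand: the Splash rendering service that Scrapy-Splash talks to is most easily run in a Docker container, so Docker has to be set up beforehand.
1. Install Docker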
https://blog.csdn.net/Eternal_Blue/article/details/96855986
2. Once Docker is installed, start the Splash service in a container by running this command:
docker run -d -p 8050:8050 scrapinghub/splash
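3. Once the container is running, test it from a browser; the IP is the address of the server where Splash was installed. If the Splash page opens, the installation succeeded.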
http://192.168.1.104:8050
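The same check can be scripted. The sketch below relies on Splash's `_ping` health-check endpoint, which answers with a small JSON document containing a `status` field; the function names `ping_url` and `splash_is_up` are hypothetical, and the address is simply the one used above.

```python
import json
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import urlopen


def ping_url(splash_url):
    """Build the URL of Splash's `_ping` health-check endpoint."""
    return urljoin(splash_url.rstrip("/") + "/", "_ping")


def splash_is_up(splash_url, timeout=5):
    """Return True if the Splash instance answers its ping with status "ok"."""
    try:
        with urlopen(ping_url(splash_url), timeout=timeout) as resp:
            return json.load(resp).get("status") == "ok"
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON answer: treat as "down".
        return False


# Example call (replace the address with your own server's IP):
# splash_is_up("http://192.168.1.104:8050")
```

This is handy as a pre-flight check before starting a crawl, since a spider pointed at an unreachable Splash instance only produces download errors.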

II. The Code
Install the Python package first:
pip install scrapy-splash
import time

import scrapy
from scrapy_splash import SplashRequest


class MainSpider(scrapy.Spider):
    name = "test"
    start_urls = ["https://chp.shadiao.app/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Lua script for Splash's "execute" endpoint: open the page, scroll
        # down to trigger lazy loading, wait, then return the rendered HTML.
        self.script = """
        function main(splash)
            --splash.response_body_enabled = true
            splash.private_mode_enabled = false
            splash.images_enabled = false
            splash:set_user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
            splash:set_viewport_size(1028, 10000)
            splash:go(splash.args.url)
            local scroll_to = splash:jsfunc("window.scrollTo")
            scroll_to(0, 5000)
            splash:wait(15)
            return {
                html = splash:html()
            }
        end
        """
        # "lua_source" is only honored by the "execute" endpoint.
        self.splash_args = {"lua_source": self.script}

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, endpoint="execute",
                                args=self.splash_args, dont_filter=True)

    def parse(self, response):
        # The "html" key returned by the Lua script becomes the response body,
        # so the usual selectors work on the rendered page.
        content = "".join(response.xpath('//*[@id="txt_chp"]//text()').extract())
        print(content)
        # Crude throttle kept from the original; note that time.sleep blocks
        # Scrapy's event loop.
        time.sleep(7)
        # Request the same page again; dont_filter=True keeps the duplicate
        # filter from dropping the repeated URL.
        yield SplashRequest(self.start_urls[0], callback=self.parse, endpoint="execute",
                            args=self.splash_args, dont_filter=True)
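When the XPath comes back empty, it can help to look at what Splash actually renders, outside of Scrapy. The sketch below builds a GET URL for Splash's `render.html` endpoint using its `url`, `wait`, and `images` parameters; the helper name `render_html_url` is hypothetical, and the Splash address is the one from the installation step.

```python
from urllib.parse import urlencode


def render_html_url(splash_url, page_url, wait=3, images=0):
    """Build a GET URL for Splash's render.html endpoint.

    wait   -- seconds Splash waits after the page loads before returning HTML
    images -- 0 disables image loading, 1 enables it
    """
    query = urlencode({"url": page_url, "wait": wait, "images": images})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


print(render_html_url("http://192.168.1.104:8050", "https://chp.shadiao.app/"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, returns the rendered HTML, which makes it easier to tell whether a missing element is a rendering problem or a selector problem.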
III. Configuration File (settings.py)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    # The project's own downloader middleware
    'crawlerDemo.middlewares.CrawlerdemoDownloaderMiddleware': 1,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Address of the Splash service
SPLASH_URL = 'http://xxx.xxx.xxx.xxx:8050'
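The scrapy-splash README also suggests two further settings that the configuration above does not include. They are shown here as an optional sketch, not as part of the original setup; newer releases of scrapy-splash and Scrapy may recommend different names, so check the README for the version you have installed.

```python
# Optional additions to settings.py, as suggested by the scrapy-splash README.

# Make the duplicate filter take Splash arguments into account, so requests
# that go to the same Splash endpoint are compared by their real target.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# Only relevant when Scrapy's HTTP cache is enabled.
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```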

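IV. With this in place, running the Scrapy project (for example, scrapy crawl test) renders pages through Splash
This post described how to combine Scrapy with Splash to crawl pages whose content is rendered dynamically by JavaScript. It first walked through the installation: installing Docker and running Splash in a container. It then covered the spider code and the settings needed for the Scrapy project to render pages through Splash. With this approach, data that a plain download misses because it is loaded dynamically can be scraped.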