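With nothing much to do, I crawled my beloved Bilibili~~~ (RIP)
First, open Bilibili's anime (bangumi) index page.
PS: I used to browse this index page all the time looking for anime to watch, so I know my way around it~ (smirk)
Paging through the results does not change the URL. The Network tab in Chrome DevTools shows that each page turn fires an XHR request whose response is JSON.
I pasted the response into Atom to see what the data looks like.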
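To make the response shape concrete, here is a small sketch that decodes a hand-written sample with the standard json module. The sample is hypothetical: only the key names (result, data, title, index_show, is_finish, order and so on) are taken from the spider code later in this post, and every value is made up.

```python
import json

# Hypothetical sample; key names follow the spider in this post, values are invented.
SAMPLE = '''
{
  "result": {
    "data": [
      {
        "title": "Example Show",
        "cover": "https://example.com/cover.jpg",
        "index_show": "12 episodes",
        "is_finish": 1,
        "link": "https://example.com/show/1",
        "order": {"follow": "1000", "play": "5000", "score": "9.5"}
      }
    ]
  }
}
'''


def list_entries(raw):
    """Decode the JSON text and return the list stored under result -> data."""
    return json.loads(raw)["result"]["data"]


entries = list_entries(SAMPLE)
for entry in entries:
    # Show the title and which keys each entry carries.
    print(entry["title"], sorted(entry.keys()))
```

Each element of the list is one show, and the counters (follow, play, score) sit one level down under order.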
To handle pagination, look for the pattern in the query string: of all those parameters, only page changes from one request to the next.
So the rest is easy~ hehe
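Since only page varies, the page URLs can be generated instead of copied by hand. The following is a minimal sketch using the standard library; the parameter names and values come from the URL used in the spider (including the copyright parameter), while FIXED_PARAMS and page_url are illustrative names of my own. urlencode puts page last rather than in its original position, which I assume the server does not care about.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = "https://bangumi.bilibili.com/media/web_api/search/result"

# Parameters that stay the same on every request (copied from the spider's URL).
FIXED_PARAMS = {
    "season_version": -1,
    "area": -1,
    "is_finish": -1,
    "copyright": -1,
    "season_status": -1,
    "season_month": -1,
    "pub_date": -1,
    "style_id": -1,
    "order": 3,
    "st": 1,
    "sort": 0,
    "season_type": 1,
    "pagesize": 20,
}


def page_url(page):
    """Build the request URL for one page of the index."""
    params = dict(FIXED_PARAMS, page=page)
    return BASE_URL + "?" + urlencode(params)


# Print the first three page URLs.
for page in range(1, 4):
    print(page_url(page))
```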
items.py
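import scrapy
from scrapy import Field


class BilibiliItem(scrapy.Item):
    title = Field()
    cover = Field()      # cover image URL
    sum_index = Field()  # episode count text (index_show in the response)
    is_finish = Field()  # finished / ongoing
    link = Field()
    follow = Field()
    plays = Field()
    score = Field()
    _id = Field()        # used as the MongoDB primary key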

# Spider module, placed under bilibili/spiders/ as in any Scrapy project
import scrapy
import demjson  # third-party library, install it with pip first
from bilibili.items import BilibiliItem

# Only the page parameter changes between requests.
URL_TEMPLATE = 'https://bangumi.bilibili.com/media/web_api/search/result?season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&pub_date=-1&style_id=-1&order=3&st=1&sort=0&page={0}&season_type=1&pagesize=20'


class BzhanSpider(scrapy.Spider):
    name = 'bzhan'
    allowed_domains = ['bilibili.com']
    start_urls = [URL_TEMPLATE.format(1)]

    def parse(self, response):
        json_content = demjson.decode(response.body)
        datas = json_content["result"]["data"]
        for data in datas:
            # Create a fresh item per entry so yielded items do not share state.
            item = BilibiliItem()
            title = data['title']
            item['_id'] = title
            item['title'] = title
            item['cover'] = data['cover']
            item['sum_index'] = data['index_show']
            # 已完结 = finished, 未完结 = ongoing
            item['is_finish'] = '已完结' if data['is_finish'] == 1 else '未完结'
            item['link'] = data['link']
            item['follow'] = data['order']['follow']
            item['plays'] = data['order']['play']
            # Some entries have no score yet; fall back to 未知 (unknown).
            item['score'] = data['order'].get('score', '未知')
            yield item

        # Queue pages 2..155. This runs on every response, but Scrapy's
        # duplicate filter drops requests it has already seen.
        for k in range(2, 156):
            yield scrapy.Request(URL_TEMPLATE.format(k), callback=self.parse)

Parsing is just a matter of indexing into the decoded Python dict.. nothing hard.
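As an illustration of that dict-style parsing outside of Scrapy, here is a minimal sketch of a helper that flattens one entry into a plain dict. extract_record is a hypothetical name, and the sample entry is invented; the key names and the fallback strings mirror the spider. Unlike the spider, it uses .get for the counters so a missing key yields a default instead of raising.

```python
def extract_record(entry):
    """Flatten one element of result -> data into a plain dict."""
    order = entry.get("order", {})
    return {
        "_id": entry["title"],
        "title": entry["title"],
        "cover": entry["cover"],
        "sum_index": entry["index_show"],
        # 已完结 = finished, 未完结 = ongoing
        "is_finish": "已完结" if entry["is_finish"] == 1 else "未完结",
        "link": entry["link"],
        "follow": order.get("follow"),
        "plays": order.get("play"),
        # 未知 = unknown, used when the show has no score yet
        "score": order.get("score", "未知"),
    }


# Invented entry with no score, to exercise the fallback.
sample_entry = {
    "title": "Example Show",
    "cover": "https://example.com/cover.jpg",
    "index_show": "12 episodes",
    "is_finish": 0,
    "link": "https://example.com/show/1",
    "order": {"follow": "1000", "play": "5000"},
}

record = extract_record(sample_entry)
print(record)
```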
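
# Item pipeline (pipelines.py in a default Scrapy project layout)
import pymongo


class BilibiliPipeline(object):
    def open_spider(self, spider):
        # Open one connection per crawl instead of one per item.
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['mydb']['bilibili']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store a plain dict copy; _id is the title, so each show is one document.
        self.collection.insert_one(dict(item))
        return item
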
settings.py is omitted......
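For readers who want something to start from, below is a rough sketch of what the relevant part of settings.py might contain. These are not the author's settings: the module paths assume a project named bilibili with the pipeline class in bilibili/pipelines.py, and the delay value is an arbitrary choice of mine. The setting names themselves (BOT_NAME, SPIDER_MODULES, NEWSPIDER_MODULE, ITEM_PIPELINES, DOWNLOAD_DELAY) are standard Scrapy settings.

```python
# settings.py (sketch; values are assumptions, not the author's settings)
BOT_NAME = "bilibili"

SPIDER_MODULES = ["bilibili.spiders"]
NEWSPIDER_MODULE = "bilibili.spiders"

# Enable the MongoDB pipeline; assumes the class lives in bilibili/pipelines.py.
# The number (0-1000) sets the order when several pipelines are enabled.
ITEM_PIPELINES = {
    "bilibili.pipelines.BilibiliPipeline": 300,
}

# Pause between requests, in seconds, to go easy on the server (arbitrary value).
DOWNLOAD_DELAY = 1
```

Without the ITEM_PIPELINES entry the pipeline class is never called, so nothing would reach MongoDB.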
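The crawl ends up with a bit over three thousand records, which is consistent with 155 pages of 20 entries each (at most 3,100).
Feeling sorry for my Bilibili for a second..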
Published by: 全栈程序员-用户IM. When reposting, please credit the source: https://javaforall.cn/172278.html Original link: https://javaforall.cn