当前位置：首页 > news >正文

山东网站建设百度手机助手应用商店下载

news 2025/8/3 20:18:05

山东网站建设,百度手机助手应用商店下载,wordpress添加广告插件,知识产权代理目录：1 数据持久化存储，写入Mysql数据库①定义结构化字段：②重新编写爬虫文件：③编写管道文件：④辅助配置（修改settings.py文件）：⑤navicat创库建表：⑥ 效果如下&#xf…

1 数据持久化存储，写入Mysql数据库

①定义结构化字段：

（items.py文件的编写）：

# -*- coding: utf-8 -*-# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.htmlimport scrapyclass NovelItem(scrapy.Item):'''匹配每个书籍URL并解析获取一些信息创建的字段'''# define the fields for your item here like:# name = scrapy.Field()category = scrapy.Field()book_name = scrapy.Field()author = scrapy.Field()status = scrapy.Field()book_nums = scrapy.Field()description = scrapy.Field()c_time = scrapy.Field()book_url = scrapy.Field()catalog_url = scrapy.Field()class ChapterItem(scrapy.Item):'''从每个小说章节列表页解析当前小说章节列表一些信息所创建的字段'''# define the fields for your item here like:# name = scrapy.Field()chapter_list = scrapy.Field()class ContentItem(scrapy.Item):'''从小说具体章节里解析当前小说的当前章节的具体内容所创建的字段'''# define the fields for your item here like:# name = scrapy.Field()content = scrapy.Field()chapter_url = scrapy.Field()

②重新编写爬虫文件：

（将解析的数据对应到字段里，并将其yield返回给管道文件pipelines.py）

# -*- coding: utf-8 -*-
import datetimeimport scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rulefrom ..items import NovelItem,ChapterItem,ContentItemclass Bh3Spider(CrawlSpider):name = 'zh'allowed_domains = ['book.zongheng.com']start_urls = ['https://book.zongheng.com/store/c0/c0/b0/u1/p1/v0/s1/t0/u0/i1/ALL.html']rules = (# Rule定义爬取规则： 1.提取url（LinkExtractor对象）   2.形成请求    3.响应的处理规则# 源码：Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)# 1.LinkExractor是scrapy框架定义的一个类，它定义如何从每个已爬网页面中提取url链接,并将这些url作为新的请求发送给引擎# 引擎经过一系列操作后将response给到callback所指的回调函数。# allow=r'Items/'的意思是提取链接的正则表达式   【相当于findall(r'Items/',response.text)】# 2.callback='parse_item'是指定回调函数。# 3.follow=True的作用：LinkExtractor提取到的url所生成的response在给callback的同时，还要交给rules匹配所有的Rule规则（有几条遵循几条）# 拿到了书籍的url                                  回调函数                            process_links用于处理LinkExtractor匹配到的链接的回调函数# 匹配每个书籍的urlRule(LinkExtractor(allow=r'https://book.zongheng.com/book/\d+.html',restrict_xpaths=("//div[@class='bookname']")), callback='parse_book', follow=True,process_links="process_booklink"),# 匹配章节目录的urlRule(LinkExtractor(allow=r'https://book.zongheng.com/showchapter/\d+.html',restrict_xpaths=('//div[@class="fr link-group"]')), callback='parse_catalog', follow=True),# 章节目录的url生成的response，再来进行具体章节内容的url的匹配     之后此url会形成response，交给callback函数Rule(LinkExtractor(allow=r'https://book.zongheng.com/chapter/\d+/\d+.html',restrict_xpaths=('//ul[@class="chapter-list clearfix"]')), callback='get_content',follow=False, process_links="process_chapterlink"),# restrict_xpaths是LinkExtractor里的一个参数。作用：过滤（对前面allow匹配到的url进行区域限制），只允许此参数匹配的allow允许的url通过此规则！！！)def process_booklink(self, links):for index, link in enumerate(links):# 限制一本书if index == 0:print("限制一本书：", link.url)yield linkelse:returndef process_chapterlink(self, links):for index,link in enumerate(links):#限制21章内容if index<=20:print("限制20章内容：",link.url)yield linkelse:returndef parse_book(self, response):print("解析book_url")# 字数：book_nums = response.xpath('//div[@class="nums"]/span/i/text()').extract()[0]# 书名：book_name = response.xpath('//div[@class="book-name"]/text()').extract()[0].strip()category = response.xpath('//div[@class="book-label"]/a/text()').extract()[1]author = response.xpath('//div[@class="au-name"]/a/text()').extract()[0]status = response.xpath('//div[@class="book-label"]/a/text()').extract()[0]description = "".join(response.xpath('//div[@class="book-dec Jbook-dec hide"]/p/text()').extract())c_time = datetime.datetime.now()book_url = response.urlcatalog_url = response.css("a").re("https://book.zongheng.com/showchapter/\d+.html")[0]item=NovelItem()item["category"]=categoryitem["book_name"]=book_nameitem["author"]=authoritem["status"]=statusitem["book_nums"]=book_numsitem["description"]=descriptionitem["c_time"]=c_timeitem["book_url"]=book_urlitem["catalog_url"]=catalog_urlyield itemdef parse_catalog(self, response):print("解析章节目录", response.url)  # response.url就是数据的来源的url# 注意：章节和章节的url要一一对应a_tags = response.xpath('//ul[@class="chapter-list clearfix"]/li/a')chapter_list = []for index, a in enumerate(a_tags):title = a.xpath("./text()").extract()[0]chapter_url = a.xpath("./@href").extract()[0]ordernum = index + 1c_time = datetime.datetime.now()catalog_url = response.urlchapter_list.append([title, ordernum, c_time, chapter_url, catalog_url])item=ChapterItem()item["chapter_list"]=chapter_listyield itemdef get_content(self, response):content = "".join(response.xpath('//div[@class="content"]/p/text()').extract())chapter_url = response.urlitem=ContentItem()item["content"]=contentitem["chapter_url"]=chapter_urlyield item

③编写管道文件：

（pipelines.py文件）
数据存储到MySql数据库分三步走：
①存储小说信息；
②存储除了章节具体内容以外的章节信息（因为：首先章节信息是有序的；其次章节具体内容是在一个新的页面里，需要发起一次新的请求）；
③更新章节具体内容信息到第二步的表中。

# -*- coding: utf-8 -*-# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.htmlimport pymysql
import logging
from .items import NovelItem,ChapterItem,ContentItem
logger=logging.getLogger(__name__)      #生成以当前文件名命名的logger对象。 用日志记录报错。class ZonghengPipeline(object):def open_spider(self,spider):# 连接数据库data_config = spider.settings["DATABASE_CONFIG"]if data_config["type"] == "mysql":self.conn = pymysql.connect(**data_config["config"])self.cursor = self.conn.cursor()def process_item(self, item, spider):# 写入数据库if isinstance(item,NovelItem):#写入书籍信息sql="select id from novel where book_name=%s and author=%s"self.cursor.execute(sql,(item["book_name"],item["author"]))if not self.cursor.fetchone():          #.fetchone()获取上一个查询结果集。在python中如果没有则为Nonetry:#如果没有获得一个id，小说不存在才进行写入操作sql="insert into novel(category,book_name,author,status,book_nums,description,c_time,book_url,catalog_url)"\"values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"self.cursor.execute(sql,(item["category"],item["book_name"],item["author"],item["status"],item["book_nums"],item["description"],item["c_time"],item["book_url"],item["catalog_url"],))self.conn.commit()except Exception as e:          #捕获异常并日志显示self.conn.rollback()logger.warning("小说信息错误!url=%s %s")%(item["book_url"],e)return itemelif isinstance(item,ChapterItem):#写入章节信息try:sql="insert into chapter (title,ordernum,c_time,chapter_url,catalog_url)"\"values(%s,%s,%s,%s,%s)"#注意：此处item的形式是！  item["chapter_list"]====[(title,ordernum,c_time,chapter_url,catalog_url)]chapter_list=item["chapter_list"]self.cursor.executemany(sql,chapter_list)     #.executemany()的作用：一次操作，写入多个元组的数据。形如：.executemany(sql,[(),()])self.conn.commit()except Exception as e:self.conn.rollback()logger.warning("章节信息错误!%s"%e)return itemelif isinstance(item,ContentItem):try:sql="update chapter set content=%s where chapter_url=%s"content=item["content"]chapter_url=item["chapter_url"]self.cursor.execute(sql,(content,chapter_url))self.conn.commit()except Exception as e:self.conn.rollback()logger.warning("章节内容错误!url=%s %s") % (item["chapter_url"], e)return itemdef close_spider(self,spider):# 关闭数据库self.cursor.close()self.conn.close()

④辅助配置（修改settings.py文件）：

第一个：关闭robots协议；
第二个：开启延迟；
第三个：加入头文件；
第四个：开启管道：

在这里插入图片描述

第五个：配置连接Mysql数据库的参数：

DATABASE_CONFIG={"type":"mysql","config":{"host":"localhost","port":3306,"user":"root","password":"123456","db":"zongheng","charset":"utf8"}
}

⑤navicat创库建表：

（1）创库：

在这里插入图片描述

（2）建表：（注意：总共需要建两张表！）

存储小说基本信息的表，表名为novel
存储小说具体章节内容的表，表名为chapter:

注意id不要忘记设自增长了！

⑥ 效果如下：

在这里插入图片描述

拓展操作：
如果来回调试有问题的话，需要删除表中所有数据重新爬取，直接使用navicate删除表中所有数据（即delete操作），那么id自增长就不会从1开始了。
这该咋办呢？

文章转载自：
http://lessness.c7496.cn
http://alarmist.c7496.cn
http://oniomania.c7496.cn
http://programable.c7496.cn
http://spizzerinctum.c7496.cn
http://mesmerisation.c7496.cn
http://thionin.c7496.cn
http://platysma.c7496.cn
http://torah.c7496.cn
http://rigaudon.c7496.cn
http://interferometry.c7496.cn
http://syringa.c7496.cn
http://elliptical.c7496.cn
http://noma.c7496.cn
http://celebrative.c7496.cn
http://depressible.c7496.cn
http://fidge.c7496.cn
http://anoopsia.c7496.cn
http://autoflare.c7496.cn
http://unreachable.c7496.cn
http://prophesy.c7496.cn
http://cacafuego.c7496.cn
http://perfluorochemical.c7496.cn
http://obliviscence.c7496.cn
http://swindle.c7496.cn
http://excurse.c7496.cn
http://huzzy.c7496.cn
http://compliantly.c7496.cn
http://sharrie.c7496.cn
http://gyropilot.c7496.cn
http://creeper.c7496.cn
http://telebit.c7496.cn
http://adjoint.c7496.cn
http://seven.c7496.cn
http://holoscopic.c7496.cn
http://taylor.c7496.cn
http://gusla.c7496.cn
http://airflow.c7496.cn
http://ganof.c7496.cn
http://narrowness.c7496.cn
http://marxist.c7496.cn
http://ostomy.c7496.cn
http://interamnian.c7496.cn
http://kamaishi.c7496.cn
http://zila.c7496.cn
http://oleaginous.c7496.cn
http://imponent.c7496.cn
http://creativity.c7496.cn
http://bloodmobile.c7496.cn
http://abborrent.c7496.cn
http://hilltop.c7496.cn
http://reveller.c7496.cn
http://vizor.c7496.cn
http://catchweed.c7496.cn
http://resterilize.c7496.cn
http://placeman.c7496.cn
http://cote.c7496.cn
http://unbreathable.c7496.cn
http://nail.c7496.cn
http://olfactometer.c7496.cn
http://cochlea.c7496.cn
http://quiesce.c7496.cn
http://parkinsonism.c7496.cn
http://paraceisian.c7496.cn
http://challie.c7496.cn
http://grabber.c7496.cn
http://sulfury.c7496.cn
http://picotite.c7496.cn
http://postflight.c7496.cn
http://flash.c7496.cn
http://moon.c7496.cn
http://sinify.c7496.cn
http://foochow.c7496.cn
http://benadryl.c7496.cn
http://memotron.c7496.cn
http://treblinka.c7496.cn
http://immie.c7496.cn
http://spirocheticide.c7496.cn
http://commodiously.c7496.cn
http://viscosity.c7496.cn
http://nebulose.c7496.cn
http://ultrared.c7496.cn
http://filigree.c7496.cn
http://beetleheaded.c7496.cn
http://paronym.c7496.cn
http://barcelona.c7496.cn
http://discommend.c7496.cn
http://rebound.c7496.cn
http://ichthyolitic.c7496.cn
http://asway.c7496.cn
http://leapingly.c7496.cn
http://gelatinous.c7496.cn
http://oracular.c7496.cn
http://taffrail.c7496.cn
http://resume.c7496.cn
http://isodynamicline.c7496.cn
http://areaway.c7496.cn
http://panful.c7496.cn
http://keresan.c7496.cn
http://voidance.c7496.cn

查看全文

http://www.zhongyajixie.com/news/79186.html

建设游戏网站需要哪些设备济南做网站公司哪家好

济南微网站搜索引擎seo关键词优化方法

零基础做动态网站需要多久百度指数查询官网入口登录

wordpress 企业站外贸推广营销公司

网站服务器配置如何让百度快速收录

000webhost wordpress杭州百度seo代理

兰州企业网站建设多少钱竞价恶意点击立案标准

网站建设公司青岛郑州网站运营专业乐云seo

建筑工程管理软件网站seo检测

专业的河南网站建设公司口碑优化

wordpress 访问空白页好的seo公司营销网

网页设计与制作教程ppt免费下载seo关键词查询排名软件

类似酷家乐做庭院的网站小红书信息流广告投放

目录：

1 数据持久化存储，写入Mysql数据库

①定义结构化字段：

②重新编写爬虫文件：

③编写管道文件：

④辅助配置（修改settings.py文件）：

⑤navicat创库建表：

⑥ 效果如下：

相关文章：