分类导航

程序笔记发布时间：2022-07-05 发布网站：大佬教程 code.js-code.com

大佬教程收集整理的这篇文章主要介绍了你只认识大众汽车的车标怎么能行？赶紧用python采集所有车标学习一下，大佬教程大佬觉得挺不错的，现在分享给大家，也给大家做个参考。

本篇博客我们将学习如何通过 scrapy 批量下载文件࿰c;还能学习通过密码解压缩包？

目标站点分析

本次要采集的目标站点为：【车标网】࿰c;最终获取的数据是车标的的矢量图。

你只认识大众汽车的车标怎么能行？赶紧用python采集所有车标学习一下

在测试过程中发现࿰c;最后下载的压缩包存在解压密码࿰c;所以增加一个 通过密码解压缩的步骤。

素材的下载地址࿰c;比较容易获取到࿰c;通过地址 logo.php?id=119 修改为 download.php?id=119 即可实现。

其中车标列表页所在的页面无迭代规则࿰c;按照分类获取即可：

http://www.chebiao.net/domestic.php
http://www.chebiao.net/es.php
http://www.chebiao.net/jsk.php
http://www.chebiao.net/other.php
http://www.chebiao.net/famous.php

编写时间

scrapy 基本创建不在过多说明࿰c;spider 文件夹中的爬虫文件名为 cb.py࿰c;首先通过 start_urls 的设置࿰c;采集所有分类数据。

import scrapy
from chebiao.items import ChebiaoItem
from urllib.parse import urlparse


class CbSpider(scrapy.Spider):
    name = 'cb'
    allowed_domains = ['chebiao.net']
    start_urls = ['http://www.chebiao.net/domestic.php', 'http://www.chebiao.net/es.php',
                  'http://www.chebiao.net/jsk.php', 'http://www.chebiao.net/other.php',
                  'http://www.chebiao.net/famous.php']

接下来就是重点部分了࿰c;我们将启用 FilesPipeline 管道࿰c;用于实现对文件的下载。

使用该模块需要导入如下内容：

from scrapy.pipelines.files import FilesPipeline

FilesPipeline 类继承自 @H_852_17@mediaPipeline࿰c;打开类文件源码࿰c;在其中发现两个比较重要的类变量。

DEFAULT_FILES_URLS_FIELD = 'file_urls'
DEFAULT_FILES_RESULT_FIELD = 'files'

上述两个变量࿰c;如果不进行修改࿰c;后续在 items.py 文件中࿰c;必须要进行声明与赋值。

在查看源码的过程中࿰c;发现了很多可以在 setTings.py 文件初始化的配置࿰c;代码不在整体复制࿰c;本文仅用到了 FILES_STORE࿰c;即下图框选区域࿰c;该值表示文件存储路径。

你只认识大众汽车的车标怎么能行？赶紧用python采集所有车标学习一下

FilesPipeline 类中三个比较重要的方法是：

file_path：遍历 item 中的每一项࿰c;返回一个文件存储地址࿰c;函数原型如下所示： file_path(self, request, response=None, info=None, *, item=NonE)
get_media_requests：遍历 item 中的每一项࿰c;返回一个 request 请求࿰c;请求的结果会传递给 item_completed 方法；
item_completed：媒体请求完毕之后࿰c;返回数据到该方法࿰c;并且该方法需要返回 item 或者 drop item。

这里还要补充一个知识点࿰c;也是在翻阅 scrapy 源码的时候发现的࿰c;在 @H_852_17@mediaPipeline 类中存在如下私有方法。

def _make_compatible(self):
       """Make overridable methods of MediaPipeline and subclasses BACkWARDs compatible"""
       methods = [
           "file_path", "media_to_download", "media_downloaded",
           "file_downloaded", "image_downloaded", "get_images"
       ]

       for method_name in methods:
           method = getattr(self, method_name, None)
           if callable(@H_210_41@method):
               setattr(self, method_name, self._compatible(@H_210_41@method))

上述代码将 file_path 与 file_downloaded 方法进行了兼容࿰c;即 file_path 与 @H_852_17@media_to_download 和 @H_852_17@media_downloaded 功能一致࿰c;file_downloaded 与 image_downloaded 和 get_images 功能一致。

如果你在测试代码时࿰c;发现 file_path() 方法被执行了 3 次࿰c;该原因也是由于 scrapy 框架实现的࿰c;因为 file_path() 方法分别在 @H_852_17@media_to_download()、@H_852_17@media_downloaded()、file_downloaded() 三个方法中被调用过。继续深入学习࿰c;会发现上述 3 个方法的调用顺序࿰c;这里就不在过多展开说明。

通过单文件下载࿰c;测试 scrapy 下载文件流程

接下来通过一个文件࿰c;体验整体的下载流程࿰c;该地址为：http://www.chebiao.net/download.php?id=181。按照一般编写爬虫代码的顺序࿰c;修改如下文件代码： items.py 文件代码如下

import scrapy


class ChebiaoItem(scrapy.Item):
    file_url = scrapy.Field()
    file_name = scrapy.Field()
    files = scrapy.Field()

其中 file_url 表示请求的文件名࿰c;如果不希望自己扩展࿰c;应该用 file_urls 实现࿰c;file_name 为保存文件名࿰c;files 为默认字段࿰c;用于返回文件响应结果相关信息。

cb.py 文件代码如下

import scrapy
from chebiao.items import ChebiaoItem


class CbSpider(scrapy.Spider):
    name = 'cb'
    allowed_domains = ['chebiao.net']
    start_urls = ['http://www.chebiao.net/famous.php']

    def parse(self, response):
        item = ChebiaoItem()
        item['file_name'] = "测试" # 文件名
        item['file_url'] = "http://www.chebiao.net/download.php?id=181"
        yield item

其中 file_name 与 file_url 都为测试值。

setTings.py 文件设置爬虫相关配置

user_ageNT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36' # 用户代理

ROBOTSTXT_OBEY = false # 不访问 robot.txt 文件
DEFAULT_requEST_HEADERS = {@H_944_484@
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Host': 'www.chebiao.net',
    'Referer': 'http://www.chebiao.net/logo.php'
} # 请求头
# 开启数据管道࿰c;尤其注意 ChebiaoFilePipeline 为我们手动创建的管道类࿰c;代码在下文
ITEM_PIPELInes = {@H_944_484@
    'chebiao.pipelines.ChebiaoFilePipeline': 1,
    'chebiao.pipelines.ChebiaoPipeline': 300,
}
FILES_STORE = './files' # 文件存储路径
LOG_LEVEL = 'WARNING'  # 日志等级

配置中的 FILES_STORE 表示文件存储路径࿰c;设置绝对地址与相对地址都可。

pipelines.py 代码文件

from scrapy.pipelines.files import FilesPipeline
# 自定义类࿰c;继承自 FilesPipeline
class ChebiaoFilePipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        print("--get_media_requests--start-----")
        print("正在下载：", item['file_name'])
        print("--get_media_requests--end-----")
        # 请求传递文件名参数
        yield request(item['file_url'], meta={@H_944_484@'title': item['file_name']})

    def file_path(self, request, response=None, info=None):
        print("--file_path--start-----")
        file_name = request.@H_210_41@meta.get('title') + ".rar"
        print("--file_path--end-----")
        return file_name

    def item_completed(self, results, item, info):
        print("--item_completed--start-----")
        print(results)
        print("下载完毕")
        print("--item_completed--end--------")
        return item

上述代码中的 item_completed() 方法࿰c;还有一层含义࿰c;是将数据返回到下一个要执行的管道类࿰c;例如在该方法中为下一个管道增加 items.py 中的参数࿰c;需提前在 items.py 中增加 file_paths 变量。

def item_completed(self, results, item, info):
    print("--item_completed--start-----")
    print(results)
    print("下载完毕")
    file_paths = [x['path'] for ok, x in results if ok]
    adapter = ItemAdapter(item)
    adapter['file_paths'] = file_paths
    print("--item_completed--end--------")
    return item

扩展为多文件下载

只需要修改 cb.py 文件即可实现多文件下载。

import scrapy
from chebiao.items import ChebiaoItem
from urllib.parse import urlparse


class CbSpider(scrapy.Spider):
    name = 'cb'
    allowed_domains = ['chebiao.net']
    start_urls = ['http://www.chebiao.net/domestic.php', 'http://www.chebiao.net/es.php',
                  'http://www.chebiao.net/jsk.php', 'http://www.chebiao.net/other.php',
                  'http://www.chebiao.net/famous.php']

    def parse(self, response):
        down_url = "http://www.chebiao.net/download.php"
        dds = response.xpath("//div[@class='box2']/dl/dd")
        for dd in dds:
            item = ChebiaoItem()
            name = dd.xpath('./a/text()').extract()[0]
            url = dd.xpath('./a/@href').extract()[0]
            url = down_url + "?" + urlparse(url).query
            item['file_name'] = name
            item['file_url'] = url

            yield item

现在我们已经可以下载全站的图片了࿰c;接下来实现解压缩文件。

解压文件

由于文件是 rar 格式࿰c;所以需要安装 unrar 模块࿰c;安装过程非常简单࿰c;使用 pip install unrar 即可࿰c;但是后续出现一些列问题࿰c;接下来为大家一一说明。

安装 unrar 之后࿰c;直接使用会出现找不到库的 BUG࿰c;需要下载 unrar library࿰c;打开 https://www.rarlab.com/rar_add.htm࿰c;找到 UnRAR.dll 下载即可࿰c;下载之后的文件为 UnRARDLl.exe࿰c;直接按照到默认路径࿰c;例如我本机为 Windows7 64 位操作系统࿰c;安装路径为 C:Program Files (x86)UnrarDLL。

下面要配置环境变量与文件名称。

配置 x64 目录到系统环境变量中；
修改 x64 目录中的 UnRaR64.dll 和 UnRAR64.lib 为全小写字母。

你只认识大众汽车的车标怎么能行？赶紧用python采集所有车标学习一下

上述内置配置完毕࿰c;由于修改了环境变量࿰c;为了让 pycharm 等开发工具加载࿰c;需要重启开发工具。

接下来在 ChebiaoPipeline 类中编写如下代码࿰c;重点注意解压部分相关代码。

import os
from unrar import rarfile


class ChebiaoPipeline:
    def process_item(self, item, spider):
    	# 获取文件完整路径
        path = os.path.join(os.getcwd(), "files", item['file_path'])
        # 解压路径
        extract_path = os.path.join(os.getcwd(), "files")
        # 由于文件设置了【加密文件名】࿰c;所以在加载文件时࿰c;也需要设置密码
        rf = rarfile.RarFile(path, pwd="www.chebiao.net")
        # 通过密码解压文件࿰c;解压目录为当前目录
        rf.extractall(path=extract_path, pwd="www.chebiao.net")
        return item