Programming Q&A   Published: 2022-06-01   Source: 大佬教程 (code.js-code.com)

Scrapy: why doesn't my spider follow the next page?

My spider is not crawling horizontally, and I don't know why.

The parse_item function works fine on the first page. I checked the next_page XPath in the Scrapy shell and it is correct.

Could you take a look at my code?

The website I want to scrape is this.

import scrapy
import datetime
import socket

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose
from properties.items import PropertiesItem


class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['www.vivareal.com.br']
    start_urls = ['https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/']

    next_page = '//li[@class="pagination__item"][last()]'

    rules = (
        Rule(LinkExtractor(restrict_xpaths=next_page)),
        Rule(LinkExtractor(allow=r'/imovel/', deny=r'/imoveis-lancamento/'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('url', 'a/@href',)
        l.add_xpath('tipo', '//h1/text()',
                    MapCompose(lambda x: x.strip().split()[0]))
        l.add_xpath('valor', '//h3[@class="price__price-info js-price-sale"]/text()',
                    MapCompose(lambda x: x.strip().replace('R$ ', '').replace('.', ''), float))
        l.add_xpath('condominio', '//span[@class="price__list-value condominium js-condominium"]/text()',
                    MapCompose(float))
        l.add_xpath('endereco', '//p[@class="title__address js-address"]/text()',
                    MapCompose(lambda x: x.split(' - ')[0]))
        l.add_xpath('bairro', '//p[@class="title__address js-address"]/text()',
                    MapCompose(lambda x: x.split(' - ')[1].split(',')[0]))
        l.add_xpath('quartos', '//ul[@class="features"]/li[@title="Quartos"]/span/text()',
                    MapCompose(lambda x: x.strip(), int))
        l.add_xpath('banheiros', '//ul[@class="features"]/li[@title="Banheiros"]/span/text()',
                    MapCompose(int))
        l.add_xpath('vagas', '//ul[@class="features"]/li[@title="Vagas"]/span/text()',
                    MapCompose(int))
        l.add_xpath('area', '//ul[@class="features"]/li[@title="Área"]/span/text()',
                    MapCompose(float))
        l.add_value('url', response.url)

        # Housekeeping fields
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item()

Update

Searching the logs I found this about the horizontal crawling:

2021-02-22 17:09:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 17:09:24 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

It looks like the next page is treated as a duplicate, but I don't know how to fix it.

Also, I noticed that although the href points to #pagina=2, the actual URL is ?pagina=2.

Any hints?
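A sketch of why the fragment link dedupes (my own diagnosis, not from the original thread): Scrapy's request fingerprint drops the URL fragment by default, so every '#pagina=N' link collapses into the bare listing URL and gets filtered as a duplicate. Rewriting the href into the real '?pagina=N' query-string form, as observed above, would give each page a distinct fingerprint:

```python
from urllib.parse import urldefrag

def fix_pagination(url):
    # Rewrite the fragment-style pagination href into a real query string:
    # '.../belo-horizonte/#pagina=2' -> '.../belo-horizonte/?pagina=2'
    return url.replace('#pagina=', '?pagina=')

base = 'https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/'

# Defragmenting (what the fingerprint does by default) collapses page 2
# into the bare listing URL, so the dupefilter drops it:
print(urldefrag(base + '#pagina=2').url == base)   # True
print(fix_pagination(base + '#pagina=2'))          # ...belo-horizonte/?pagina=2
```

If this diagnosis holds, the rewriter could be wired into the rule via LinkExtractor's process_value hook, e.g. Rule(LinkExtractor(restrict_xpaths=next_page, process_value=fix_pagination)); treat that as a sketch, not the thread's accepted fix.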

Solution

Actually, your spider isn't even crawling the first page.

The problem is in the allowed_domains parameter. Change it to

allowed_domains = ['www.vivareal.com.br']

and you will start crawling. After that change you will get plenty of errors (the code throws exceptions because of logic errors, as I can see here), but your code will run as expected.
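To see why a malformed allowed_domains value kills the whole crawl, here is a simplified sketch of the check OffsiteMiddleware performs (my own approximation, not Scrapy's actual code): entries must be bare host names, so a value that contains a scheme or path can never match any request's host, and every request is filtered as offsite.

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Approximation of OffsiteMiddleware: a request is allowed when its
    # host equals an allowed domain or is a subdomain of one.
    host = urlparse(url).netloc
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

url = 'https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/'
print(is_offsite(url, ['www.vivareal.com.br']))          # False: crawled
print(is_offsite(url, ['vivareal.com.br']))              # False: subdomains match too
print(is_offsite(url, ['https://www.vivareal.com.br']))  # True: a scheme never matches a host
```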

Edit (2)

Check the logs:

2021-02-22 13:36:19 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.vivareal.com.br': <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2>

Basically, allowed_domains was not set properly, as described here and in this old question.

Edit: to make this clear: the log I get after running the spider exactly as defined in the question is:


2021-02-22 13:29:18 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: properties)
2021-02-22 13:29:18 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.1 (default, Feb  9 2020, 21:34:32) - [GCC 7.4.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform Linux-4.15.0-135-generic-x86_64-with-glibc2.27
2021-02-22 13:29:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-02-22 13:29:18 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'properties', 'NEWSPIDER_MODULE': 'properties.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['properties.spiders']}
2021-02-22 13:29:18 [scrapy.extensions.telnet] INFO: Telnet password: 3790c3525890efea
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-02-22 13:29:18 [scrapy.core.engine] INFO: Spider opened
2021-02-22 13:29:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-22 13:29:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-22 13:29:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/robots.txt> (referer: None)
2021-02-22 13:29:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/> (referer: None)
2021-02-22 13:29:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.vivareal.com.br': <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2>
2021-02-22 13:29:20 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-22 13:29:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 156997, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'elapsed_time_seconds': 1.87473, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2021, 2, 22, 16, 29, 20, 372722), 'log_count/DEBUG': 3, 'log_count/INFO': 10, 'memusage/max': 54456320, 'memusage/startup': 54456320, 'offsite/domains': 1, 'offsite/filtered': 34, 'request_depth_max': 1, 'response_received_count': 2, 'robotstxt/request_count': 1, 'robotstxt/response_count': 1, 'robotstxt/response_status_count/200': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2021, 2, 22, 16, 29, 18, 497992)}
2021-02-22 13:29:20 [scrapy.core.engine] INFO: Spider closed (finished)

When I run it with the suggested change, the log looks like this (edited so it doesn't show my paths):

2021-02-22 13:31:47 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: properties)
2021-02-22 13:31:47 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0,Platform Linux-4.15.0-135-generic-x86_64-with-glibc2.27
2021-02-22 13:31:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-02-22 13:31:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'properties', 'SPIDER_MODULES': ['properties.spiders']}
2021-02-22 13:31:47 [scrapy.extensions.telnet] INFO: Telnet password: 65a5f31c8dda80fa
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.logstats.LogStats']
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-02-22 13:31:47 [scrapy.core.engine] INFO: Spider opened
2021-02-22 13:31:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2021-02-22 13:31:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-22 13:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/robots.txt> (referer: None)
2021-02-22 13:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/> (referer: None)
2021-02-22 13:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-1-quartos-funcionarios-bairros-belo-horizonte-com-garagem-41m2-venda-RS330000-id-2510414426/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-3-quartos-nova-granada-bairros-belo-horizonte-com-garagem-74m2-venda-RS499000-id-2509923918/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-4-quartos-serra-bairros-belo-horizonte-com-garagem-246m2-venda-RS1950000-id-2510579983/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/casa-3-quartos-sao-geraldo-bairros-belo-horizonte-com-garagem-120m2-venda-RS460000-id-2484383176/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-4-quartos-savassi-bairros-belo-horizonte-com-garagem-206m2-venda-RS1790000-id-2503711314/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-2-quartos-paqueta-bairros-belo-horizonte-com-garagem-60m2-venda-RS260000-id-2479637684/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-3-quartos-savassi-bairros-belo-horizonte-com-garagem-107m2-venda-RS1250000-id-2506122689/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.vivareal.com.br/imovel/apartamento-1-quartos-funcionarios-bairros-belo-horizonte-com-garagem-41m2-venda-RS330000-id-2510414426/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spiders/crawl.py", line 114, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/home/Leomaffei/properties/properties/spiders/spider.py", line 28, in parse_item
    l.add_xpath('url', 'a/@href',)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 350, in add_xpath
    self.add_value(field_name, values, *processors, **kw)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 190, in add_value
    self._add_value(field_name, value)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 208, in _add_value
    processed_value = self._process_input_value(field_name, value)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 312, in _process_input_value
    proc = self.get_input_processor(field_name)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 290, in get_input_processor
    proc = self._get_item_field_attr(
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 308, in _get_item_field_attr
    field_meta = ItemAdapter(self.item).get_field_meta(field_name)
  File "/usr/lib/python3.8/site-packages/itemadapter/adapter.py", line 235, in get_field_meta
    return self.adapter.get_field_meta(field_name)
  File "/usr/lib/python3.8/site-packages/itemadapter/adapter.py", line 161, in get_field_meta
    return MappingProxyType(self.item.fields[field_name])
KeyError: 'url'
2021-02-22 13:31:50 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.vivareal.com.br/imovel/apartamento-3-quartos-nova-granada-bairros-belo-horizonte-com-garagem-74m2-venda-RS499000-id-2509923918/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/scrapy/utils/defer.py", in <genexpr>
    return (r for r in result or () if _filte
...
