My code seems to follow only the first 5 links it requests, then stops when requesting the 6th. I have tried using both start_urls and next_page_url; either way it only scrapes the first 5 pages.
import scrapy
from scrapy.crawler import CrawlerProcess
import time

class finvizSpider(scrapy.Spider):
    global tickers
    global urlcheck
    urlcheck = 1
    tickers = []
    name = "finviz"
    start_urls = ["https://finviz.com/screener.ashx?v=111&f=cap_small,geo_usa,sh_avgvol_o300,sh_opt_option,sh_short_low&ft=4&o=change"]

    def parse(self, response):
        tickers.append(response.xpath('//a[@class="screener-link-primary"]/text()').extract())
        print(tickers)
        next_page_url = "https://finviz.com/"
        html = response.xpath(
            '//a[@class="screener_arrow"]/@href').extract()[0]
        print(html)
        next_page_url += html
        print(next_page_url)
        if next_page_url is not None:
            yield scrapy.Request(next_page_url, callback=self.parse)

    def returnTickers(self):
        newTickerList = []
        for lists in tickers:
            if lists:
                for t in lists:
                    newTickerList.append(t)
        return newTickerList
The error message is as follows:
Any help is appreciated.
Edit:
I have updated the code, but it still seems to error out.
import scrapy
from scrapy.crawler import CrawlerProcess
import time
from bs4 import BeautifulSoup

class finvizSpider(scrapy.Spider):
    global tickers
    global urlcheck
    urlcheck = 1
    tickers = []
    name = "finviz"
    start_urls = [
        "https://finviz.com/screener.ashx?v=111&f=cap_small,sh_short_low&ft=4&o=-change"]

    def parse(self, url):
        raw_html = scrapy.Request(url)
        good_html = BeautifulSoup(raw_html, 'html.parser')
        first_part = "https://finviz.com/"
        tickers.append([x.text for x in good_html.findAll('a', {'class': 'screener-link-primary'})])
        second_part = good_html.find('a', {'class': 'screener_arrow'})['href']
        # check if there is next page
        if second_part:
            next_url = first_part + second_part
            self.parse(next_url)

    def returnTickers(self):
        newTickerList = []
        for lists in tickers:
            if lists:
                for t in lists:
                    newTickerList.append(t)
        return newTickerList

stock_list = finvizSpider()
process = CrawlerProcess()
process.crawl(finvizSpider)
process.start()
list2 = stock_list.returnTickers()
Running this produces the following error.
It looks like Scrapy only gets through the callback 5 times, so I suggest not using the callback at all; instead, iterate over a list containing all of the links. You can do that with BeautifulSoup, and it is very simple.
pip install beautifulsoup4
from bs4 import BeautifulSoup

def parse(self, url):
    raw_html = scrapy.Request(url)
    good_html = BeautifulSoup(raw_html, 'html.parser')
    first_part = "https://finviz.com/"
    tickers.append([x.text for x in good_html.findAll('a', {'class': 'screener-link-primary'})])
    second_part = good_html.find('a', {'class': 'screener_arrow'})['href']
    # check if there is next page
    if second_part:
        next_url = first_part + second_part
        self.parse(next_url)
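
One caveat with the snippet above: scrapy.Request(url) only constructs a request object and never downloads the page by itself, so the recursive parse call will not actually fetch anything. A self-contained sketch of the same iterate-the-links idea, assuming the requests library does the fetching (collect_tickers and the browser-style User-Agent header are illustrative choices, not part of the original answer):

import requests
from bs4 import BeautifulSoup

def collect_tickers(start_url):
    tickers = []
    url = start_url
    headers = {"User-Agent": "Mozilla/5.0"}  # finviz tends to reject the default UA
    while url:
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.text, "html.parser")
        # collect every ticker link on the current page
        tickers.extend(a.text for a in soup.find_all("a", {"class": "screener-link-primary"}))
        # follow the next-page arrow until it disappears
        arrow = soup.find("a", {"class": "screener_arrow"})
        url = ("https://finviz.com/" + arrow["href"]) if arrow else None
    return tickers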
next_page_url will never be None, so the check if next_page_url is not None: never catches anything; it is html that you need to check for None. When html is None, the line next_page_url += html will give you an error, so you need to check html first. And if nothing matches the XPath, you cannot take [0] of the result, so replace extract()[0] with extract_first() (I used get()).
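
For reference, this is how Scrapy's selector extraction methods behave when the XPath matches nothing (e.g. on the last screener page, where there is no arrow link):

sel = response.xpath('//a[@class="screener_arrow"]/@href')
sel.extract()        # returns [] when nothing matches
sel.extract()[0]     # raises IndexError on the empty list
sel.extract_first()  # returns None instead of raising
sel.get()            # same as extract_first(); get() is the newer spelling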
Here is the fixed code:
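A sketch of the fixed parse method, following the advice above (get() replaces extract()[0], and the None check moves onto html; the class and method names are the asker's own):

    def parse(self, response):
        tickers.append(response.xpath('//a[@class="screener-link-primary"]/text()').extract())
        next_page_url = "https://finviz.com/"
        # get() returns None instead of raising IndexError when the arrow is absent
        html = response.xpath('//a[@class="screener_arrow"]/@href').get()
        if html is not None:
            next_page_url += html
            yield scrapy.Request(next_page_url, callback=self.parse)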