Programming Q&A. Published: 2022-06-02. Source site: 大佬教程 (code.js-code.com)

How do you crawl a page that paginates through JavaScript postback data with Python and Scrapy?

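This kind of pagination is not as trivial as it looks, and solving it was an interesting challenge. Here are the important points about the solution: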

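The idea is to follow the pagination page by page, passing the current page number along in the Request.meta dictionary.
Use a regular BaseSpider (named scrapy.Spider in current Scrapy releases), since the pagination involves some custom logic.
It is important to send headers that pretend to be a real browser.
It is important to yield FormRequests with dont_filter=True, since we are basically making POST requests to the same URL with different parameters.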
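The last point can be illustrated with a small sketch that does not need Scrapy. The helper name build_postback_data is hypothetical, and the token values are placeholders; the field names are taken from the spider in this answer (a subset of them). It only shows that successive requests go to the same URL and differ in their form fields.

```python
def build_postback_data(page, viewstate, event_validation):
    """Return the form fields for requesting `page` of the agent list.

    Hypothetical helper: the field names come from the spider in this
    answer; the token values must be taken from the previous response.
    """
    return {
        '__EVENTTARGET': 'ctl00$MainContent$agentList',
        '__EVENTARGUMENT': 'Page$%d' % page,
        '__VIEWSTATE': viewstate,
        '__EVENTVALIDATION': event_validation,
        '__ASYNCPOST': 'true',
    }


# Placeholder tokens; real ones are long opaque strings sent by the server.
first = build_postback_data(2, 'vs-token', 'ev-token')
second = build_postback_data(3, 'vs-token', 'ev-token')

# Collect the fields whose values differ between the two payloads.
changed = sorted(k for k in first if first[k] != second[k])
print(changed)
```

In a real crawl the two tokens would also change from response to response, which is why the spider re-reads them on every page.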

The code:

import re

from scrapy import Spider  # called BaseSpider in old Scrapy releases
from scrapy.http import FormRequest


# Headers that make each request look like a browser-issued partial postback.
HEADERS = {
    'X-MicrosoftAjax': 'Delta=true',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
}
URL = 'http://exitrealty.com/agent_List.aspx?firstname=&lastname=&country=USA&state=NY'


class ExitRealtySpider(Spider):
    name = "exit_realty"

    allowed_domains = ["exitrealty.com"]
    start_urls = [URL]

    def parse(self, response):
        # submit a form (first page): copy every input of the ASP.NET form
        self.data = {}
        for form_input in response.css('form#aspnetForm input'):
            name = form_input.xpath('@name').extract()[0]
            try:
                value = form_input.xpath('@value').extract()[0]
            except IndexError:
                # inputs without a value attribute are sent as empty strings
                value = ""
            self.data[name] = value

        # fields that trigger the postback for the first page
        self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$updatePanel1|ctl00$MainContent$agentList'
        self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
        self.data['__EVENTARGUMENT'] = 'Page$1'

        return FormRequest(url=URL,
                           method='POST',
                           callback=self.parse_page,
                           formdata=self.data,
                           meta={'page': 1},
                           dont_filter=True,
                           headers=HEADERS)

    def parse_page(self, response):
        # the page number travels with the request in meta
        current_page = response.meta['page'] + 1

        # parse agents (TODO: yield items instead of printing)
        for agent in response.xpath('//a[@class="regtext"]/text()'):
            print(agent.extract())
        print("------")

        # request the next page; the tokens are read from the pipe-delimited
        # response text (response.text is a str, so the str patterns match)
        body = response.text
        data = {
            '__EVENTARGUMENT': 'Page$%d' % current_page,
            '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", body, re.MULTILINE).group(1),
            '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", body, re.MULTILINE).group(1),
            '__ASYNCPOST': 'true',
            '__EVENTTARGET': 'ctl00$MainContent$agentList',
            'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$updatePanel1|ctl00$MainContent$agentList',
            '': ''
        }

        return FormRequest(url=URL,
                           method='POST',
                           formdata=data,
                           callback=self.parse_page,
                           meta={'page': current_page},
                           dont_filter=True,
                           headers=HEADERS)
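
The token extraction in parse_page can be tried in isolation. Below is a minimal sketch, assuming the response body is pipe-delimited text in which each hidden field appears as name|value|. The helper name extract_hidden_field and the sample string are made up for illustration (the numeric prefixes in the sample are placeholders, not real lengths); only the regular expression mirrors the one used in the spider.

```python
import re


def extract_hidden_field(body, name):
    """Return the value that follows `name|` in the body, or None if absent.

    Hypothetical helper mirroring the regex used in parse_page.
    """
    match = re.search(re.escape(name) + r"\|(.*?)\|", body)
    return match.group(1) if match else None


# Made-up sample shaped like a partial-postback response body.
SAMPLE = (
    "120|updatePanel|ctl00_MainContent_updatePanel1|"
    "<a class=\"regtext\">Jane Doe</a>|"
    "8|hiddenField|__VIEWSTATE|/wEPDwUK|"
    "8|hiddenField|__EVENTVALIDATION|/wEWAgL+|"
)

viewstate = extract_hidden_field(SAMPLE, '__VIEWSTATE')
validation = extract_hidden_field(SAMPLE, '__EVENTVALIDATION')
# A body without the tokens (for example an error page) yields None.
missing = extract_hidden_field("<html>error page</html>", '__VIEWSTATE')
print(viewstate, validation, missing)
```

In the spider as written, a response without the tokens makes re.search return None and the .group(1) call raise AttributeError. Checking for None first, as this helper does, is one possible way to stop paginating cleanly.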