Can't scrape all the books from a section without hard-coding the payload
I've created a script to scrape the names of different books from the "Customers who bought this item also bought" section of such pages. All the related books show up after clicking the right-arrow button. I used two different book links in the script to see how it behaves.
The payload I'm sending in the POST request is hard-coded and only fits the first link in product_links. The payload appears to be available in the page source, but I can't find the right way to grab it automatically. When I use the other book link, several of the ids in the payload may differ, so a hard-coded payload doesn't seem like a good idea.
This is what I've tried:
import requests
from bs4 import BeautifulSoup

product_links = [
    'https://www.amazon.com/Essential-Keto-Diet-Beginners-2019/dp/1099697018/',
    'https://www.amazon.com/Keto-Cookbook-Beginners-Low-Carb-Homemade/dp/B08QFBMSFT/'
]
url = 'https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems'

payload = {"aCarouselOptions":"{\"ajax\":{\"id_list\":[\"{\\\"id\\\":\\\"B07NYZJX2L\\\"}\",\"{\\\"id\\\":\\\"1939754445\\\"}\",\"{\\\"id\\\":\\\"1792145454\\\"}\",\"{\\\"id\\\":\\\"1073560988\\\"}\",\"{\\\"id\\\":\\\"1119578922\\\"}\",\"{\\\"id\\\":\\\"B083K5RRSG\\\"}\",\"{\\\"id\\\":\\\"B07SPSXHZ8\\\"}\",\"{\\\"id\\\":\\\"B08GG2RL1D\\\"}\",\"{\\\"id\\\":\\\"1507212305\\\"}\",\"{\\\"id\\\":\\\"B08QFBMSFT\\\"}\",\"{\\\"id\\\":\\\"164152247X\\\"}\",\"{\\\"id\\\":\\\"1673455980\\\"}\",\"{\\\"id\\\":\\\"B084DD8WHP\\\"}\",\"{\\\"id\\\":\\\"1706342667\\\"}\",\"{\\\"id\\\":\\\"1628603135\\\"}\",\"{\\\"id\\\":\\\"B08NZV2Z4N\\\"}\",\"{\\\"id\\\":\\\"1942411294\\\"}\",\"{\\\"id\\\":\\\"1507209924\\\"}\",\"{\\\"id\\\":\\\"1641520434\\\"}\",\"{\\\"id\\\":\\\"B084Z7627Q\\\"}\",\"{\\\"id\\\":\\\"B08NRXFZ98\\\"}\",\"{\\\"id\\\":\\\"1623159326\\\"}\",\"{\\\"id\\\":\\\"B0827DHLR6\\\"}\",\"{\\\"id\\\":\\\"B08TL5W56Z\\\"}\",\"{\\\"id\\\":\\\"1941169171\\\"}\",\"{\\\"id\\\":\\\"1645670945\\\"}\",\"{\\\"id\\\":\\\"B08GLSSNKF\\\"}\",\"{\\\"id\\\":\\\"B08RR4RJHB\\\"}\",\"{\\\"id\\\":\\\"B07WRQ4CF4\\\"}\",\"{\\\"id\\\":\\\"B08Y49Z3V1\\\"}\",\"{\\\"id\\\":\\\"B08LNX32ZL\\\"}\",\"{\\\"id\\\":\\\"1250621097\\\"}\",\"{\\\"id\\\":\\\"1628600071\\\"}\",\"{\\\"id\\\":\\\"1646115511\\\"}\",\"{\\\"id\\\":\\\"1705799507\\\"}\",\"{\\\"id\\\":\\\"B08XZCM2P4\\\"}\",\"{\\\"id\\\":\\\"1072855267\\\"}\",\"{\\\"id\\\":\\\"B08VCMWPB9\\\"}\",\"{\\\"id\\\":\\\"1623159229\\\"}\",\"{\\\"id\\\":\\\"B08KH2J3FM\\\"}\",\"{\\\"id\\\":\\\"B08D54RBGP\\\"}\",\"{\\\"id\\\":\\\"1507212992\\\"}\",\"{\\\"id\\\":\\\"1635653894\\\"}\",\"{\\\"id\\\":\\\"B01MUB7BUV\\\"}\",\"{\\\"id\\\":\\\"0358120861\\\"}\",\"{\\\"id\\\":\\\"B08FV23D3F\\\"}\",\"{\\\"id\\\":\\\"B08FNMP9YY\\\"}\",\"{\\\"id\\\":\\\"1671590902\\\"}\",\"{\\\"id\\\":\\\"1641527692\\\"}\",\"{\\\"id\\\":\\\"1628603917\\\"}\",\"{\\\"id\\\":\\\"B07ZHPQBVZ\\\"}\",\"{\\\"id\\\":\\\"B08Y49Y63B\\\"}\",\"{\\\"id\\\":\\\"B08T2QRSN3\\\"}\",\"{\\\"id\\\":\\\"1729392164\\\"}\",\"{\\\"id\\\":\\\"B08T46R6XC\\\"}\",\"{\\\"id\\\":\\\"B08RRF5V1D\\\"}\",\"{\\\"id\\\":\\\"1592339727\\\"}\",\"{\\\"id\\\":\\\"1628602929\\\"}\",\"{\\\"id\\\":\\\"1984857088\\\"}\",\"{\\\"id\\\":\\\"0316529583\\\"}\",\"{\\\"id\\\":\\\"1641524820\\\"}\",\"{\\\"id\\\":\\\"1628602635\\\"}\",\"{\\\"id\\\":\\\"B00GRIR87M\\\"}\",\"{\\\"id\\\":\\\"B08FBHN5H7\\\"}\",\"{\\\"id\\\":\\\"B06ZYSS7HS\\\"}\"]},\"autoAdjustHeightFreescroll\":true,\"first_item_flush_left\":false,\"initThreshold\":100,\"loadingThresholdPixels\":100,\"name\":\"p13n-sc-shoveler_n1in5tlbg2h\",\"nextRequestSize\":6,\"set_size\":65}","faceoutspecs":"{}","faceoutkataname":"GeneralFaceout","individuals":"0","language":"en-US","linkparameters":"{\"pd_rd_w\":\"eouzj\",\"pf_rd_p\":\"45451e33-456f-46b5-8f06-aedad504c3d0\",\"pf_rd_r\":\"6Q3MPZHQQ2ESWZND1K8T\",\"pd_rd_r\":\"e5e43c03-d78d-41d3-9064-87af93f9856b\",\"pd_rd_wg\":\"PdhmI\"}","marketplaceId":"ATVPDKIKX0DER","name":"p13n-sc-shoveler_n1in5tlbg2h","offset":"6","reftagprefix":"pd_sim","aDisplayStrategy":"swap","aTransitionStrategy":"swap","aAjaxStrategy":"promise","ids":["{\"id\":\"B07SPSXHZ8\"}","{\"id\":\"B08GG2RL1D\"}","{\"id\":\"1507212305\"}","{\"id\":\"B08QFBMSFT\"}","{\"id\":\"164152247X\"}","{\"id\":\"1673455980\"}","{\"id\":\"B084DD8WHP\"}","{\"id\":\"1706342667\"}","{\"id\":\"1628603135\"}"],"indexes":[6,7,8,9,10,11,12,13,14]}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # for product_link in product_links:
    s.headers['x-amz-acp-params'] = "tok=0DV5j8DDJsH8JQfdVFxJFD3P6AZraMOZTik-kgzNi08;ts=1619674837835;rid=ER1GSMM13VTETPS90K43;d1=251;d2=0;tpm=CGHBD;ref=rtpb"
    res = s.post(url, json=payload)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("li.a-carousel-card-fragment > a.a-link-normal > div[data-rows]"):
        print(item.text)
How can I scrape all the books from the "customers who bought" section without hard-coding the payload?
When you request a product URL, everything needed to fetch the carousel data is already in the initial response.
You need to get the full product HTML, extract the carousel data from it, and reuse parts of that data to build the payload for the subsequent POST requests.
However, getting the product HTML is the hardest part, at least for me, because if you request Amazon too often it will block you or throw a CAPTCHA.
Using a proxy or a VPN helps. Swapping the product URLs sometimes helps too.
To sum up, the key is getting hold of the product HTML. AFAIK, the follow-up requests are easy to make and don't get throttled.
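As a rough sketch of the proxy rotation mentioned above (everything here is an assumption, not part of the original answer: the proxy addresses are placeholders, the retry count is arbitrary, and the CAPTCHA marker is just a string commonly seen on Amazon's robot-check page), the fetch could look like this:

```python
import random
import requests

# Hypothetical proxy pool -- replace with real proxy endpoints.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

def fetch_with_rotation(url: str, headers: dict, retries: int = 3) -> requests.Response:
    """Try the request through a randomly chosen proxy, retrying on
    connection errors or when a CAPTCHA page comes back."""
    for _ in range(retries):
        proxy = random.choice(PROXY_POOL)
        try:
            r = requests.get(
                url,
                headers=headers,
                proxies={"http": proxy, "https": proxy},
                timeout=15,
            )
            # Amazon's robot-check interstitial usually contains this
            # support address; treat its presence as a blocked attempt.
            if r.ok and "api-services-support@amazon.com" not in r.text:
                return r
        except requests.RequestException:
            continue
    raise RuntimeError(f"Could not fetch {url} without being blocked")
```

The same idea works with a `requests.Session` instead of bare `requests.get` if you want to keep cookies across attempts.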
Here is how to get the carousel data and extract the items from it:
import json
import re

import requests
from bs4 import BeautifulSoup


# The chunk is how many carousel items are requested at a time;
# this can vary from 4 to 10 items, as on the web page.
# The second list yielded is used as the "indexes" key in the payload.
def get_idx_and_indexes(carousel_ids: list, chunk: int = 5):
    for index in range(0, len(carousel_ids), chunk):
        tmp = carousel_ids[index:index + chunk]
        # Use positional offsets rather than list.index(), which
        # would misbehave if the same id appeared twice.
        yield tmp, list(range(index, index + len(tmp)))


headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
}

product_url = 'https://www.amazon.de/Rust-Programming-Language-Covers-2018/dp/1718500440/'

# Getting the product HTML, as it carries all the carousel data items
with requests.Session() as session:
    r = session.get("https://www.amazon.com", headers=headers)
    page = session.get(product_url, headers=headers)

    # This is where the carousel data sits, along with everything needed for
    # the following requests e.g. items, acp-params, linkparameters, marketplaceId etc.
    initial_soup = BeautifulSoup(
        re.search(r"<!--CardsClient-->(.*)<input", page.text).group(1),
        "lxml",
    ).find_all("div")

    # Preparing all the details for subsequent requests to the carousel endpoint
    item_ids = json.loads(initial_soup[3]["data-a-carousel-options"])["ajax"]["id_list"]
    payload = {
        "aAjaxStrategy": "promise",
        "aCarouselOptions": initial_soup[3]["data-a-carousel-options"],
        "aDisplayStrategy": "swap",
        "aTransitionStrategy": "swap",
        "faceoutkataname": "GeneralFaceout",
        "faceoutspecs": "{}",
        "individuals": "0",
        "language": "en-US",
        "linkparameters": initial_soup[0]["data-acp-tracking"],
        "marketplaceId": initial_soup[3]["data-marketplaceid"],
        "name": "p13n-sc-shoveler_hgm4oj1hneo",  # this changes but can be ignored
        "offset": "6",
        "reftagprefix": "pd_sim",
    }
    headers.update(
        {
            "x-amz-acp-params": initial_soup[0]["data-acp-params"],
            "x-requested-with": "XMLHttpRequest",
        }
    )

    # Looping through the carousel data and performing the requests
    carousel_endpoint = "https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems"
    for ids, indexes in get_idx_and_indexes(item_ids):
        payload["ids"] = ids
        payload["indexes"] = indexes
        # The actual carousel data
        response = session.post(carousel_endpoint, json=payload, headers=headers)
        carousel = BeautifulSoup(response.text, "lxml").find_all("a")
        print("\n".join(a.getText() for a in carousel))
This should output:
Cracking the Coding Interview: 189 Programming Questions and Solutions
Gayle Laakmann McDowell
4.7 out of 5 stars 4,864
#1 Best Seller in Computer Hacking
$24.00
Container Security: Fundamental Technology Concepts that Protect Containerized Applications
Liz Rice
4.7 out of 5 stars 102
$35.42
Linux Bible
Christopher Negus
4.8 out of 5 stars 245
#1 Best Seller in Linux Servers
$31.99
System Design Interview - An insider's guide, Second Edition
Alex Xu
4.5 out of 5 stars 568
#1 Best Seller in Bioinformatics
$24.99
Ansible for DevOps: Server and configuration management for humans
Jeff Geerling
4.6 out of 5 stars 127
$17.35
Effective C: An Introduction to Professional C Programming
Robert C. Seacord
4.5 out of 5 stars 94
$32.99
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Aurélien Géron
4.8 out of 5 stars 1,954
#1 Best Seller in Computer Neural Networks
$32.93
Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software
Eric Freeman
4.7 out of 5 stars 67
$41.45
Fluent Python: Clear, Concise, and Effective Programming
Luciano Ramalho
4.6 out of 5 stars 523
54 offers from $32.24
TCP/IP Illustrated, Volume 1: The Protocols (Addison-Wesley Professional Computing Series)
4.6 out of 5 stars 199
$63.26
Operating Systems: Three Easy Pieces
4.7 out of 5 stars 224
#1 Best Seller in Computer Operating Systems Theory
$24.61
Software Engineering at Google: Lessons Learned from Programming Over Time
Titus Winters
4.6 out of 5 stars 243
$44.52
and so on ...
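To make the batching behaviour of the `get_idx_and_indexes` helper concrete, here is a self-contained run with made-up ids (the helper is reproduced so the snippet runs on its own; no request is made):

```python
def get_idx_and_indexes(carousel_ids: list, chunk: int = 5):
    # Yield successive slices of the id list together with each slice's
    # positions, mirroring the "ids"/"indexes" pair the endpoint expects.
    for index in range(0, len(carousel_ids), chunk):
        tmp = carousel_ids[index:index + chunk]
        yield tmp, list(range(index, index + len(tmp)))

dummy_ids = [f"ID{n}" for n in range(12)]
for ids, indexes in get_idx_and_indexes(dummy_ids, chunk=5):
    print(ids, indexes)
# Prints three batches of 5, 5 and 2 ids; the first line is:
# ['ID0', 'ID1', 'ID2', 'ID3', 'ID4'] [0, 1, 2, 3, 4]
```

Twelve ids with a chunk of 5 therefore translate into three POST requests, the last one carrying only the two leftover ids.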