Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


web scraping - Python Crawl - Amazon review crawl with BeautifulSoup

I'm trying to crawl review data from Amazon in a Jupyter notebook.

But the server returns a 503 response.

Does anyone know what's wrong?

Here is the URL: https://www.amazon.com/Apple-MWP22AM-A-AirPods-Pro/product-reviews/B07ZPC9QD4/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=

Here is my code:

import re, requests, csv 
from bs4 import BeautifulSoup 
from time import sleep

def reviews_info(div):
    """Extract the text, author, star rating, and date from one review div."""
    review_text = div.find("div", "a-row a-spacing-small review-data").get_text()
    review_author = div.find("span", "a-profile-name").get_text()
    review_stars = div.find("span", "a-icon-alt").get_text()
    on_review_date = div.find('span', 'a-size-base a-color-secondary review-date').get_text()
    # Drop the leading "on " and split e.g. "October 30, 2019" into parts
    review_date = [x.strip() for x in re.sub("on ", "", on_review_date).split(",")]

    return { "review_text" : review_text,
            "review_author" : review_author,
            "review_stars" : review_stars,
            "review_date": review_date }

base_url = 'https://www.amazon.com/Apple-MWP22AM-A-AirPods-Pro/product-reviews/B07ZPC9QD4/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber='


reviews = [] 

NUM_PAGES = 8

for page_num in range(1, NUM_PAGES + 1):
    print("souping page", page_num, ",", len(reviews), "data collected")
    url = base_url + str(page_num)
    soup = BeautifulSoup(requests.get(url).text, 'lxml')

    for div in soup('div', 'a-section review'):
        reviews.append(reviews_info(div))

    sleep(30)  # pause between pages to avoid hammering the server

Finally, I tried

requests.get(url)

The output is

<Response [503]>

And I also tried

requests.get(url).text()

The output is

TypeError: 'str' object is not callable
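That second error is separate from the 503: `requests.Response.text` is a string attribute, not a method, so the trailing parentheses end up calling a string. A quick demonstration with no network involved:

```python
# .text on a requests Response is already a str; adding () calls the string.
page_text = "<html>...</html>"   # stands in for requests.get(url).text
try:
    page_text()                  # same mistake as requests.get(url).text()
except TypeError as err:
    print(err)                   # 'str' object is not callable
```

So `requests.get(url).text` (no parentheses) is the correct form; the 503 itself comes from the server, not from this line.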

Has Amazon blocked my crawler?

I'd appreciate your answer!



1 Answer


Amazon blocks requests to its servers when you attempt to crawl it with Python's requests library. You can try Selenium with a Chromium-based browser instead; because it drives a real browser, the traffic looks like an ordinary visitor's, which might do the trick. The Python bindings for Selenium are documented here: https://selenium-python.readthedocs.io/.
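A minimal sketch of that approach, assuming Chrome and a matching chromedriver are installed. It reuses two of the question's selectors for parsing (those class names come from the question and may change on Amazon's side), and splits parsing into its own function so it can be exercised on static HTML:

```python
from bs4 import BeautifulSoup

def parse_reviews(html):
    """Pull author and star rating out of one page of review HTML,
    using the question's class-name selectors."""
    reviews = []
    for div in BeautifulSoup(html, "html.parser")("div", "a-section review"):
        reviews.append({
            "review_author": div.find("span", "a-profile-name").get_text(),
            "review_stars": div.find("span", "a-icon-alt").get_text(),
        })
    return reviews

def crawl_reviews(num_pages=8):
    """Fetch each review page with a real browser via Selenium, then parse it."""
    from selenium import webdriver  # needs Chrome + matching chromedriver
    base_url = ("https://www.amazon.com/Apple-MWP22AM-A-AirPods-Pro/"
                "product-reviews/B07ZPC9QD4/ref=cm_cr_arp_d_paging_btm_next_2"
                "?ie=UTF8&reviewerType=all_reviews&pageNumber=")
    driver = webdriver.Chrome()
    reviews = []
    try:
        for page_num in range(1, num_pages + 1):
            driver.get(base_url + str(page_num))
            reviews.extend(parse_reviews(driver.page_source))
    finally:
        driver.quit()
    return reviews

# reviews = crawl_reviews(8)  # uncomment to run against Amazon
```

Note this is only a sketch: Selenium is slower than requests, and Amazon may still rate-limit or show a CAPTCHA, so keep the delay between pages.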

