Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


web scraping - Python Crawl - Amazon review crawl with BeautifulSoup

I'm trying to crawl review data from Amazon in a Jupyter notebook.

But the server returns a 503 response.

Does anyone know what's wrong?

Here is the URL: https://www.amazon.com/Apple-MWP22AM-A-AirPods-Pro/product-reviews/B07ZPC9QD4/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=

Here is my code:

import re, requests, csv 
from bs4 import BeautifulSoup 
from time import sleep

def reviews_info(div):
    """Extract the text, author, star rating, and date from one review div."""
    review_text = div.find("div", "a-row a-spacing-small review-data").get_text()
    review_author = div.find("span", "a-profile-name").get_text()
    review_stars = div.find("span", "a-icon-alt").get_text()
    on_review_date = div.find('span', 'a-size-base a-color-secondary review-date').get_text()
    # Drop the leading "on " and split e.g. "October 30, 2019" into parts
    review_date = [x.strip() for x in re.sub("on ", "", on_review_date).split(",")]

    return { "review_text" : review_text,
            "review_author" : review_author,
            "review_stars" : review_stars,
            "review_date": review_date }

base_url = 'https://www.amazon.com/Apple-MWP22AM-A-AirPods-Pro/product-reviews/B07ZPC9QD4/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber='


reviews = [] 

NUM_PAGES = 8

for page_num in range(1, NUM_PAGES + 1):
    print("souping page", page_num, ",", len(reviews), "data collected")
    url = base_url + str(page_num)
    soup = BeautifulSoup(requests.get(url).text, 'lxml')

    for div in soup('div', 'a-section review'):
        reviews.append(reviews_info(div))

    sleep(30)  # pause between pages to avoid hammering the server

Finally, I tried

requests.get(url)

The output is

<Response [503]>

And I also tried

requests.get(url).text()

The output is

TypeError: 'str' object is not callable
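That second error is separate from the 503: `requests.Response.text` is a string attribute, not a method, so the trailing parentheses end up calling a string. A quick demonstration with no network involved:

```python
# .text on a requests Response is already a str; adding () calls the string.
page_text = "<html>...</html>"   # stands in for requests.get(url).text
try:
    page_text()                  # same mistake as requests.get(url).text()
except TypeError as err:
    print(err)                   # 'str' object is not callable
```

So `requests.get(url).text` (no parentheses) is the correct form; the 503 itself comes from the server, not from this line.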

Has Amazon blocked my crawler?

I'd appreciate your answer!



1 Answer


Amazon blocks requests to its servers when you attempt to crawl it with Python's requests library. You can try Selenium with a Chromium-based browser instead; because it drives a real browser, the traffic looks like an ordinary visitor's, which might do the trick. The Python bindings for Selenium are documented here: https://selenium-python.readthedocs.io/.
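A minimal sketch of that approach, assuming Chrome and a matching chromedriver are installed. It reuses two of the question's selectors for parsing (those class names come from the question and may change on Amazon's side), and splits parsing into its own function so it can be exercised on static HTML:

```python
from bs4 import BeautifulSoup

def parse_reviews(html):
    """Pull author and star rating out of one page of review HTML,
    using the question's class-name selectors."""
    reviews = []
    for div in BeautifulSoup(html, "html.parser")("div", "a-section review"):
        reviews.append({
            "review_author": div.find("span", "a-profile-name").get_text(),
            "review_stars": div.find("span", "a-icon-alt").get_text(),
        })
    return reviews

def crawl_reviews(num_pages=8):
    """Fetch each review page with a real browser via Selenium, then parse it."""
    from selenium import webdriver  # needs Chrome + matching chromedriver
    base_url = ("https://www.amazon.com/Apple-MWP22AM-A-AirPods-Pro/"
                "product-reviews/B07ZPC9QD4/ref=cm_cr_arp_d_paging_btm_next_2"
                "?ie=UTF8&reviewerType=all_reviews&pageNumber=")
    driver = webdriver.Chrome()
    reviews = []
    try:
        for page_num in range(1, num_pages + 1):
            driver.get(base_url + str(page_num))
            reviews.extend(parse_reviews(driver.page_source))
    finally:
        driver.quit()
    return reviews

# reviews = crawl_reviews(8)  # uncomment to run against Amazon
```

Note this is only a sketch: Selenium is slower than requests, and Amazon may still rate-limit or show a CAPTCHA, so keep the delay between pages.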

