Welcome to the OStack Knowledge Sharing Community for programmers and developers: Open, Learn and Share
Welcome to ask questions or share your answers with others

0 votes
484 views
in Technique by (71.8m points)

python - Multiprocessing for web scraping won't start on Windows and Mac

I asked a question here about multiprocessing a few days ago, and one user sent me the answer that you can see below. The only problem is that this answer worked on his machine and does not work on mine.

I have tried on Windows (Python 3.6) and on Mac (Python 3.8). I have run the code in the basic Python IDLE that came with the installation, in PyCharm on Windows, and in Jupyter Notebook, and nothing happens. I have 32-bit Python. This is the code:

from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta
from multiprocessing import Pool
import tqdm

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

def parse(url):
    print("im in function")

    response = requests.get(url[4], headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    all_skier_names = soup.find_all("div", class_="g-xs-10 g-sm-9 g-md-4 g-lg-4 justify-left bold align-xs-top")
    all_countries = soup.find_all("span", class_="country__name-short")

    discipline = url[0]
    season = url[1]
    competition = url[2]
    gender = url[3]

    out = []
    for name, country in zip(all_skier_names, all_countries):
        skier_name = name.text.strip().title()
        country = country.text.strip()
        out.append([discipline, season, competition, gender, country, skier_name])


    return out

all_urls = [['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode=']]

with Pool(processes=2) as pool, tqdm.tqdm(total=len(all_urls)) as pbar:
    all_data = []
    print("im in pool")

    for data in pool.imap_unordered(parse, all_urls):
        print("im in data")

        all_data.extend(data)
        pbar.update()

print(all_data) 

The only thing that I see when I run the code is the progress bar, which stays at 0%:

  0%|          | 0/8 [00:00<?, ?it/s]

I put a couple of print statements in the parse(url) function and in the for loop at the end of the code, but the only thing that gets printed is "im in pool". It seems like the code does not enter the function at all and never reaches the for loop at the end.

The code should execute in 5-8 seconds, but I have been waiting for 10 minutes and nothing happens. I have also tried this without the progress bar, but the result is the same.

Do you know what the problem is? Is it the version of Python that I'm using (Python 3.6, 32-bit) or the version of some library? I don't know what to do...

1 Answer

0 votes
by (71.8m points)

A better choice for you would be multithreading, which Python implements in the threading module. Threads run inside a single process, so they avoid the process-spawning step that requires an if __name__ == "__main__": guard on Windows and macOS, which is the usual reason a Pool at module level appears to hang there:

import logging
import threading

# scraper_list and scraper_checker are placeholders from the answerer's own
# script: scraper_checker is the function that scrapes one item, and
# scraper_list holds the items to scrape.

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    threads = list()

    for scraper in scraper_list:
        logging.info("Main    : create and start thread %s.", scraper)
        x = threading.Thread(target=scraper_checker, args=(scraper,))
        threads.append(x)
        x.start()

    for index, thread in enumerate(threads):
        thread.join()
        logging.info("Main    : thread %d done", index)

    # error_file and success_file also come from the answerer's script
    error_file.close()
    success_file.close()

    print("Done!")
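Applied to the scraper in the question, a minimal sketch using concurrent.futures.ThreadPoolExecutor (a higher-level wrapper around threading) could look like the following. To keep the sketch self-contained and runnable without network access, parse() here is a stand-in that only returns the URL's metadata fields; in the real script it would be the question's parse(), which fetches url[4] with requests and extracts names and countries with BeautifulSoup:

```python
# Threaded variant of the question's scraper using ThreadPoolExecutor.
from concurrent.futures import ThreadPoolExecutor

def parse(url):
    # Stand-in for the question's parse(): the real one fetches url[4]
    # and returns a list of [discipline, season, competition, gender,
    # country, skier_name] rows.
    return [[url[0], url[1], url[2], url[3]]]

all_urls = [['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode=']]

if __name__ == "__main__":
    all_data = []
    # Threads share one interpreter, so no __main__ guard is strictly
    # required for them on Windows/macOS, but keeping one is good practice.
    with ThreadPoolExecutor(max_workers=2) as executor:
        # executor.map returns the results in input order
        for data in executor.map(parse, all_urls):
            all_data.extend(data)
    print(all_data)
```

Unlike multiprocessing.Pool, this version also runs unchanged when pasted into IDLE or a Jupyter Notebook, since no child processes need to re-import the script.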
