Python selenium multiprocessing

Question

Welcome To Ask or Share your Answers For Others

Python selenium multiprocessing

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

Python selenium multiprocessing

I've written a script in python in combination with selenium to scrape the links of different posts from its landing page and finally get the title of each post by tracking the url leading to its inner page. Although the content I parsed here are static ones, I used selenium to see how it works in multiprocessing.

However, my intention is to do the scraping using multiprocessing. So far I know that selenium doesn't support multiprocessing but it seems I was wrong.

My question: how can I reduce the execution time using selenium when it is made to run using multiprocessing?

This is my try (it's a working one):

import requests
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver

def get_links(link):
  res = requests.get(link)
  soup = BeautifulSoup(res.text,"lxml")
  titles = [urljoin(url,items.get("href")) for items in soup.select(".summary .question-hyperlink")]
  return titles

def get_title(url):
  chromeOptions = webdriver.ChromeOptions()
  chromeOptions.add_argument("--headless")
  driver = webdriver.Chrome(chrome_options=chromeOptions)
  driver.get(url)
  sauce = BeautifulSoup(driver.page_source,"lxml")
  item = sauce.select_one("h1 a").text
  print(item)

if __name__ == '__main__':
  url = "https://stackoverflow.com/questions/tagged/web-scraping"
  ThreadPool(5).map(get_title,get_links(url))

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:10:23+0000

how can I reduce the execution time using selenium when it is made to run using multiprocessing

A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:

(... skipped for brevity ...)

threadLocal = threading.local()

def get_driver():
  driver = getattr(threadLocal, 'driver', None)
  if driver is None:
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    setattr(threadLocal, 'driver', driver)
  return driver


def get_title(url):
  driver = get_driver()
  driver.get(url)
  (...)

(...)

On my system this reduces the time from 1m7s to just 24.895s, a ~35% improvement. To test yourself, download the full script.

Note: ThreadPool uses threads, which are constrained by the Python GIL. That's ok if for the most part the task is I/O bound. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead. This launches parallel processes which as a group are not constrained by the GIL. The rest of the code stays the same.

Categories

Python selenium multiprocessing

Python selenium multiprocessing

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags