I have followed several online guides in an attempt to build a script that identifies and downloads all PDFs from a website, to save me from doing it manually. Here is my code so far:
from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

# connect to website and get list of all pdfs
url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"
response = request.urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(.pdf)'))

# clean the pdf link names
url_list = []
for el in links:
    url_list.append(("http://www.gatsby.ucl.ac.uk/teaching/courses/" + el['href']))
#print(url_list)

# download the pdfs to a specified location
for url in url_list:
    print(url)
    fullfilename = os.path.join('E:\webscraping', url.replace("http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/", "").replace(".pdf",""))
    print(fullfilename)
    request.urlretrieve(url, fullfilename)
The code appears to find all the PDFs (uncomment print(url_list) to see this). However, it fails at the download stage. In particular, I get the following error, and I can't work out what has gone wrong:
E:\webscraping>python get_pdfs.py
http://www.gatsby.ucl.ac.uk/teaching/courses/http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf
E:\webscraping\http://www.gatsby.ucl.ac.uk/teaching/courses/cribsheet
Traceback (most recent call last):
  File "get_pdfs.py", line 26, in <module>
    request.urlretrieve(url, fullfilename)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Can somebody help me please?
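
The first URL printed above contains the base address twice, which suggests that at least some href values on the page are already absolute, so prepending the base by hand produces an address that does not exist (hence the 404). Below is a minimal sketch of one way around that, assuming this is indeed the cause: resolve each href against the page URL with urllib.parse.urljoin, and take the local file name from the last segment of the URL path. The helper names (absolute_url, local_filename, download_pdfs) are hypothetical and not part of the script above; unlike the original, this sketch keeps the .pdf extension.

```python
import os
from urllib.parse import unquote, urljoin, urlsplit
from urllib.request import urlretrieve

PAGE_URL = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"


def absolute_url(page_url, href):
    """Resolve an href against the page it was found on.

    urljoin returns an already-absolute href unchanged and only
    applies the page's base to relative ones.
    """
    return urljoin(page_url, href)


def local_filename(url, dest_dir):
    """Build a local path from the last segment of the URL's path."""
    name = os.path.basename(unquote(urlsplit(url).path))
    return os.path.join(dest_dir, name)


def download_pdfs(page_url, hrefs, dest_dir):
    """Download each href into dest_dir (hypothetical helper)."""
    os.makedirs(dest_dir, exist_ok=True)
    for href in hrefs:
        url = absolute_url(page_url, href)
        urlretrieve(url, local_filename(url, dest_dir))
```

With the links list from the script above, this could be called as download_pdfs(PAGE_URL, [el['href'] for el in links], r'E:\webscraping'). Separately, the pattern r'(.pdf)' matches any character followed by "pdf"; something like r'\.pdf$' is probably closer to the intent.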