I am reading the book "The Ultimate Guide to Web Crawling".
The code it uses to run the first HTTP GET request is the following:
import requests
url = "https://scrapethissite.com/pages/simple/"
r = requests.get(url)
print("We got a {} response code from {}".format(r.status_code, url))
I got the error message:
HTTPSConnectionPool(host='scrapethissite.com', port=443): Max retries
exceeded with url: /pages/simple/ (Caused by SSLError(SSLError(1,
'[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1123)')))
I understand that my request doesn't go to the right port. Is it linked to the fact that the website uses HTTPS rather than HTTP? I am not sure, but it seems to be part of the problem.
I am using Python 3.8 on PyCharm. My SSL version is:
OpenSSL 1.1.1g 21 Apr 2020
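For reference, the OpenSSL build that Python itself is linked against (which may differ from the system one) can be printed from the standard `ssl` module. A small sketch:

```python
import ssl

# Version string of the OpenSSL library this Python build is linked against
print(ssl.OPENSSL_VERSION)

# Whether this build supports TLS 1.2 and TLS 1.3
print(ssl.HAS_TLSv1_2, ssl.HAS_TLSv1_3)
```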
I am a beginner in web crawling. This is why I chose to try alternative code for my HTTP GET request, code that lets me select the port and protocol explicitly (Source: https://pythonprogramming.net/python-sockets/):
import socket
import ssl
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()
server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)
request = "GET / HTTP/1.1\r\nHost: "+server+"\r\n\r\n"
s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)
while (len(result) > 0):
    print(result)
    result = s.recv(4096)
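The request string in this code has to use CRLF (`\r\n`) line endings and end with an empty line. Below is a minimal sketch of a helper that builds such a request. `build_get_request` is a hypothetical name, and the `Connection: close` header is my own assumption, added so that the server closes the socket once the response is sent and a `recv` loop can terminate:

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {host}",          # Host header is mandatory in HTTP/1.1
        "Connection: close",      # assumption: ask the server to close when done
        "",                       # the two empty strings produce the final
        "",                       # blank line that ends the header section
    ]
    return "\r\n".join(lines).encode("ascii")


print(build_get_request("scrapethissite.com", "/pages/simple/"))
```

The resulting bytes could be passed to `s.send(...)` in place of `request.encode()`.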
I got the HTTP 200 OK status response so it is working well. I get this output in the PyCharm terminal:
b'HTTP/1.1 200 OK
Date: Tue, 12 Jan 2021 14:59:35 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=d205b0b8e8ce061174412767189bf10b41610463575; expires=Thu, 11-Feb-21 14:59:35 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax
CF-Cache-Status: DYNAMIC
cf-request-id: 0798b515a60000ea04f707d000000001
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Report-To: {"endpoints":[{"url":"https://a.nel.cloudflare.com/report?s=%2FROG7Z2JWZJBMeVNn1IgnJh2TZsqJCi9TJOL3zau98btlLo1nPg4WhGlmOz2SZ6PRep6%2BKZfv0M81fqKOw1l6%2BRbc5M9dErdtyeTsei9Ee%2F2jc0%3D"}],"group":"cf-nel","max_age":604800}
NEL: {"report_to":"cf-nel","max_age":604800}
Server: cloudflare
CF-RAY: 6107be029e27ea04-IAD
1fb5
<!doctype
html>
Scrape This Site | A public sandbox for learning web
scraping
Scrape This Site
Sandbox
Lesson' b's
FAQ
Login
var path =
document.location.pathname;
var tab = undefined;
if (path === "/"){
tab =
document.querySelector("#nav-homepage");
} else if
(path.indexOf("/faq/") === 0){
tab =
document.querySelector("#nav-faq");
} else if
(path.indexOf("/lessons/") === 0){
tab =
document.querySelector("#nav-lessons");
} else if
(path.indexOf("/pages/") === 0) {
tab =
document.querySelector("#nav-sandbox");
} else if
(path.indexOf("/login/") === 0) {
tab = do'
b'cument.querySelector("#nav-login");
}
tab.classList.add("active")
Scrape This Site
The internet's best resource for learning
web
scraping.
Explore Sandbox
Begin Lessons →
Lessons and Videos © Hartley
Bro' b'dy 2018
PNotify.prototype.options.styling = "bootstrap3";
$(function(){
});
$(function () {
$('[data-toggle="tooltip"]').tooltip()
})
$("video").hover(function() {
$(this).prop("controls",
true);
}, function() {
$(this).prop("controls",
false);
});
$("video").click(function() {
if(
this.paused){
this.play();
}
else {
this.pause();
}
});
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new
Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-41551755-8', 'auto');
ga('send',
'pageview');
!function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod?
n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n;
n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0;
t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window,
document,'script','https://connect.facebook.net/en_US/fbevents.js');
fbq('init', '764287443701341');
fbq('track',
"PageView");
/* */
window.dataLayer = window.dataLayer || [];
function
gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'AW-950945448');
'
b'0
'
The only problem is that I want to scrape this page:
https://scrapethissite.com/pages/simple/
and not:
https://scrapethissite.com
When I replace
server = 'scrapethissite.com'
by:
server = 'scrapethissite.com/pages/simple/'
in the previous code, I get this new error message:
socket.gaierror: [Errno 11001] getaddrinfo failed
My understanding is that the problem is linked to the proxy. Knowing that the problem may be linked to the port, socket, proxy, etc., is informative, but I am not sure what to fix in the code or how, since it works fine for one URL but not the other.
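For what it is worth, `getaddrinfo` only resolves hostnames, so a string that also contains a path cannot be looked up. Below is a minimal sketch, using the standard-library `urllib.parse`, of splitting a URL into the hostname (for the socket) and the path (for the GET request line). The variable names are only illustrative:

```python
from urllib.parse import urlsplit

url = "https://scrapethissite.com/pages/simple/"
parts = urlsplit(url)

host = parts.hostname      # what DNS lookup and socket.connect() need
path = parts.path or "/"   # what goes on the GET request line
# Fall back to the scheme's default port when the URL does not name one
port = parts.port or (443 if parts.scheme == "https" else 80)

print(host, port, path)
```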
Any help is highly appreciated. Thank you!
Following OneCricketeer's reply, the code is now:
import socket
import ssl
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
context.verify_mode = ssl.CERT_REQUIRED
context.check_hostname = True
context.load_default_certs()
server = 'scrapethissite.com'
port = 443
server_ip = socket.gethostbyname(server)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = context.wrap_socket(s, server_hostname=server)
request = "GET /pages/simple HTTP/1.1\r\nHost: "+server+"\r\n\r\n"
s.connect((server, port))
s.send(request.encode())
result = s.recv(4096)
while (len(result) > 0):
    print(result)
    result = s.recv(4096)
I now get an HTTP 301 MOVED PERMANENTLY status response:
b'HTTP/1.1 301 MOVED PERMANENTLY
Date: Tue, 12 Jan 2021 15:34:15 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=d6e32136f617c0b90e7f92a3e391c159f1610465655; expires=Thu, 11-Feb-21 15:34:15 GMT; path=/; domain=.scrapethissite.com; HttpOnly; SameSite=Lax
Location: https://scrapethissite.com/pages/simple/
CF-Cache-Status: DYNAMIC
cf-request-id: 0798d4d0d700002550fc1c3000000001
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Report-To: {"endpoints":[{"url":"https://a.nel.cloudflare.com/report?s=2moOTvTDPvS65D6d0LvsiZTLDqYcv8OFZvtunIQDq6H%2FKLucm1LOOlMABcnCUjUO9fK4bwd%2BVDiescQ0NyHbu3DxhTCkOUHTvMcilkM%2BdcZnz3A%3D"}],"group":"cf-nel","max_age":604800}
NEL: {"report_to":"cf-nel","max_age":604800}
Server: cloudflare
CF-RAY: 6107f0c7bb432550-IAD
11f
Redirecting...
Redirecting...
You
should be redirected automatically to target URL: https://scrapethissite.com/pages/simple/.
If not click the link.
' b'0
'
Is there something I missed?
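The `Location` header in the 301 response points to the same path with a trailing slash (`/pages/simple/`), so that is presumably the path to request next. Below is a minimal sketch of pulling the status code and redirect target out of a raw response. `parse_redirect` is a hypothetical helper, and the sample bytes are a shortened version of the response above:

```python
def parse_redirect(raw: bytes):
    """Return (status_code, location) from a raw HTTP response (hypothetical helper)."""
    # Headers end at the first blank line; everything after it is the body
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line looks like: HTTP/1.1 301 MOVED PERMANENTLY
    status_code = int(lines[0].split(" ", 2)[1])

    location = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "location":  # header names are case-insensitive
            location = value.strip()
    return status_code, location


sample = (
    b"HTTP/1.1 301 MOVED PERMANENTLY\r\n"
    b"Location: https://scrapethissite.com/pages/simple/\r\n"
    b"\r\n"
)
print(parse_redirect(sample))
```

A client that receives a 301 or 302 would then repeat the request against the returned location, which is what `requests` does automatically by default.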