I am trying to fetch a website. It works in the browser and in curl, but returns 403 in Python (the response comes with a captcha, but that is irrelevant here, as I am trying to find the difference between Python requests and other clients).
Similar questions have been asked before, but I have followed those recommendations and done additional analysis, so hopefully someone from the Python community can help, as this remains unsolved.
This works:
curl -v --http1.1 -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0" https://www.kron4.com/
> GET / HTTP/1.1
> Host: www.kron4.com
> Accept: */*
> User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0
HTTP/2 works too:
curl -v -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0" https://www.kron4.com/
> GET / HTTP/2
> Host: www.kron4.com
> Accept: */*
> User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0
Python requests fails with a 403 even when I match the User-Agent header:
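import requests
# Same URL as in the curl commands above
url = 'https://www.kron4.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0'}
response = requests.get(url, headers=headers)
# Raises requests.exceptions.HTTPError on a 4xx/5xx status
response.raise_for_status()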
>File "/Users/xxxxx/Documents/workspace/NewsFu/myenv/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.kron4.com/
I know requests adds a few default headers such as Accept, but I don't think that is the issue, as the browser adds them too. requests supports only HTTP/1.1, which is why I also tested curl with --http1.1. At this point I think the headers may not be the issue and the website is detecting something else unusual about the Python request. What that is beats me, because the traffic is encrypted; otherwise I could use Wireshark to dissect it further. Any ideas are welcome.
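For reference, here is a minimal sketch of how the headers that requests would actually send can be listed without touching the network, so they can be compared line by line with the curl -v output above. The helper name effective_headers is hypothetical (it is not part of requests), and the sketch only covers HTTP headers; it says nothing about what happens at the TLS layer.

```python
import requests

# Same User-Agent as in the curl commands above.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) "
                  "Gecko/20100101 Firefox/127.0"
}


def effective_headers(url, extra_headers):
    """Return the headers requests would send for a GET, without connecting.

    Hypothetical helper: it merges the session's default headers with the
    ones passed in, exactly as requests.get() would before sending.
    """
    with requests.Session() as session:
        request = requests.Request("GET", url, headers=extra_headers)
        prepared = session.prepare_request(request)
        return dict(prepared.headers)


if __name__ == "__main__":
    for name, value in effective_headers("https://www.kron4.com/", headers).items():
        print(f"{name}: {value}")
```

After a real request, the same information is available from response.request.headers.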