I know there are many other questions about this issue, but their answers seem outdated, or at least they no longer work for me. I tried several approaches, such as proxy rotators, custom proxy lists (which I'd ideally like to avoid), and Tor sessions through Python, but none of them got me past this error:
<p class="text-center text-red-800">
As you were using this website, something about your browser or behaviour made us think you might be a bot.<br/>Solve the captcha below to continue browsing the site.
</p>
Basically, I am using Python to scrape a website that lists properties such as rooms and apartments. I made a lot of requests (which is the point of the script), and now I'm met with the captcha response quoted above.
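To make it clear what I mean by "captcha response", here is a minimal sketch of how the script could tell that page apart from a real listing page. It simply looks for the phrases from the message quoted above; `CAPTCHA_MARKERS` and `looks_like_captcha` are names I made up for this example, and I am assuming the wording of the block page stays the same.

```python
# Phrases copied from the block page quoted above. Matching on them is an
# assumption: the site could change this wording at any time.
CAPTCHA_MARKERS = (
    "made us think you might be a bot",
    "solve the captcha below",
)


def looks_like_captcha(html: str) -> bool:
    """Return True if the page text contains one of the block-page phrases."""
    text = html.lower()
    return any(marker in text for marker in CAPTCHA_MARKERS)


# The block page from the question, joined into one string.
sample = (
    '<p class="text-center text-red-800">As you were using this website, '
    "something about your browser or behaviour made us think you might be a "
    "bot.<br/>Solve the captcha below to continue browsing the site.</p>"
)

print(looks_like_captcha(sample))
print(looks_like_captcha("<html><body>Listing: 2-room apartment</body></html>"))
```

The status code alone does not help here, because the captcha page is what I get back inside the `status_code == 200` branch.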
The relevant part of my code, which initializes the session and makes the requests, is the following:
import requests
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
import random


def init(url):
    global session
    global proxies
    session = requests.Session()
    proxies = [
        'http://35.185.196.38:3128',
        'https://35.185.196.38:3128',
        'http://202.86.138.18:8080',
        'https://202.86.138.18:8080',
        'https://20.206.106.192:80',
        'https://20.210.113.32:80',
        'https://20.206.106.192:8123',
        'https://89.43.31.134:3128',
        'https://88.198.212.91:3128',
        'http://213.217.30.69:3128',
        'https://213.217.30.69:3128',
        'https://204.109.59.194:3121',
        'https://20.111.54.16:8123',
        'https://195.154.184.80:8080',
    ]
    proxy = random.choice(proxies)
    print(f"Using proxy: {proxy}")
    user_agent = UserAgent()
    session.headers.update({'User-Agent': str(user_agent)})
    response = session.get('https://[website_url]/')
    assert response.status_code == 200
    response = session.get('https://[website_url]/cgi-bin/fl/js/verify')
    assert response.status_code == 200
    try:
        response = session.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        return response
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None


def scrape_website(url):
    response = session.get(url)
    if response.status_code == 200:
        print(response.text)  # Here is where I print the response which contains the captcha response.
-- rest of the code --
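Looking at the snippet again, only the last request in `init` passes `proxies=`; the two warm-up requests and the `session.get(url)` in `scrape_website` do not, so I suspect they go out from my own IP. I am also not sure that `str(user_agent)` produces a real User-Agent string (as far as I know, `fake_useragent` exposes it through attributes such as `.random`). Below is a minimal sketch of what I think I actually intended: attach one proxy and one User-Agent to the session itself. `build_session`, its parameters, and the placeholder User-Agent string are hypothetical names for this example, not part of my script, and I have not confirmed that this gets past the captcha.

```python
import random

import requests


def build_session(proxies, user_agent, rng=random):
    """Create a Session that carries one proxy and one User-Agent.

    build_session, proxies, user_agent and rng are illustrative names,
    not part of the original script.
    """
    proxy = rng.choice(proxies)
    session = requests.Session()
    # Session-level proxies are used by later session.get(...) calls that do
    # not pass proxies= themselves. Proxy environment variables, if set, can
    # still take precedence over them.
    session.proxies.update({"http": proxy, "https": proxy})
    session.headers.update({"User-Agent": user_agent})
    return session, proxy


session, proxy = build_session(
    ["http://35.185.196.38:3128", "http://202.86.138.18:8080"],
    user_agent="Mozilla/5.0 (X11; Linux x86_64)",  # placeholder UA string
)
print(f"Using proxy: {proxy}")
```

With this, `scrape_website` could keep calling `session.get(url)` unchanged. No request is sent until `session.get` is called, so building the session does not touch the network.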