I am trying to configure scrapy-splash to work through a proxy, but the request always fails with an HTTP 504 status:
DEBUG:scrapy.core.engine:Crawled (504) <GET https://www.website.com via http://localhost:8050/execute> (referer: None)
INFO:scrapy.spidermiddlewares.httperror:Ignoring response <504 https://www.website.com>: HTTP status code is not handled or not allowed
This is my basic setup:
lua_script = """
function main(splash, args)
    splash:go(args.url)
    local num_scrolls = 3
    local wait_after_scroll = 1.0
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() { return document.body.scrollHeight; }"
    )
    -- scroll to the bottom "num_scrolls" times
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(wait_after_scroll)
    end
    return splash:html()
end
"""
...
def start_requests(self):
    yield SplashRequest(
        URL,
        callback=self.parse,
        endpoint="execute",
        args={
            'wait': 5,
            'lua_source': lua_script,
            'proxy': API_PROXY,
        },
    )
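For context, API_PROXY is the full proxy URL with the credentials embedded, in the [protocol://][user:password@]host[:port] format that the Splash HTTP API documents for the proxy argument (the host and credentials below are placeholders):

API_PROXY = "http://username:password@proxy.server.com:8001"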
I also tried a version where I set the proxy inside the Lua script itself:
lua_script = """
function main(splash, args)
    splash:on_request(function(request)
        -- request:set_proxy takes a table of host/port/credentials
        -- (per the Splash docs), not a URL string; auth goes in the
        -- username/password fields, so no manual Proxy-Authorization header
        request:set_proxy{
            host = "proxy.server.com",
            port = 8001,
            username = "username",
            password = "password",
            type = "HTTP",
        }
    end)
    splash:go(args.url)
    splash:wait(1)  -- wait for the page to load (adjust as needed)
    return splash:html()
end
"""
Yet the result was the same: a 504 error.
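As a sanity check on the proxy itself, I can fetch the target through it with plain requests, outside Splash entirely (same placeholder credentials as above):

import requests

proxies = {"http": API_PROXY, "https": API_PROXY}
print(requests.get("https://www.website.com", proxies=proxies, timeout=30).status_code)

If this direct request succeeds, the credentials should be fine, which would point at how Splash talks to the proxy rather than at the proxy itself.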
I am running this script on my local machine (macOS); the Splash server runs in Docker (docker run -it -p 8050:8050 --rm scrapinghub/splash).
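To narrow down whether the problem is on the Scrapy side or the Splash side, I can also bypass scrapy-splash and POST to the /execute endpoint directly (a minimal sketch reusing the lua_script, URL, and API_PROXY names from above):

import requests

resp = requests.post(
    "http://localhost:8050/execute",
    json={
        "lua_source": lua_script,
        "url": URL,
        "proxy": API_PROXY,  # full proxy URL, credentials included
        "timeout": 90,
    },
)
print(resp.status_code)  # a 504 here too would rule out the Scrapy side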
I found a few similar questions on Stack Overflow, but they have only a handful of answers, and unfortunately none of them helped me get this working.
If I remove the proxy from the request args, the script runs and fetches data, so the proxy setting itself is what breaks it. How do I properly configure scrapy-splash to work with a proxy?