I have been trying to extract transaction records from this website: https://www.house730.com/en-us/deal/?type=rent.
Looking on Stack Overflow, I stumbled on a solution that uses urllib.request + selenium.webdriver to download and render a webpage.
Something like this, in load_data.py:
import os
from urllib.request import urlopen

from selenium import webdriver

url = "https://www.house730.com/en-us/deal/?type=rent"
file_name = os.path.join(os.path.abspath("."), "tmp")

# Download the raw HTML and save it to a local file.
with urlopen(url) as conn:
    data = conn.read()
with open(file_name, "wb") as file:
    file.write(data)

# Open the saved file in Firefox and dump the rendered page source.
browser = webdriver.Firefox()
try:
    browser.get("file:///" + file_name)
    html = browser.page_source
finally:
    browser.quit()
print(html)
However, when I run python load_data.py > tmp.html and open tmp.html, the page seems to crash.
This also happens with the page downloaded by wget:
wget "https://www.house730.com/en-us/deal/?type=rent" -O index.html
However, the two approaches give different HTML. Why?
Result from load_data.py:
https://gist.github.com/pond-nj/5fd51f81441463996ed20a8003981742#file-load_data_tmp-html
Result from wget:
https://gist.github.com/pond-nj/5fd51f81441463996ed20a8003981742#file-wget_index-html
It seems the wget output has already been processed further than the load_data.py output, because the wget output contains a bunch of formatted records (<div class="deal-data" />). Why is this?
Also, this might be a hard question, but what could be the reason the page crashes when the saved HTML is opened?
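To make the difference between the two outputs concrete, something like the sketch below could count the deal-data elements in each saved file. It only uses the standard library. ClassCounter, count_class and count_in_file are names made up for illustration, and the file names in the usage note are the ones from the commands above.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name="deal-data"):
    """Return how many elements in the HTML string carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


def count_in_file(path, class_name="deal-data"):
    """Same as count_class, but reads the HTML from a saved file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return count_class(f.read(), class_name)
```

Comparing count_in_file("tmp.html") with count_in_file("index.html") would show how many records each approach actually captured, which might help narrow down where the two outputs diverge.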