I am trying to extract the URLs from the sitemap at https://www.israelhayom.co.il/sitemaps/wp-sitemap-posts-post-2.xml
I am using the following code:
import requests
from bs4 import BeautifulSoup


def extract_urls(sitemap_url):
    """
    Extracts URLs from a given sitemap.

    Sends a GET request to the provided sitemap URL, parses the response content using BeautifulSoup,
    and extracts all URLs from the sitemap. The URLs can be in XML or HTML format.
    """
    urls = []
    response = requests.get(sitemap_url)
    # The 'xml' parser requires the lxml package to be installed
    soup = BeautifulSoup(response.content, 'xml')
    # XML sitemap: each <url> entry holds its address in a <loc> tag
    if soup.find_all('url'):
        for url in soup.find_all('loc'):
            urls.append(url.text)
    return urls
When I run this code on my PC with PyCharm, it works without any issue whatsoever, but whenever I run it in a Colab notebook, instead of the full sitemap I get a page with an empty <body> tag:
b'<!DOCTYPE html><html><head><meta charset="utf-8"><script type="text/javascript" src="/kramericaindustries.ac.lib.js"></script><script type="text/javascript">\n;;window.rbzns={"bereshit":"1","seed":"PasTPeWUkqeZMoEaPkmruDvqe3MEFT+2tD5O7sQ9Nd7m\/2SqjSJAKvXovDqJs+XiKOugTjVaoBRqnLPDrP2\/NNNamICsyoBCPOF3sB6l5jI=","location_host":"www.israelhayom.co.il","storage":3,"protocol":"https:"};winsocks();</script></head><body></body></html>\n'
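To make this failure explicit instead of silently ending up with an empty list, a check along the following lines could flag the script-only page before parsing. This is only a sketch: `is_script_only_page` and `_PageSummary` are hypothetical names, it uses the standard library's `html.parser` rather than BeautifulSoup so it has no extra dependencies, and the heuristic (no `<loc>` element, at least one `<script>`, and an empty `<body>`) is an assumption based solely on the output shown above.

```python
from html.parser import HTMLParser


class _PageSummary(HTMLParser):
    """Collects just enough structure to tell a sitemap from a script-only page."""

    def __init__(self):
        super().__init__()
        self.tags = set()              # every tag name seen (lowercased by the parser)
        self.in_body = False
        self.body_has_content = False

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        if self.in_body:
            # Any element nested inside <body> counts as content
            self.body_has_content = True
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Non-whitespace text inside <body> also counts as content
        if self.in_body and data.strip():
            self.body_has_content = True


def is_script_only_page(content: bytes) -> bool:
    """Heuristic (assumption): True if the payload has scripts and an empty body
    and contains no <loc> element, i.e. it does not look like a sitemap."""
    parser = _PageSummary()
    parser.feed(content.decode("utf-8", errors="replace"))
    parser.close()
    if "loc" in parser.tags:
        return False
    return (
        "script" in parser.tags
        and "body" in parser.tags
        and not parser.body_has_content
    )
```

Calling `is_script_only_page(response.content)` right after `requests.get(...)` would at least distinguish "the sitemap has no URLs" from "the server returned a different page in this environment".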
I looked at other threads with similar questions, such as these:
- Why am I getting an empty body tag content when trying to use web scraping using the requests library?
- request.get(url) returns empty content
Although I added a User-Agent and other headers, and even replaced requests with Selenium, the behavior was the same, and I still could not retrieve the page's body when running in a Google Colab notebook.
Code using Selenium:
%pip install -q google-colab-selenium
import google_colab_selenium as gs
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
driver = gs.Chrome(options=chrome_options)
driver.get("https://www.israelhayom.co.il/sitemaps/wp-sitemap-posts-post-2.xml")
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.prettify())
driver.quit()
Result:
<html>
<head>
<meta charset="utf-8"/>
<script src="/kramericaindustries.ac.lib.js" type="text/javascript">
</script>
<script type="text/javascript">
;;window.rbzns={"bereshit":"1","seed":"oqK57LpbY0K6W+60fOevy6HP6dF1CT7VvSvLu3+N2OxP9an2NdC3t0imrvTIjE2y+eO9BH5DzZbT0cye4dkNKdsSdVbFy8NGD2k1XzXFAB3VKzkpdlBHOJHmIErmkZhmK2itj9qDshntSUh1kevRaGhPR7Xl+Y9D2a4y8CqVQJ/lWgzMiutqfeGvuLLhuHl/zrdi1/CG5c+kOu4GI7WCNTzg9LYYugvKCD8g3imH2eE=","location_host":"www.israelhayom.co.il","storage":3,"protocol":"https:"};winsocks();
</script>
</head>
<body>
</body>
</html>
What could be the problem?