I am trying to extract images and their description metadata from the European Space Agency image gallery website:
https://www.esa.int/ESA_Multimedia/Sets/Earth_from_Space_image_collection/(result_type)/images
The high-res images I am trying to extract, along with their descriptions and image credits, are only accessible by clicking on each postage-stamp thumbnail and navigating to the download button.
I have tried to extract the images using beautifulsoup4 and requests in Python, but I can only seem to grab the postage stamp images on a single page of the gallery. Everything else appears to be spread out over multiple pages, obfuscated, or deeply nested.
Any ideas?
Here is my code:
import requests
from bs4 import BeautifulSoup
import os

base_url = "https://www.esa.int"
main_url = "https://www.esa.int/ESA_Multimedia/Sets/Earth_from_Space_image_collection/(result_type)/images"

response = requests.get(main_url)
soup = BeautifulSoup(response.content, 'html.parser')

# Get all the thumbnail links
thumb_links = [base_url + a['href'] for a in soup.find_all('a', class_='fancybox')]

if not os.path.exists('ESA_Full_Images'):
    os.makedirs('ESA_Full_Images')

# Iterate through each thumbnail link to find the full image URL
for idx, link in enumerate(thumb_links):
    img_page_response = requests.get(link)
    img_page_soup = BeautifulSoup(img_page_response.content, 'html.parser')

    # Find the full-size image URL
    full_img_tag = img_page_soup.find('div', class_='image').find('img')
    if full_img_tag:
        full_img_url = full_img_tag['src']
        full_img_url = base_url + full_img_url
        img_data = requests.get(full_img_url).content
        with open(f'ESA_Full_Images/image_{idx + 1}.jpg', 'wb') as handler:
            handler.write(img_data)
    else:
        print(f"Full image not found for thumbnail link: {link}")

print("Full images download completed!")
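For the multi-page issue, my rough plan is to loop over the gallery's pagination links and collect thumbnail URLs from every page before downloading anything. Below is a sketch of what I mean; the rel="next" selector is only a guess, since I have not been able to work out what the gallery's pagination markup actually looks like:

# Rough idea for covering every gallery page, not just the first one.
# NOTE: the rel="next" pagination selector below is an assumption -- I have
# not confirmed what the ESA gallery's "next page" link really looks like.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "https://www.esa.int"
page_url = "https://www.esa.int/ESA_Multimedia/Sets/Earth_from_Space_image_collection/(result_type)/images"

all_thumb_links = []
while page_url:
    soup = BeautifulSoup(requests.get(page_url).content, 'html.parser')
    # Same thumbnail selector as in my script above
    all_thumb_links += [base_url + a['href'] for a in soup.find_all('a', class_='fancybox')]
    # Follow the "next page" link if there is one (selector is a guess)
    next_link = soup.find('a', rel='next')
    page_url = urljoin(base_url, next_link['href']) if next_link else None

print(f"Collected {len(all_thumb_links)} thumbnail links")

Is following pagination links like this the right approach here, or is there a better way to get at the full set of high-res images and their metadata?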