I’m trying to scrape Google reviews from a restaurant’s page using Selenium and BeautifulSoup in Python. However, I’m only able to capture the first three reviews, even though I have implemented a scrolling mechanism to load more reviews.
Here is my current code:
import sys
sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import chromedriver_autoinstaller
# Setup Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # Ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Install ChromeDriver
chromedriver_autoinstaller.install()
# Set the target URL
url = "https://www.google.com/search?sa=X&sca_esv=ad6f04eae533dd67&rlz=1C1VDKB_esES1003ES1026&tbm=lcl&sxsrf=ADLYWIJ2x8yhVnxF3mww1MpUgWLVWhx7_A:1718536579993&q=Lamonarracha+Chamber%C3%AD+%7C+Restaurante+japon%C3%A9s+fusi%C3%B3n+Rese%C3%B1as&rflfq=1&num=20&stick=H4sIAAAAAAAAAONgkxI2M7QwN7S0MDCyMDIyMzS2NLQw3cDI-IrRzicxNz8vsagoMTkjUcE5IzE3KbXo8FqFGoWg1OKSxNKixLySVIWsxIL8vMMrixXSSoszD2_OA0mmHt6YWLyIlUIDAKzNu6KcAAAA&rldimm=6187198028226139185&hl=es-ES&ved=2ahUKEwj294jT_9-GAxVigf0HHa1VAj8Q9fQKegQIUBAF&biw=1536&bih=695&dpr=1.25#lkt=LocalPoiReviews"
# Set up the webdriver
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
# Verify we are on the correct page
print("Page title:", driver.title)
# Wait until the reviews tab is clickable and click it
wait = WebDriverWait(driver, 20)
reviews_tab = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a[role="tab"]')))
reviews_tab.click()
# Wait for the reviews section to load
time.sleep(5) # Allow some time for the reviews to load
# Scroll to load more reviews if necessary
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # Allow some time for the reviews to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
# Parse the page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Find the reviews using the provided class
reviews = soup.find_all('div', class_='OA1nbd')
# Extract the text from each review
review_texts = [review.get_text() for review in reviews]
# Store the reviews in a DataFrame
reviews_df = pd.DataFrame(review_texts, columns=['Review'])
# Save the DataFrame to a CSV file in the current working directory
csv_file_path = 'google_reviews.csv'
reviews_df.to_csv(csv_file_path, index=False, encoding='utf-8')
# Display the DataFrame
print(reviews_df)
# Quit the driver
driver.quit()
print(f"Reviews have been saved to {csv_file_path}")
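As a side note, I believe the parsing half of the script is fine: on a static snippet with the same class, find_all picks up every matching div (synthetic HTML below, not real Google markup). The shortfall seems to be in what the page has actually loaded by the time page_source is read:

```python
from bs4 import BeautifulSoup

# Minimal synthetic page using the same class the script targets
html = """
<div class="OA1nbd">Great food</div>
<div class="OA1nbd">Nice staff</div>
<div class="OA1nbd">Too noisy</div>
"""

soup = BeautifulSoup(html, "html.parser")
texts = [d.get_text() for d in soup.find_all("div", class_="OA1nbd")]
print(texts)
```

This prints all three review texts, so the extraction logic itself is not dropping anything.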
Issue:
Despite the scrolling mechanism, the script captures only the first three reviews; no additional reviews are ever loaded, even though the page has many more available.
What I’ve tried:
Scrolling down: I included a loop to scroll down the page multiple times to ensure more reviews load.
Waiting for content to load: Added time.sleep to allow time for the reviews to load after each scroll.
Checking page height: Used the page height to detect if new content is loaded.
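One variant I have seen suggested but have not gotten working is scrolling the reviews panel itself rather than the page body, on the assumption that the reviews live in a nested scrollable div. The loop below mirrors my body-height loop but targets an element; it is shown against a stand-in for driver.execute_script so it can run without a browser:

```python
def scroll_inner_panel(execute_script, panel, max_rounds=10):
    """Scroll a nested scrollable element to its bottom until its
    scrollHeight stops growing (same idea as the body-height loop,
    but aimed at the element that actually scrolls)."""
    last = execute_script("return arguments[0].scrollHeight", panel)
    for _ in range(max_rounds):
        execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", panel)
        new = execute_script("return arguments[0].scrollHeight", panel)
        if new == last:
            break
        last = new
    return last

# Stand-in for driver.execute_script so the loop runs without a browser
class FakePanel:
    def __init__(self):
        self.scrollHeight = 300
        self.scrollTop = 0

def fake_execute_script(js, panel):
    if js.startswith("return"):
        return panel.scrollHeight
    panel.scrollTop = panel.scrollHeight
    if panel.scrollHeight < 900:   # simulate batches of lazy-loaded reviews
        panel.scrollHeight += 300
    return None

final_height = scroll_inner_panel(fake_execute_script, FakePanel())
print(final_height)
```

With a real driver this would be called as scroll_inner_panel(driver.execute_script, panel) after something like panel = driver.find_element(By.CSS_SELECTOR, 'div.review-dialog-list') — that selector is only a guess from similar questions, the right one has to come from inspecting the page, and a time.sleep between the scroll and the re-read would still be needed for lazy loading.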
Observations:
The height of the page does not change after the first scroll, indicating no new content is being loaded.
Only the first three reviews are captured.
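Given that document.body.scrollHeight never changes, I suspect the actual scroller is some inner element. One diagnostic I am considering is to collect each element's scrollHeight/clientHeight via execute_script and keep the ones whose content overflows their visible box. The browser side needs a live driver, but the filtering step can be shown standalone (the sample numbers below are made up):

```python
def overflowing(elements):
    """Keep (selector, scrollHeight, clientHeight) triples whose content
    is taller than their visible box, i.e. elements that can scroll."""
    return [e for e in elements if e[1] > e[2]]

# Hypothetical measurements, as execute_script might report them
sample = [
    ("body", 695, 695),        # viewport-sized: nothing left to scroll
    ("div.panel", 4200, 600),  # tall content in a short box: the real scroller
    ("span.label", 18, 18),
]
candidates = overflowing(sample)
print(candidates)
```

Whatever element survives that filter is the one the scroll loop should target instead of the body.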
Request:
Can someone help me identify what I might be doing wrong or suggest a different approach to ensure all reviews are captured?
Thank you in advance!