I am trying to deploy a web scraping app. It works perfectly on my machine, and it deploys to Heroku without problems, but when I visit the webpage and run the app I get an error. The application logs show it’s a 500 Internal Server Error.
I think it is because I am using the Chrome browser without the headless option. Here is my code for that part of the app:
from flask import Flask, render_template, request, redirect, url_for, send_file
import pandas as pd
from splinter import Browser
from bs4 import BeautifulSoup
import time
import datetime
import re
import numpy as np
import io
from selenium.webdriver.chrome.options import Options
import os
import usaddress

app = Flask(__name__)

# Define the expected access password
expected_password = "Real_Scrape-672024"


@app.route('/')
def index():
    return render_template('index.html')


@app.route('/scrape', methods=['POST'])
def scrape():
    # Get form data
    realty_trac_url = request.form['realty_trac_url']
    narrpr_email = request.form['narrpr_email']
    narrpr_password = request.form['narrpr_password']
    file_city_name = request.form['file_city_name']
    access_password = request.form['access_password']

    # Check if access password matches the expected password
    if access_password == expected_password:
        # Proceed with scraping
        # Initialize the browser (you might want to handle browser initialization inside the route function)
        # Set up Chrome options to open in full screen
        chrome_options = Options()
        chrome_options.add_argument("--start-maximized")

        # Start browser session with Chrome options
        browser = Browser('chrome', options=chrome_options)

        # Open Realty Trac URL
        browser.visit(realty_trac_url)
        time.sleep(5)
        html = browser.html
        soup = BeautifulSoup(html, 'html.parser')
If I enable the headless option for Chrome, the app starts working, but it does not scrape the data I need because the HTML the website returns is different in headless mode. It then errors out because no data has been scraped.
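One possible reason for the different HTML, offered as an assumption and not a confirmed diagnosis: a headless session does not get a maximized desktop window, so `--start-maximized` may have no effect, and its default user agent identifies itself as headless, so a site can serve a different layout. A minimal sketch of arguments that could make a headless session look more like the desktop one. The helper name `headless_chrome_args` and its parameters are hypothetical, and `--headless=new` assumes a reasonably recent Chrome build:

```python
def headless_chrome_args(width=1920, height=1080, user_agent=None):
    """Build Chrome command-line arguments for a headless session.

    Hypothetical helper: it only assembles strings, so it can be checked
    without a browser installed.
    """
    args = [
        "--headless=new",                   # newer headless mode (assumes a recent Chrome)
        f"--window-size={width},{height}",  # explicit desktop-sized viewport
        "--no-sandbox",                     # commonly needed inside containers
        "--disable-dev-shm-usage",          # avoid relying on a small /dev/shm
    ]
    if user_agent:
        # Optionally replace the default headless user agent string
        args.append(f"--user-agent={user_agent}")
    return args


# Usage with the code above (not executed here):
# chrome_options = Options()
# for arg in headless_chrome_args():
#     chrome_options.add_argument(arg)
# browser = Browser('chrome', options=chrome_options)
```

If the HTML still differs with a matching window size and user agent, the site is probably detecting automation some other way, and this sketch will not help.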
To try to fix this, I printed out the HTML captured by the headless version of Chrome and reworked the scraping code. However, the headless option is causing more issues, because the HTML changes further depending on which URL I visit for the scraping. The second website I am trying to scrape with the app won’t scrape correctly: the site is JavaScript-based and does not load fully, even with time delays in the code. I did not have these issues when running the scraper without the headless option.
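For JavaScript-heavy pages, a fixed `time.sleep(5)` can be either too short or wastefully long. A common alternative is to poll for a specific condition until a timeout expires. A minimal sketch of that idea in plain Python; `wait_until` is a hypothetical helper, and the `#results` selector in the usage comment is a placeholder, not something taken from either site:

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the
    timeout (in seconds) expires. Returns the last value condition() gave."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            # Timed out: hand back the last (falsy) result to the caller
            return result
        time.sleep(poll_interval)


# Usage with splinter (not executed here; "#results" is a placeholder):
# loaded = wait_until(lambda: browser.is_element_present_by_css("#results"))
# if not loaded:
#     raise RuntimeError("page did not finish rendering in time")
```

Splinter's `is_element_present_by_css` also accepts a `wait_time` argument, which may be enough on its own. The helper is only useful when the condition is something more specific, such as a minimum number of rows being present.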
Is there a workaround that would let me run the app on Heroku without the headless option? If I have to use the headless option, how can I fix the second site not loading fully? The two sites I am referring to are www.realtytrac.com and www.narrpr.com (this is the one that doesn’t load fully in headless mode). Narrpr.com requires a login to access, which I have.
Or would it be better for me to just do the scraping on my end and set up a database for the app to read the data from when it’s live?
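To make that last option concrete, here is a minimal sketch of it, assuming SQLite and a single table. The table name `listings`, its columns, and the function names are all hypothetical. A local script would call `save_listings` after scraping, and the Flask route would only call `listings_for_city`:

```python
import sqlite3


def init_db(conn):
    """Create the (hypothetical) listings table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "address TEXT PRIMARY KEY, "
        "city TEXT NOT NULL, "
        "scraped_at TEXT NOT NULL)"
    )
    conn.commit()


def save_listings(conn, rows):
    """Insert or update scraped rows given as (address, city, scraped_at)."""
    conn.executemany(
        "INSERT OR REPLACE INTO listings (address, city, scraped_at) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def listings_for_city(conn, city):
    """Return all stored rows for one city, ordered by address."""
    cur = conn.execute(
        "SELECT address, city, scraped_at FROM listings "
        "WHERE city = ? ORDER BY address",
        (city,),
    )
    return cur.fetchall()
```

One caveat with this sketch: Heroku dynos have an ephemeral filesystem, so a SQLite file written there would not persist across restarts. On Heroku the same pattern would need a hosted database, with the local scraper writing to it remotely.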