I am trying to add exception handling to an ML project I am working on. I created a web app that accepts student performance data as a CSV file, trains several machine learning algorithms, and selects and saves the model with the best R2 score. It deletes any existing model, replaces it with the model trained on the new data, and then displays the R2 score to the user. The app works fine with correct data, so I tried to build a way to show an error message to the user when the input data is incorrect. As a test case, I deleted one column entry from one record in the following portion of the CSV file:
"gender","race_ethnicity","parental_level_of_education","lunch","test_preparation_course","math_score","reading_score","writing_score"
"female","group B","bachelor's degree","standard","none","72","72","74"
"female","group C","some college","standard","completed","69","90","88"
"female","group B","master's degree","standard","none","90","95","93"
"male","group A","associate's degree","free/reduced","none","47","57","44"
Here I changed the second entry from "female","group C","some college","standard","completed","69","90","88"
to "female","some college","standard","completed","69","90","88"
to check how the app handles the error. As the log file below shows, the program was still able to create a model and log the R2 score, maybe because I use an imputer to fill in missing values. The issue is that the webpage shows neither the R2 score nor any error. Instead, the site stops working and shows error code 400, while the logs show status code 200 and the terminal shows no error. I am sharing a screenshot of the Network tab of the browser's developer tools in case it helps in figuring out the issue.
Screenshot of crashed web page after submitting incorrect input file
Logs file output:
[2024-07-18 09:26:07,184] _internal.py:97 _log() werkzeug - INFO - WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:8080
* Running on http://172.16.5.4:8080
[2024-07-18 09:26:07,184] _internal.py:97 _log() werkzeug - INFO - Press CTRL+C to quit
[2024-07-18 09:26:27,148] _internal.py:97 _log() werkzeug - INFO - 127.0.0.1 - - [18/Jul/2024 09:26:27] "GET / HTTP/1.1" 200 -
[2024-07-18 09:26:35,995] _internal.py:97 _log() werkzeug - INFO - 127.0.0.1 - - [18/Jul/2024 09:26:35] "GET / HTTP/1.1" 200 -
[2024-07-18 09:26:42,111] _internal.py:97 _log() werkzeug - INFO - 127.0.0.1 - - [18/Jul/2024 09:26:42] "GET /input HTTP/1.1" 200 -
[2024-07-18 09:27:04,034] train_pipeline.py:35 delete_and_recreate_model() root - INFO - Saved new raw data as CSV file
[2024-07-18 09:27:04,034] data_ingestion.py:26 initiate_data_ingestion() root - INFO - Entered data ingestion method or component
[2024-07-18 09:27:04,041] data_ingestion.py:29 initiate_data_ingestion() root - INFO - Read dataset as df
[2024-07-18 09:27:04,047] data_ingestion.py:35 initiate_data_ingestion() root - INFO - Train test split initiating
[2024-07-18 09:27:04,068] data_ingestion.py:42 initiate_data_ingestion() root - INFO - ingestion of data completed
[2024-07-18 09:27:04,071] data_transformation.py:60 initiate_data_transformation() root - INFO - Read train and test data completed
[2024-07-18 09:27:04,074] data_transformation.py:76 initiate_data_transformation() root - INFO - numerical features are Index(['reading_score', 'writing_score'], dtype='object') and categorical features are Index(['gender', 'race_ethenicity', 'parental_level_of_education', 'lunch',
'test_preparation_course'],
dtype='object')
[2024-07-18 09:27:04,074] data_transformation.py:31 get_data_transformer() root - INFO - numerical columns scaling completed
[2024-07-18 09:27:04,074] data_transformation.py:40 get_data_transformer() root - INFO - categorical columns logging completed
[2024-07-18 09:27:04,074] data_transformation.py:81 initiate_data_transformation() root - INFO - applying preprocessing object on train and test df
[2024-07-18 09:27:04,124] model_trainer.py:32 initiate_model_trainer() root - INFO - Split training and test input data
[2024-07-18 09:27:23,565] _internal.py:97 _log() werkzeug - INFO - 127.0.0.1 - - [18/Jul/2024 09:27:23] "GET /input HTTP/1.1" 200 -
[2024-07-18 09:28:06,418] model_trainer.py:99 initiate_model_trainer() root - INFO - Best model found
[2024-07-18 09:28:06,420] train_pipeline.py:49 delete_and_recreate_model() root - INFO - New R2 score is: 0.8803008999935347
[2024-07-18 09:28:06,420] application.py:59 input_data() root - INFO - Processing completed. New R2 score: 0.8803008999935347
[2024-07-18 09:28:06,421] _internal.py:97 _log() werkzeug - INFO - 127.0.0.1 - - [18/Jul/2024 09:28:06] "POST /input HTTP/1.1" 200 -
My application.py code:
from flask import Flask,request,render_template
import numpy as np
import pandas as pd
from src.exception import CustomException
import sys
from src.logger import logging
from sklearn.preprocessing import StandardScaler
from src.pipeline.predict_pipeline import CustomData,PredictPipeline
from src.pipeline.train_pipeline import RetrainWithNewData

application=Flask(__name__)
app=application

@app.route('/input', methods=['GET', 'POST'])
def input_data():
    if request.method == 'GET':
        return render_template('home2.html')
    else:
        try:
            file = request.files['file']
            # Check if file is present
            if file.filename == '':
                return render_template('home2.html', error="No file selected")
            # Check file extension (assuming you want CSV files)
            if not file.filename.lower().endswith('.csv'):
                return render_template('home2.html', error="Invalid file type. Please upload a CSV file.")
            retrainPipeline = RetrainWithNewData(file)
            new_r2_score = retrainPipeline.delete_and_recreate_model()
            logging.info(f"Processing completed. New R2 score: {new_r2_score}")
            return render_template('home2.html', new_r2_score=new_r2_score)
        except Exception as e:
            # For unexpected exceptions, you might want to log them and show a generic message
            logging.error(f"Unexpected error: {str(e)}")
            return render_template('home2.html', error=str(e))

if __name__=="__main__":
    app.run(host="0.0.0.0",port=8080)
My home2.html code:
<html>
<body>
    <form action="{{ url_for('input_data')}}" method="post" enctype="multipart/form-data">
        <h2>Upload Data</h2>
        <input type="file" id="file" name="file" required>
        <br><br>
        <input type="submit" name="upload_submit" value="Upload data in CSV format">
    </form>
    {% if error %}
    <h2 style="color: red;">Error: {{ error }}</h2>
    {% endif %}
    {% if new_r2_score %}
    <h2>The new R2 score is {{ new_r2_score }}</h2>
    {% endif %}
</body>
</html>
My train_pipeline.py code:
import sys
import os
import shutil
from src.exception import CustomException
from src.logger import logging
from src.components.data_ingestion import DataIngestion, DataIngestionConfig
from src.components.data_transformation import DataTransformation, DataTransformationConfig
from src.components.model_trainer import ModelTrainer, ModelTrainerConfig

class RetrainWithNewData():
    def __init__(self, file):
        self.file = file

    def delete_and_recreate_model(self):
        try:
            new_data = self.file
            new_raw_data_path = os.path.join(os.getcwd(), "notebook/data")
            artifacts_path = os.path.join(os.getcwd(), "artifacts")
            # Remove existing directories if they exist
            if os.path.exists(new_raw_data_path):
                shutil.rmtree(new_raw_data_path)
            if os.path.exists(artifacts_path):
                shutil.rmtree(artifacts_path)
            # Create necessary directories
            os.makedirs(new_raw_data_path, exist_ok=True)
            os.makedirs(artifacts_path, exist_ok=True)
            # Save the new data to a specific file path
            new_raw_data_file_path = os.path.join(new_raw_data_path, "stud.csv")
            new_data.save(new_raw_data_file_path)
            logging.info("Saved new raw data as CSV file")
            # Start the data ingestion process
            obj = DataIngestion()
            train_data, test_data = obj.initiate_data_ingestion()
            # Transform the data
            data_transformation = DataTransformation()
            train_arr, test_arr, _ = data_transformation.initiate_data_transformation(train_data, test_data)
            # Train the model
            modelTrainer = ModelTrainer()
            new_r2_score = float(modelTrainer.initiate_model_trainer(train_arr, test_arr))
            logging.info(f"New R2 score is: {new_r2_score}")
            return new_r2_score
        except Exception as e:
            raise CustomException(e, sys)
I would be really grateful for your help.
I expected either the R2 score to be displayed, or an error message telling the user that the input file is incorrect.
Harshit Kedia is a new contributor to this site.
In your Flask application, you always return render_template()
without setting an HTTP status code, so Flask uses the default of 200. That is why the backend log still prints a 200 code even when there is an error.
You could change the error handling like this:
import traceback  # needed for traceback.format_exc() below

try:
    ...
except CustomException as ce:
    # Handle custom exceptions
    logging.error(f"Custom error: {str(ce)}")
    return render_template('home2.html', error=str(ce)), 400
except Exception as e:
    # For unexpected exceptions, log the full traceback
    logging.error(f"Unexpected error: {str(e)}")
    logging.error(traceback.format_exc())
    return render_template('home2.html', error="An unexpected error occurred. Please try again later."), 500
This should solve the problem of having conflicting HTTP codes in the webpage and log.
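To see the effect of the second element in the returned tuple, here is a minimal self-contained sketch that you can run without the ML pipeline. The `fake_training` function and the inline `PAGE` template are hypothetical stand-ins for `delete_and_recreate_model()` and `home2.html`; only the `(body, status)` return form is the point.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Hypothetical inline template standing in for home2.html
PAGE = (
    "{% if error %}<h2>Error: {{ error }}</h2>{% endif %}"
    "{% if new_r2_score %}<h2>The new R2 score is {{ new_r2_score }}</h2>{% endif %}"
)


def fake_training(rows):
    # Hypothetical stand-in for RetrainWithNewData(file).delete_and_recreate_model()
    if rows == "bad":
        raise ValueError("Expected 8 columns, but found 7 columns.")
    return 0.88


@app.route("/input", methods=["POST"])
def input_data():
    try:
        score = fake_training(request.form.get("rows", ""))
        # No explicit status: Flask answers with the default 200
        return render_template_string(PAGE, new_r2_score=score)
    except ValueError as exc:
        # The second tuple element sets the HTTP status code
        return render_template_string(PAGE, error=str(exc)), 400


# Exercise the route with Flask's built-in test client
client = app.test_client()
ok_response = client.post("/input", data={"rows": "good"})
bad_response = client.post("/input", data={"rows": "bad"})
print(ok_response.status_code, bad_response.status_code)
```

With this pattern the status code in the werkzeug log line and the one the browser receives come from the same return statement.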
To validate the input file format, you could do something like this:
import pandas as pd

def validate_csv_format(file_path):
    try:
        # The first line of the file is the header row, so let pandas use it for column names
        df = pd.read_csv(file_path)
        if df.shape[1] != 8:
            raise ValueError(f"Expected 8 columns, but found {df.shape[1]} columns.")
        # Check data types (columns are addressed by position)
        if not df.iloc[:, 0].dtype == 'object':  # Check if first column is string (object)
            raise ValueError("First column should contain string values.")
        if not pd.to_numeric(df.iloc[:, 5], errors='coerce').notnull().all():  # Check if 6th column is numeric
            raise ValueError("6th column should contain numeric values.")
        # Add more specific checks as needed
        return True
    except Exception as e:
        print(f"Error: {str(e)}")
        return False
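One caveat on the column-count check above: when a row has fewer fields than the header, pandas pads the row with NaN instead of raising, so the frame still has 8 columns. The sketch below uses the sample rows from the question to show this, and flags rows with any missing value as one possible extra check. Treating every NaN as an error is an assumption here; it only makes sense if the data is not supposed to contain blanks.

```python
import io

import pandas as pd

# Sample rows from the question, with "group C" removed from the second record
CSV_TEXT = '''"gender","race_ethnicity","parental_level_of_education","lunch","test_preparation_course","math_score","reading_score","writing_score"
"female","group B","bachelor's degree","standard","none","72","72","74"
"female","some college","standard","completed","69","90","88"
"female","group B","master's degree","standard","none","90","95","93"
"male","group A","associate's degree","free/reduced","none","47","57","44"
'''

df = pd.read_csv(io.StringIO(CSV_TEXT))

# The short row is padded, so the column count alone does not reveal the problem
print(df.shape)

# Rows with at least one missing value; the values of the short row have shifted left
incomplete = df[df.isna().any(axis=1)]
print(incomplete)
```

Note that the remaining values in the short row shift one column to the left, so "some college" lands in race_ethnicity and the last column is the one left empty. That would be consistent with the imputer then filling the gap and training succeeding.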
Inside your delete_and_recreate_model()
method, you could check the number of columns of the input file, and you may also define the column names yourself if the format is known and fixed. You could also check the dtypes. If the file fails this validation function, raise an exception or return an error.
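If you want to reject a file as soon as any single record has the wrong number of fields, a per-row count with the standard library `csv` module is one option. This is only a sketch: the function name `find_malformed_rows` and the constant `EXPECTED_COLUMNS` are made up for illustration, and it assumes the CSV text has already been read into a string.

```python
import csv
import io

EXPECTED_COLUMNS = 8  # assumed fixed width of the student performance file


def find_malformed_rows(text, expected=EXPECTED_COLUMNS):
    """Return (line_number, field_count) for each row with the wrong number of fields."""
    malformed = []
    reader = csv.reader(io.StringIO(text))
    for row in reader:
        if not row:
            continue  # ignore completely blank lines
        if len(row) != expected:
            # reader.line_num is the number of source lines read so far
            malformed.append((reader.line_num, len(row)))
    return malformed


sample = (
    '"gender","race_ethnicity","parental_level_of_education","lunch",'
    '"test_preparation_course","math_score","reading_score","writing_score"\n'
    '"female","group B","bachelor\'s degree","standard","none","72","72","74"\n'
    '"female","some college","standard","completed","69","90","88"\n'
)

problems = find_malformed_rows(sample)
if problems:
    line, count = problems[0]
    message = f"Line {line}: expected {EXPECTED_COLUMNS} fields, found {count}."
    print(message)
```

You could raise a ValueError with that message from delete_and_recreate_model() and let the route's except block turn it into a 400 response.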
ivan Fan is a new contributor to this site.