train-4.csv has 200000 rows and roughly 200 columns. Its first column holds the ground truth, either 0 or 1. The aux-n.csv files (n > 0) are meant to store the result of applying the solution found for train-4.csv.
In particular, I am trying to compete in the closed competition here:
https://www.kaggle.com/competitions/santander-customer-transaction-prediction
When I get the optimal bias coefficients and apply them to test.csv, my accuracy is around 1/2, which is no better than random guessing; very displeasing.
Here is my code so far:
import pandas as pd
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression
# Load training data
# Raw string, so that \b and \t in the path are not read as escape sequences
train_data_path = r"G:\My Drive\bak 2024\datasets\Santander Customer Transaction Prediction\train-4.csv"
df = pd.read_csv(train_data_path)
# Extract the first column of df
first_column = df.iloc[:, 0]
# Extract the remaining columns of df
remaining_columns = df.iloc[:, 1:]
def correlation_function(coefficients):
    dot_product_coefficients = remaining_columns.dot(coefficients)
    correlation = np.corrcoef(first_column, dot_product_coefficients)[0, 1]
    return -np.abs(correlation) # Return scalar instead of array
# Generate initial guess for coefficients as random numbers between -1 and 1
np.random.seed(0) # for reproducibility
initial_guess = np.random.uniform(-1, 1, len(remaining_columns.columns))
# Minimize negative absolute correlation to maximize absolute correlation
result = minimize(correlation_function, initial_guess, method="Powell")
# Calculate the dot product of remaining_columns and optimal coefficients
optimal_coefficients = result.x
dot_product_coefficients = remaining_columns.dot(optimal_coefficients)
# Fit linear regression model
model = LinearRegression()
model.fit(dot_product_coefficients.values.reshape(-1, 1), first_column)
# Apply best linear fit function to dot_product_coefficients
pf = model.predict(dot_product_coefficients.values.reshape(-1, 1))
# Apply threshold
t = 0.34
thresholded_values = np.where(pf >= t, 1, 0)
# Calculate accuracy
accuracy_ratio = np.mean(first_column == thresholded_values)
print("Accuracy ratio as a percent:", 100 * accuracy_ratio)
# Raw string, so that \b and \a in the path are not read as escape sequences
output_csv_path = r"G:\My Drive\bak 2024\datasets\Santander Customer Transaction Prediction\aux-11.csv"
thresholded_series = pd.Series(thresholded_values, name='thresholded_values')
thresholded_series.to_csv(output_csv_path, index=False)
print(f"Thresholded values exported to: {output_csv_path}")
With this, aux-11.csv ends up being all 0’s, which is wrong: there should be some 0’s and some 1’s.
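One way to investigate the all-0 output is to compare the range of the regression predictions (pf in the code above) with the threshold t. A least-squares fit on a single weakly correlated feature produces predictions that stay close to the mean of the labels, so if that mean is well below 0.34, no prediction may ever reach the threshold. The following is a minimal sketch on synthetic scores, not on the real data; summarize_threshold is a hypothetical helper introduced only for illustration.

```python
import numpy as np


def summarize_threshold(scores, threshold):
    """Report where the scores sit relative to the threshold (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    return {
        "min": float(scores.min()),
        "max": float(scores.max()),
        "mean": float(scores.mean()),
        # Fraction of rows that would be labelled 1 by np.where(scores >= threshold, 1, 0)
        "fraction_at_or_above": float(np.mean(scores >= threshold)),
    }


# Synthetic stand-in for pf: values clustered tightly around 0.1 (an assumption,
# chosen only to mimic predictions that hug a low label mean).
rng = np.random.default_rng(0)
demo_scores = 0.1 + 0.02 * rng.standard_normal(1000)

summary = summarize_threshold(demo_scores, 0.34)
print(summary)
```

Calling summarize_threshold(pf, t) on the real predictions would show whether any value reaches t at all; if the reported maximum is below t, every row is thresholded to 0.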
What can I do to make it right? TIA
I have tried numerous things. I was expecting a prediction CSV showing the output of the trained model when applied to test.csv.
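The code above only scores the training data; the step of applying the fitted coefficients to test.csv is not shown. As a rough sketch of what that step could look like, assuming the test features have the same columns in the same order as remaining_columns (any identifier column would need to be dropped first): predict_labels is a hypothetical helper, and its slope and intercept arguments stand in for model.coef_[0] and model.intercept_ from the fitted LinearRegression. The demo values are made up.

```python
import numpy as np
import pandas as pd


def predict_labels(features, coefficients, slope, intercept, threshold):
    """Apply fitted coefficients, the 1-D linear fit, and a threshold (hypothetical helper)."""
    # Same dot product as remaining_columns.dot(optimal_coefficients)
    scores = features.to_numpy() @ np.asarray(coefficients, dtype=float)
    # Same mapping as model.predict on a single feature: slope * x + intercept
    fitted = slope * scores + intercept
    return pd.Series((fitted >= threshold).astype(int), name="thresholded_values")


# Made-up stand-in for the test features (two columns instead of roughly 200).
demo_features = pd.DataFrame({"var_0": [0.0, 1.0, 2.0], "var_1": [1.0, 1.0, 1.0]})
demo_predictions = predict_labels(
    demo_features, [1.0, 0.5], slope=0.2, intercept=0.0, threshold=0.34
)
print(demo_predictions.tolist())
```

The returned Series can be written out with to_csv(path, index=False) in the same way as thresholded_series above.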