I am working on building a neural network on home-loan repayment data in PySpark. I have done the EDA, data preprocessing, and feature engineering steps, but when it comes to the model I am getting the error:
java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
The code is as follows:
# Imports used below
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# List of columns to cast to numeric -> solves the "String is not supported" error
columns_to_cast = ['AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
                   'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
for col_name in columns_to_cast:
    train_df = train_df.withColumn(col_name, col(col_name).cast("double"))
    test_df = test_df.withColumn(col_name, col(col_name).cast("double"))
print("Casting complete. Checking schema:")
train_df.printSchema()
print("Assembling feature vector...")
# Assemble feature vector
ohe_columns = [c + '_ohe' for c in categorical_columns]
numerical_columns = [c for c in train_df.columns if c not in ohe_columns + ['SK_ID_CURR', 'TARGET']]
print(f"Numerical columns: {numerical_columns}")
print(f"One-hot encoded columns: {ohe_columns}")
assembler = VectorAssembler(inputCols=numerical_columns + ohe_columns, outputCol="features")
train_df = assembler.transform(train_df)
test_df = assembler.transform(test_df)
print("Feature vector assembly complete.")
print("Checking assembled feature vector schema:")
train_df.select("features").show(5, truncate=False)
print("Scaling features...")
# Scale the features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaler_model = scaler.fit(train_df)
train_df = scaler_model.transform(train_df)
test_df = scaler_model.transform(test_df)
print("Scaling complete.")
print("Checking scaled features schema:")
train_df.select("scaled_features").show(5, truncate=False)
print("Defining the neural network structure...")
# Define the layers of the neural network
num_features = len(numerical_columns + ohe_columns)
print(f"Number of input features: {num_features}")
layers = [
    num_features,  # number of input features
    64,            # hidden layer size
    32,            # hidden layer size
    2              # number of classes
]
print(f"Neural network layers: {layers}")
# Initialize the Multilayer Perceptron Classifier
mlp = MultilayerPerceptronClassifier(
    featuresCol='scaled_features',
    labelCol='TARGET',
    maxIter=100,
    layers=layers,
    blockSize=128,
    seed=1234
)
print("Training the model...")
# Train the model
mlp_model = mlp.fit(train_df)
print("Model training complete.")
print("Making predictions on the training set...")
# Make predictions on the training set (for evaluation purposes)
train_predictions = mlp_model.transform(train_df)
print("Evaluating the model...")
# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol='TARGET', rawPredictionCol='rawPrediction', metricName='areaUnderROC')
auc_train = evaluator.evaluate(train_predictions)
print(f'Training AUC: {auc_train}')
print("Making predictions on the test set...")
# Make predictions on the test set
test_predictions = mlp_model.transform(test_df)
# Show the predictions
test_predictions.select('SK_ID_CURR', 'prediction', 'probability').show()
print("Preparing the submission file...")
# Prepare the submission file
submission = test_predictions.select('SK_ID_CURR', 'prediction')
submission.show()
# Save the submission to a CSV file
submission.write.csv('./prediction', header=True)
print("Submission file saved.")
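For reference, here is my understanding of how VectorAssembler counts slots, sketched in plain Python (the cardinalities below are made up for illustration, not taken from my real data): a one-hot-encoded column is itself a vector with one slot per category, so it contributes all of its slots to the assembled "features" vector, not just one.

```python
# Hypothetical sketch: why counting columns can differ from the assembled vector width.

num_numeric = 80                         # pretend: 80 plain numeric columns
ohe_cardinalities = [3, 7, 2, 4, 5, 6]   # pretend: slots produced by 6 OHE vector columns

# What my code computes as the input layer size (one per column):
column_count = num_numeric + len(ohe_cardinalities)

# What VectorAssembler would actually produce (one per slot):
assembled_width = num_numeric + sum(ohe_cardinalities)

print(column_count)     # 86
print(assembled_width)  # 107
```

If those two numbers differ on my real data, I suspect layers[0] would not match the width of the actual feature vectors — is that what the dimension-mismatch error is telling me?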
Note that I have added print statements to help me identify where the error occurs; it seems the error is produced during evaluation. The printed output:
Defining the neural network structure…
Number of input features: 86
Neural network layers: [86, 64, 32, 2]
Training the model…
Model training complete.
Making predictions on the training set…
Evaluating the model…
(The error is raised here.)
What could this mean?