I am working on building a neural network on home-loan repayment data in PySpark. I have done the EDA, data preprocessing, and feature engineering steps, but when it comes to the model I am getting the error:
java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
The code is as follows:
# Imports used below
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# List of columns to cast to numeric -> solves the "String is not supported" error
columns_to_cast = ['AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
                   'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
for col_name in columns_to_cast:
    train_df = train_df.withColumn(col_name, col(col_name).cast("double"))
    test_df = test_df.withColumn(col_name, col(col_name).cast("double"))
print("Casting complete. Checking schema:")
train_df.printSchema()
print("Assembling feature vector...")
# Assemble feature vector
ohe_columns = [c + '_ohe' for c in categorical_columns]
numerical_columns = [c for c in train_df.columns if c not in ohe_columns + ['SK_ID_CURR', 'TARGET']]
print(f"Numerical columns: {numerical_columns}")
print(f"One-hot encoded columns: {ohe_columns}")
assembler = VectorAssembler(inputCols=numerical_columns + ohe_columns, outputCol="features")
train_df = assembler.transform(train_df)
test_df = assembler.transform(test_df)
print("Feature vector assembly complete.")
print("Checking assembled feature vector schema:")
train_df.select("features").show(5, truncate=False)
print("Scaling features...")
# Scale the features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaler_model = scaler.fit(train_df)
train_df = scaler_model.transform(train_df)
test_df = scaler_model.transform(test_df)
print("Scaling complete.")
print("Checking scaled features schema:")
train_df.select("scaled_features").show(5, truncate=False)
print("Defining the neural network structure...")
# Define the layers of the neural network
num_features = len(numerical_columns + ohe_columns)
print(f"Number of input features: {num_features}")
layers = [
    num_features,  # number of input features
    64,            # hidden layer size
    32,            # hidden layer size
    2              # number of classes
]
print(f"Neural network layers: {layers}")
# Initialize the Multilayer Perceptron Classifier
mlp = MultilayerPerceptronClassifier(
    featuresCol='scaled_features',
    labelCol='TARGET',
    maxIter=100,
    layers=layers,
    blockSize=128,
    seed=1234
)
print("Training the model...")
# Train the model
mlp_model = mlp.fit(train_df)
print("Model training complete.")
print("Making predictions on the training set...")
# Make predictions on the training set (for evaluation purposes)
train_predictions = mlp_model.transform(train_df)
print("Evaluating the model...")
# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol='TARGET', rawPredictionCol='rawPrediction', metricName='areaUnderROC')
auc_train = evaluator.evaluate(train_predictions)
print(f'Training AUC: {auc_train}')
print("Making predictions on the test set...")
# Make predictions on the test set
test_predictions = mlp_model.transform(test_df)
# Show the predictions
test_predictions.select('SK_ID_CURR', 'prediction', 'probability').show()
print("Preparing the submission file...")
# Prepare the submission file
submission = test_predictions.select('SK_ID_CURR', 'prediction')
submission.show()
# Save the submission to a CSV file
submission.write.csv('./prediction', header=True)
print("Submission file saved.")
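For reference, here is my understanding of how VectorAssembler counts slots, sketched in plain Python (the cardinalities below are made up for illustration, not taken from my real data): a one-hot-encoded column is itself a vector with one slot per category, so it contributes all of its slots to the assembled "features" vector, not just one.

```python
# Hypothetical sketch: why counting columns can differ from the assembled vector width.

num_numeric = 80                         # pretend: 80 plain numeric columns
ohe_cardinalities = [3, 7, 2, 4, 5, 6]   # pretend: slots produced by 6 OHE vector columns

# What my code computes as the input layer size (one per column):
column_count = num_numeric + len(ohe_cardinalities)

# What VectorAssembler would actually produce (one per slot):
assembled_width = num_numeric + sum(ohe_cardinalities)

print(column_count)     # 86
print(assembled_width)  # 107
```

If those two numbers differ on my real data, I suspect layers[0] would not match the width of the actual feature vectors — is that what the dimension-mismatch error is telling me?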
Note that I have added print statements to help me identify where the error occurs; it seems the error is produced during evaluation. The printed output:
Defining the neural network structure…
Number of input features: 86
Neural network layers: [86, 64, 32, 2]
Training the model…
Model training complete.
Making predictions on the training set…
Evaluating the model…
(The error is raised here.)
What could this mean?