I am working on a binary classification problem with a highly imbalanced dataset (majority class 0: 523152826, minority class 1: 2711142).
I tried the LogisticRegression model from pyspark.ml.classification with the parameter weightCol="classWeightCol", computing the weights like this:
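from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType
# balancingRatio is the fraction of minority (label 1) rows; label 0 gets that small weight, label 1 gets the rest
balancingRatio = final_df2.filter(col('label') == 1.0).count() / final_df2.count()
calculateWeights = udf(lambda x: balancingRatio if x == 0 else 1.0 - balancingRatio, DoubleType())
weightedDataset = final_df2.withColumn("classWeightCol", calculateWeights('label'))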
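
To make the weighting concrete, here is a small plain-Python sketch of the same formula applied to the class counts above. It does not touch Spark at all, and the function name class_weights is only for illustration (it is not part of PySpark).

```python
import math

# Class counts taken from the question.
majority_count = 523152826  # label 0
minority_count = 2711142    # label 1


def class_weights(n_majority, n_minority):
    """Return (weight for label 0, weight for label 1).

    Same formula as the UDF: the majority class gets the minority
    fraction as its weight, and the minority class gets the remainder.
    """
    balancing_ratio = n_minority / (n_majority + n_minority)
    return balancing_ratio, 1.0 - balancing_ratio


w0, w1 = class_weights(majority_count, minority_count)
print(f"weight for label 0: {w0:.6f}")
print(f"weight for label 1: {w1:.6f}")

# With these weights, the total weight of each class is the same.
print(math.isclose(majority_count * w0, minority_count * w1))
```

So the weights come out to roughly 0.005 for label 0 and 0.995 for label 1, which means both classes carry the same total weight.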
However, I get the same accuracy (accuracy = 1) before and after adding the weights. Does anyone know what the issue is? Thank you!