I encountered a strange error: all values in one column become identical after toPandas().
I have a binary classification dataset. The DataFrame schema, with example rows, is:
| id | feature | label |
| xx | xxxx | 0 |
| xx | xxxx | 1 |
The DataFrame is split into positive and negative subsets by label, each subset is processed differently, and the two are then combined with union. However, the values in the label column all become the same (all 1s) after the toPandas() operation. The code looks like this:
from pyspark.sql import Window, functions as F

def sampling(df, k):
    # keep at most k randomly ordered rows per id
    w = Window.partitionBy('id').orderBy(F.rand(1234))
    df = df.withColumn('row', F.row_number().over(w)).filter(F.col('row') <= k).drop('row')
    return df
#separate df into two
pos_df = df.filter(F.col('label')==1)
neg_df = df.filter(F.col('label')==0)
#apply different processing
pos_sample_df = sampling(pos_df, 5)
neg_sample_df = sampling(neg_df, 4)
#union
df = pos_sample_df.union(neg_sample_df)
pdf = df.toPandas()
The label column of pdf is all 1s, while pdf2 (built with the union order reversed) is correct. I cannot understand what happened.
While debugging, I found that if I change the order of the union, it works correctly:
df2 = neg_sample_df.union(pos_sample_df)
pdf2 = df2.toPandas() # works well
If I inspect df directly before calling toPandas(), the label column still contains both 1 and 0 values.