Would appreciate some optimization tips here.
I have a pandas DataFrame with about 4,500 columns and 50,000 rows.
There are two kinds of columns: ones that start with 'A' and ones that start with 'B'.
Example columns:
[id, A.sellervalue, A.buyervalue, A.ratingvalue, B.sellervalue, B.buyervalue, B.ratingvalue]
I have written the code block below, which compares the values of columns that share the same suffix but have different prefixes (A.sellervalue vs B.sellervalue, A.buyervalue vs B.buyervalue, etc.)
and builds a new DataFrame that identifies each mismatch by row, column name, and the corresponding values on both sides.
This is not very efficient.
I was wondering if there is any optimization I can do here, either with pandas methods or by using PySpark.
Thank You
import pandas as pd

def find_mismatch_common_columns_values(df, A_cols_without_prefix, B_cols_without_prefix):
    set_A_cols_without_prefix = set(A_cols_without_prefix)
    set_B_cols_without_prefix = set(B_cols_without_prefix)
    # suffixes that exist under both the 'A.' and 'B.' prefixes
    common_columns = list(set_A_cols_without_prefix.intersection(set_B_cols_without_prefix))
    print("\nSize of common columns between both:", len(common_columns))
    result_list = []
    for col in common_columns:
        A_col = f'A.{col}'
        B_col = f'B.{col}'
        # row-by-row comparison -- this inner loop is the slow part
        for idx, row in df.iterrows():
            A_value = row[A_col]
            B_value = row[B_col]
            # round floats to 6 decimal places before comparing
            if isinstance(A_value, float):
                A_value = round(float(A_value), 6)
            if isinstance(B_value, float):
                B_value = round(float(B_value), 6)
            # record a mismatch only when both sides are non-null
            if A_value != B_value and (not pd.isnull(A_value) and not pd.isnull(B_value)):
                result_list.append({
                    'row': idx,
                    'column_name': col,
                    'A_value': A_value,
                    'B_value': B_value
                })
    result_df = pd.DataFrame(result_list)
    return result_df
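
For reference, this is the vectorized direction I have been sketching out, in case it helps frame the question. It still loops over the ~2,250 column pairs in Python, but compares all 50,000 rows of each pair at once instead of using iterrows(). The atol=1e-6 tolerance in np.isclose is my stand-in for the round-to-6-decimals logic above, and the sketch assumes each A/B column pair shares a comparable dtype:

import numpy as np
import pandas as pd

def find_mismatches_vectorized(df, A_cols_without_prefix, B_cols_without_prefix):
    # suffixes that exist under both the 'A.' and 'B.' prefixes
    common_columns = sorted(set(A_cols_without_prefix) & set(B_cols_without_prefix))
    print("\nSize of common columns between both:", len(common_columns))

    frames = []
    for col in common_columns:
        a = df[f'A.{col}']
        b = df[f'B.{col}']
        if pd.api.types.is_float_dtype(a) and pd.api.types.is_float_dtype(b):
            # tolerance comparison standing in for the round-to-6-decimals check
            mismatch = ~np.isclose(a, b, atol=1e-6) & a.notna() & b.notna()
        else:
            # exact comparison for non-float pairs, ignoring rows with nulls
            mismatch = (a != b) & a.notna() & b.notna()
        if mismatch.any():
            frames.append(pd.DataFrame({
                'row': df.index[mismatch],
                'column_name': col,
                'A_value': a[mismatch].to_numpy(),
                'B_value': b[mismatch].to_numpy(),
            }))
    if frames:
        return pd.concat(frames, ignore_index=True)
    return pd.DataFrame(columns=['row', 'column_name', 'A_value', 'B_value'])

My understanding is that moving the inner loop from iterrows() to whole-column operations is where most of the speedup would come from, but I would welcome corrections to this sketch or a PySpark equivalent.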