A is a pandas DataFrame with columns x, y1, y2, y3 and y4.
I am interested in studying the {y1,y2,y3,y4} sequence with respect to x.
To find the unique {y1,y2,y3,y4} sequences that occur for each x, I do the following:
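import pandas as pd

B = pd.DataFrame()
for x_temp in A['x'].unique():
    # append the rows of A that belong to this x (the mask must come from A, not B)
    B = pd.concat([B, A[A['x'] == x_temp][['x','y1','y2','y3','y4']]])
# keep one row per distinct (x, y1, y2, y3, y4) combination, in sorted order
B = B.drop_duplicates().sort_values(by=['x','y1','y2','y3','y4'])
del x_temp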
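As a side note, the loop above can probably be condensed into a single expression, because `drop_duplicates` already operates on whole rows. The following is only a sketch: the contents of `A` are made-up sample data, and `cols` is a helper name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real A
A = pd.DataFrame({
    'x':  [1, 1, 1, 2, 2, 2],
    'y1': [0, 0, 5, 0, np.nan, np.nan],
    'y2': [1, 1, 6, 1, 2, 2],
    'y3': [2, 2, 7, 2, 3, 3],
    'y4': [3, 3, 8, 3, np.nan, np.nan],
})

cols = ['x', 'y1', 'y2', 'y3', 'y4']  # helper name, not from the original post

# One row per distinct (x, y1, y2, y3, y4) combination, sorted by all columns.
# drop_duplicates treats NaN values as equal to each other, so rows that
# differ only by "NaN vs NaN" collapse into one.
B = A[cols].drop_duplicates().sort_values(by=cols)
print(B)
```

Note that `drop_duplicates` already treats missing values as matching each other, so this step should not be affected by NaN.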
I want to add a new column called ‘count’ to B, which contains the number of times each unique {y1,y2,y3,y4} sequence occurred for that specific x in A.
B['count'] = A.apply(lambda row: (A['y1'] == row['y1']) & (A['y2'] == row['y2']) & (A['y3'] == row['y3']) & (A['y4'] == row['y4']), axis=1).sum()
This works, but not when A or B has missing values. I want missing values to be treated as a distinct value too.
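One possible direction, sketched under assumptions: since `NaN == NaN` evaluates to `False`, an element-wise `==` comparison never matches rows that contain missing values. Grouping on all five columns with `dropna=False` (available in pandas 1.1 and later) keeps NaN as its own key, and `size()` then gives the number of rows in A for each combination. The sample data and the name `cols` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real A
A = pd.DataFrame({
    'x':  [1, 1, 1, 2, 2, 2],
    'y1': [0, 0, 5, 0, np.nan, np.nan],
    'y2': [1, 1, 6, 1, 2, 2],
    'y3': [2, 2, 7, 2, 3, 3],
    'y4': [3, 3, 8, 3, np.nan, np.nan],
})

cols = ['x', 'y1', 'y2', 'y3', 'y4']  # helper name, not from the original post

# Group by every column; dropna=False keeps groups whose key contains NaN.
# size() counts the rows of A in each group, and reset_index turns the
# group keys back into ordinary columns next to the new 'count' column.
B = (
    A.groupby(cols, dropna=False)
     .size()
     .reset_index(name='count')
)
print(B)
```

This builds the unique combinations and their counts in one step, so the counts are per x by construction. Whether the resulting row order and dtypes match the loop-based B exactly has not been checked here.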