Suppose I have three columns in a pandas DataFrame without any null or empty values.
- Facility always has unique values per item.
- An item can have one or more vendors associated with it.
- The same vendor can show up more than once for different facilities for a given item.
- A facility is never associated with more than one vendor for a given item.
- Item values are ordered.
- Every row is unique.
Example of the data:
Column Vendor | Column Item | Column Facility |
---|---|---|
V1 | I1 | F1 |
V1 | I1 | F2 |
V2 | I1 | F4 |
V1 | I2 | F1 |
V1 | I2 | F2 |
V2 | I2 | F3 |
V3 | I2 | F5 |
V3 | I2 | F6 |
V3 | I3 | F3 |
V1 | I4 | F2 |
V4 | I4 | F4 |
V4 | I4 | F5 |
V1 | I5 | F1 |
V1 | I5 | F4 |
How do I create a DataFrame that contains the following?: “For every item, if there is more than one vendor associated with that item, give me all rows for that item (otherwise exclude all rows associated with that item).”
In this example, I would want a DataFrame of all rows for I1, I2, and I4 because they have more than 1 unique vendor associated with them, and I would exclude all rows involving I3 and I5 as there is only one vendor associated with them.
I tried to create a list of unique vendors for each item loop with a conditional of len(Appended_Vendor_List_var) > 1
, but so far no luck.
1
You can use a groupby.transform('nunique')
and boolean indexing:
out = df[df.groupby('Column Item')['Column Vendor'].transform('nunique').gt(1)]
Output:
Column Vendor Column Item Column Facility
0 V1 I1 F1
1 V1 I1 F2
2 V2 I1 F4
3 V1 I2 F1
4 V1 I2 F2
5 V2 I2 F3
6 V3 I2 F5
7 V3 I2 F6
9 V1 I4 F2
10 V4 I4 F4
11 V4 I4 F5
Intermediate:
Column Vendor Column Item Column Facility nunique nunique>1
0 V1 I1 F1 2 True
1 V1 I1 F2 2 True
2 V2 I1 F4 2 True
3 V1 I2 F1 3 True
4 V1 I2 F2 3 True
5 V2 I2 F3 3 True
6 V3 I2 F5 3 True
7 V3 I2 F6 3 True
8 V3 I3 F3 1 False
9 V1 I4 F2 2 True
10 V4 I4 F4 2 True
11 V4 I4 F5 2 True
12 V1 I5 F1 1 False
13 V1 I5 F4 1 False
Variant using drop_duplicates
and isin
:
drop = (df.drop_duplicates(['Column Vendor', 'Column Item'])['Column Item']
.drop_duplicates(keep=False)
)
out = df[~df['Column Item'].isin(drop)]
1