I have the following data:
data = [{'Shape': 'Circle', 'Color': 'Green'}, {'Shape': 'Circle', 'Color': 'Green'}, {'Shape': 'Circle', 'Color': 'Green'}]
Which I create a DataFrame from:
df = pd.DataFrame(data)
Giving:
>>> df
Shape Color
0 Circle Green
1 Circle Green
2 Circle Green
The data is always received in this form, and I cannot change it.
Now I need to count the Color
column for each Shape
, with the Color
as the index, like this:
Circle Square
Green 3 0
Red 0 0
However, while I know that Shape
can be either Circle
or Square
, they may or may not be present in the data. Likewise, Color
can be either Green
or Red
, but also may or may not be present in the data.
So at the moment my solution is to use:
df2 = pd.DataFrame(
[
{
"Color": "Green",
"Circle": len(np.where((df["Shape"] == "Circle") & (df["Color"] == "Green"))[0]),
"Square": len(np.where((df["Shape"] == "Square") & (df["Color"] == "Green"))[0]),
},
{
"Color": "Red",
"Circle": len(np.where((df["Shape"] == "Circle") & (df["Color"] == "Red"))[0]),
"Square": len(np.where((df["Shape"] == "Square") & (df["Color"] == "Red"))[0]),
},
]
)
df2 = df2.set_index("Color")
df2.index.name = None
Which gives the desired result:
>>> df2
Circle Square
Green 3 0
Red 0 0
But I suspect this is inefficient. Is there a better way of doing this in Pandas directly? I tried a pivot_table, but couldn’t get it to account for possible missing values in the data.