If we have a (part of a bigger) dataframe that shows what states individuals (rows) visited in a trip:
df = pd.DataFrame({'states_visited': [['NY', 'CA'], 'CA', 'CA']}, index = ['John', 'Mary', 'Joe'])
states_visited
John [NY, CA]
Mary CA
Joe CA
Because the states_visited
column has values of type list
, we can’t use usual pandas methods like:
df['states_visited'].unique()
to get the expected outcome of: [[NY, CA] , CA]
which is what we need for example if we want to know what states were visited together (NY and CA) and which states were visited separately (CA).
Instead, we get TypeError: unhashable type: 'list'
Similarly, we can’t do df['states_visited'].str.contains('NY')
to know who visited NY in his/her trip (it will return NaN
for John). To achieve this, we have to go through the hoop of something like:
df['states_visited'].explode()
.dropna()
.str.contains('NY')
.groupby(level=0).any()
.astype(float)
)
So, if pandas does not like nested data, what is the proper way of having list-like values in a pandas dataframe/series (not simple dictionaries because this example data is a part of a bigger dataframe)?