I’m wrestling with the following:
I have a dataframe with 2 columns of float values that may include NaNs.
For example:
In [5]: df = pd.DataFrame({'vals1': [10,20,25,15,np.nan, 2], 'vals2': [5, 11, 12, np.nan, np.nan, np.nan]})
In [6]: df
Out[6]:
vals1 vals2
0 10.0 5.0
1 20.0 11.0
2 25.0 12.0
3 15.0 NaN
4 NaN NaN
5 2.0 NaN
I would like to create “bins” using vals1 and then plot a bar graph with the value counts for both vals1 and vals2.
The critical point is that I would like to reuse the bins created from vals1, so that the value counts of both columns can be plotted against the same bins, with the NaNs plotted alongside as a separate category/bin.
Without the NaNs I can do this:
In [7]: bins = sorted(pd.cut(df['vals1'], 3).value_counts(dropna=True).index)
In [8]: bins
Out[8]:
[Interval(1.977, 9.667, closed='right'),
Interval(9.667, 17.333, closed='right'),
Interval(17.333, 25.0, closed='right')]
In [9]: pd.cut(df['vals2'], bins=bins)
Out[9]:
0 (1.977, 9.667]
1 (9.667, 17.333]
2 (9.667, 17.333]
3 NaN
4 NaN
5 NaN
Name: vals2, dtype: category
Categories (3, interval[float64, right]): [(1.977, 9.667] < (9.667, 17.333] < (17.333, 25.0]]
In [10]: plt.figure()
plt.bar([str(b) for b in bins], pd.cut(df['vals1'], bins=bins).value_counts().sort_values(), label='vals1', alpha=0.4)
plt.bar([str(b) for b in bins], pd.cut(df['vals2'], bins=bins).value_counts().sort_values(), label='vals2', alpha=0.4)
plt.legend()
This plots the non-NaN values nicely (see below).
Question: Is there a way to add the NaNs as a “category” or “bin” in an out-of-the-box way?
You could modify your approach to use dropna=False in value_counts, and use the resulting Series’ index for the labels instead of the list comprehension on the bins:
s1 = pd.cut(df['vals1'], bins=bins).value_counts(dropna=False).sort_values()
s2 = pd.cut(df['vals2'], bins=bins).value_counts(dropna=False).sort_values()
plt.figure()
plt.bar(s1.index.astype(str), s1, label='vals1', alpha=0.4)
plt.bar(s2.index.astype(str), s2, label='vals2', alpha=0.4)
plt.legend()
Note: I’m not sure why you used .sort_values(). Be aware that the x-axis will then only follow the order of the first Series, since the second one is drawn onto the categories that are already there. Maybe you meant .sort_index(na_position='last')?
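If you want both columns lined up on exactly the same labels, with the NaN bin always last, here is a minimal sketch of one way to do it. It counts via the category codes that pd.cut produces, where -1 marks a missing value. The helper name counts_with_nan and the 'NaN' label are my own choices for illustration, not something pandas provides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'vals1': [10, 20, 25, 15, np.nan, 2],
                   'vals2': [5, 11, 12, np.nan, np.nan, np.nan]})

# Bins are derived from vals1 only and reused for every column.
bins = pd.cut(df['vals1'], 3).cat.categories  # an IntervalIndex

def counts_with_nan(values, bins):
    """Count values per bin and append one extra 'NaN' entry (hypothetical helper)."""
    # Category codes: 0..len(bins)-1 for a bin, -1 for NaN.
    # Values that fall outside all bins also get -1, so they land in 'NaN' too.
    codes = pd.cut(values, bins=bins).cat.codes.to_numpy()
    per_bin = np.bincount(codes[codes >= 0], minlength=len(bins))
    labels = [str(b) for b in bins] + ['NaN']
    return pd.Series(list(per_bin) + [(codes == -1).sum()], index=labels)

# One column of counts per original column, all sharing the same index.
counts = pd.DataFrame({col: counts_with_nan(df[col], bins) for col in df.columns})
print(counts)

# counts.plot.bar(rot=0)  # grouped bars side by side; requires matplotlib
```

Because both columns share one index, the bars cannot end up under the wrong label, and empty bins show up as zero instead of disappearing.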