Applying set() to a list containing multiple NaN values usually removes the duplicate NaN entries.
Example:
set([np.nan, 5, np.nan, 17, 5, np.nan, 23])
yields:
{5, 17, nan, 23}
However, I now have a list that originates from summing (concatenating) different lists contained in a column of a dataframe; some of these lists contain NaNs. When I apply set() to the concatenated list retrieved from the dataframe, it does not remove the duplicate NaNs. See the screenshot below:
The object a1 from the screenshot is definitely of list type. I’m not sure, but maybe it depends on whether the NaNs were at some point inside a numpy array (or a pandas dataframe, presumably):
set([np.nan, np.nan, np.nan])
yields {nan}
set(np.array([np.nan, np.nan, np.nan]))
yields {nan, nan, nan}
set(list(np.array([np.nan, np.nan, np.nan])))
also yields {nan, nan, nan}
Any ways to avoid this other than np.unique() or pd.unique()? Any reason why set() would behave like this? (I’d assume the expected default would be to remove duplicate NaNs, even if NaN != NaN!)
What you are seeing is due to a fundamental property of NaN values: both Python (in general) and numpy / pandas (in particular) implement IEEE 754 semantics for floating-point numbers.
The not-a-number values float('NaN') and decimal.Decimal('NaN') are special. Any ordered comparison of a number to a not-a-number value is false. A counter-intuitive implication is that not-a-number values are not equal to themselves. For example, if x = float('NaN'), 3 < x, x < 3 and x == x are all false, while x != x is true. This behavior is compliant with IEEE 754.
(Source: 6.10.1. Value comparisons)
For more background, this Q&A explains the rationale for the behavior of NaN equality in IEEE 754:
- What is the rationale for all comparisons returning false for IEEE754 NaN values?
By not removing the “duplicate” NaNs, the set() function is merely being consistent with the Python specification of == when applied to NaN values.
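To make the mechanism concrete: in CPython, set membership checks identity (is) before falling back to ==, which is why repeated references to the same NaN object still collapse to one element, while the distinct np.float64 scalars produced by iterating a numpy array are neither identical nor equal and are all kept. A minimal demonstration (the variable names are just for illustration):
import numpy as np

nan = float('nan')
print(nan == nan)            # False: IEEE 754 says NaN is not equal to itself
print(set([nan, nan, nan]))  # {nan} - the same object three times, so the identity check deduplicates
# Iterating a numpy array produces a fresh np.float64 scalar for each element,
# so the NaNs are neither identical nor equal, and every one of them is kept:
print(set(np.array([np.nan, np.nan, np.nan])))  # {nan, nan, nan}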
It is not clear what the best solution would be for you. We don’t have enough context. It depends on what the NaN values mean. Are they a representation of missing data1 or the result of an erroneous calculation? But I would be very suspicious of any scenario where you are turning a list that might include NaN values into a set … whether or not the “duplicate” NaN values are present in the set.
What I am saying is that the best solution for you might be to check for NaN values in the list and raise an exception! Either way, I don’t think either Python or numpy / pandas offers an easy way to de-dup the NaN values in the set.
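If you do take the “fail fast” route, a minimal sketch of such a check could look like this (to_set_strict is just a hypothetical helper name; it assumes the NaNs are float or np.float64 values):
import math

def to_set_strict(values):
    # Refuse to build a set if any NaN is present in the input.
    if any(isinstance(v, float) and math.isnan(v) for v in values):
        raise ValueError("input contains NaN values")
    return set(values)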
See also: https://github.com/numpy/numpy/issues/9358
1 – Apparently, this is a common use-case for NaN in pandas.
Any ways to avoid this other than np.unique() or pd.unique()?
numpy.nan_to_num might be useful, as
NaN is replaced by zero or by the user defined value in nan keyword
In your case, just provide a value which will NEVER appear in your data to represent nan, e.g. if all your values are non-negative you might use -1.0 to do so:
import numpy as np
arr = np.array([0.0, np.nan, 2.1, np.nan, 5.7, np.nan])
s = set(np.nan_to_num(arr, nan=-1.0))
print(s) # {0.0, 2.1, 5.7, -1.0} where -1.0 denotes nan
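Since the data in the question is a plain Python list rather than an array, it may be worth noting that np.nan_to_num also accepts array-likes (the list is converted to an array internally), so something along these lines should work on the concatenated list as well (a1 here is just a stand-in for the list from the question):
a1 = [0.0, float('nan'), 2.1, float('nan'), 5.7]  # stand-in for the concatenated list
s = set(np.nan_to_num(a1, nan=-1.0))              # the list is converted to an array first
print(s)  # {0.0, 2.1, 5.7, -1.0}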