Using an example DataFrame with NaNs in its third column, let us compare the performance of fillna()
with boolean masking followed by assignment of the value:
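import pandas as pd
import numpy as np
np.random.seed(100)
nrows = 10000000
nnan = 25000
# 10 million rows x 3 columns of uniform floats
df = pd.DataFrame(np.random.uniform(0, 250000, size=(nrows, 3)))
# set NaN at randomly drawn row positions in the third column (label 2)
ind_row = np.random.randint(0, nrows, nnan)
df.loc[ind_row, 2] = np.nan
# (1) fillna() and assign the result back to the column
df1 = df.copy()
%timeit df1[2] = df1[2].fillna(999)
# (2) fillna() only, result discarded
df1 = df.copy()
%timeit df1[2].fillna(999)
# (3) boolean mask + assignment through .loc
df1 = df.copy()
%timeit df1.loc[df1[2].isna(), 2] = 999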
I got the following example timings:
35.1 ms ± 369 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
36.4 ms ± 331 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
6.23 ms ± 65.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and
35.9 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
37.3 ms ± 438 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
6.41 ms ± 38.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Why does manual boolean masking appear to be faster than fillna()?
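One caveat about the benchmark itself: %timeit re-runs the statement many times against the same df1, and both assignment statements modify df1, so after the first pass there are no NaNs left for later passes to fill. The sketch below is one possible way to control for that, assuming the standard-library timeit module is acceptable in place of the %timeit magic. It rebuilds df1 from df in an untimed setup step before every timed call. The helper names make_frame and time_fill_methods are made up for this illustration, and no timings are claimed for it.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(nrows, nnan, seed=100):
    """Build a frame like the one above: uniform floats, NaNs in column 2."""
    rng = np.random.RandomState(seed)
    df = pd.DataFrame(rng.uniform(0, 250000, size=(nrows, 3)))
    df.loc[rng.randint(0, nrows, nnan), 2] = np.nan
    return df


def time_fill_methods(nrows=10_000_000, nnan=25_000, repeat=7):
    """Return the best time in seconds for each approach (hypothetical helper)."""
    df = make_frame(nrows, nnan)
    statements = {
        "fillna_assign": "df1[2] = df1[2].fillna(999)",
        "fillna_only": "df1[2].fillna(999)",
        "mask_assign": "df1.loc[df1[2].isna(), 2] = 999",
    }
    results = {}
    for name, stmt in statements.items():
        # The setup string runs untimed before each repetition, and number=1
        # means each timed call sees a df1 that still contains NaNs.
        times = timeit.repeat(
            stmt,
            setup="df1 = df.copy()",
            repeat=repeat,
            number=1,
            globals={"df": df},
        )
        results[name] = min(times)
    return results
```

Calling time_fill_methods() with its defaults uses the same sizes as above (10,000,000 rows, 25,000 NaN draws); passing smaller arguments is handy for a quick check.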