I was trying to implement rolling autocorrelation in polars, but got some weird results when nulls are involved.
The code is pretty simple. Let’s say I have two dataframes, df1 and df2:
import polars as pl
df1 = pl.DataFrame({'a': [1.06, 1.07, 0.93, 0.78, 0.85], 'lag_a': [1., 1.06, 1.07, 0.93, 0.78]})
df2 = pl.DataFrame({'a': [1., 1.06, 1.07, 0.93, 0.78, 0.85], 'lag_a': [None, 1., 1.06, 1.07, 0.93, 0.78]})
You can see that the only difference is that in df2, the first row of lag_a is None, because it’s shifted from a.
When I compute rolling_corr for both dataframes, however, I get different results.
# df1.select(pl.rolling_corr('a', 'lag_a', window_size=10, min_periods=5, ddof=1))
shape: (5, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ null     │
│ null     │
│ null     │
│ null     │
│ 0.622047 │
└──────────┘
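As a cross-check on the last value above, here is a small plain-Python sketch of the Pearson correlation over the five df1 rows. The helper name `pearson` is my own, written by hand only so the snippet runs without dependencies; it is meant to compute the same quantity as `numpy.corrcoef(a, lag_a)[0, 1]`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists without missing values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations; the 1/(n - ddof)
    # factors cancel in the ratio, so ddof does not affect the result.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

a = [1.06, 1.07, 0.93, 0.78, 0.85]
lag_a = [1.0, 1.06, 1.07, 0.93, 0.78]

r_df1 = pearson(a, lag_a)
print(round(r_df1, 6))  # prints 0.622047
```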
# df2.select(pl.rolling_corr('a', 'lag_a', window_size=10, min_periods=5, ddof=1))
shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ f64       │
╞═══════════╡
│ null      │
│ null      │
│ null      │
│ null      │
│ null      │
│ -0.219851 │
└───────────┘
The result from df1, i.e. 0.622047, is what I got from numpy.corrcoef as well. I wonder where the -0.219851 is coming from.
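One guess I could test by hand, sketched below in plain Python: suppose the correlation is assembled from separate statistics (mean of a, mean of lag_a, mean of a * lag_a, and the two variances), and each of them skips nulls on its own. Then the terms for a would use all six rows of df2, while the terms involving lag_a would use only five. I have not checked the polars source, so whether rolling_corr actually works this way is purely an assumption; the helper names `mean` and `var` are mine.

```python
import math

a = [1.0, 1.06, 1.07, 0.93, 0.78, 0.85]
lag_a = [None, 1.0, 1.06, 1.07, 0.93, 0.78]
DDOF = 1

def mean(values):
    return sum(values) / len(values)

def var(values, ddof):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - ddof)

# Assumption: every statistic drops nulls independently of the others,
# so the set of rows differs from one term to the next.
a_valid = [x for x in a if x is not None]          # 6 values
lag_valid = [y for y in lag_a if y is not None]    # 5 values
prod_valid = [x * y for x, y in zip(a, lag_a)
              if x is not None and y is not None]  # 5 values

n = len(prod_valid)  # number of rows where both columns are present
cov = (mean(prod_valid) - mean(a_valid) * mean(lag_valid)) * n / (n - DDOF)
r_guess = cov / math.sqrt(var(a_valid, DDOF) * var(lag_valid, DDOF))
print(round(r_guess, 5))
```

Worked through by hand, this comes out at roughly -0.21985, which is very close to the -0.219851 that df2 produces. If the guess is right, the mismatch would come from mixing a 6-row mean and variance for a with 5-row statistics for lag_a, rather than from the data itself.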