I am trying to perform some aggregations. While I love polars, there are certain things I am unable to do. Here are my approach and questions for reference.
import polars as pl
import polars.selectors as cs
import numpy as np
data = pl.DataFrame({'x': ['a', 'b', 'a', 'b', 'a', 'a', 'a', 'b', 'a'],
                     'y': [2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'z': [4, np.nan, np.nan, 8, 1, 1, 3, 4, 0],
                     'm': [np.nan, 8, 1, np.nan, 3, 4, 8, 7, 1]})
I have a dataframe like the one above. Here are my questions and corresponding attempts:
- How do I calculate multiple summaries on multiple columns? (I get a duplicate column error; how do I fix this?)
Attempt:
data.group_by('x').agg(pl.all().mean(),
                       pl.all().sum())
- Why does the median come out as a valid value but the mean doesn't?
Possible answer: is it because the median is calculated by sorting and selecting the middle value, and since the central value in this case is not null, the result is valid? (I am not sure if this is the reason.)
print(data.select(pl.col('m').median())) ## line 1
print(data.select(pl.col('m').mean())) ## line 2
- If I replace np.nan with None, the mean calculation works fine on “line 2” in the above code. Why?
- Why doesn't this work? I get a compute error which says: expanding more than one col is not allowed. What does it really mean? Basically, I wanted to filter any rows which have a missing value in either column.
data.filter(pl.col(['z']).is_nan() | pl.col(['m']).is_nan())
- How do I replace NaN in multiple columns in one go? I wrote this code and it works, but it's clunky. Is there a better way?
mean_impute = np.nanmean(data.select(pl.col(['z', 'm'])).to_numpy(), axis=0)
def replace_na(data, colname, i):
    return (data.with_columns(pl.when(pl.col(colname).is_nan())
                                .then(mean_impute[i])
                                .otherwise(pl.col(colname))
                                .alias(colname))
                .select(colname).to_numpy().flatten())
data.with_columns(z=replace_na(data, 'z', 0),
                  m=replace_na(data, 'm', 1))
Thanks for reading the question and answering. I don't want to post a duplicate on SO. I understand the rules, so please let me know if any of these are duplicates and I will gladly delete them. I either couldn't solve some of these or wrote a solution which might not be great. Thanks again!