I’m trying to reproduce pandas.cut binning using polars.Series.cut.
Minimal reproducible example of the values and breakpoints:
scores = [1111, 65, 88, -1111, 92]
breaks = [0, 50, 60, 70, 80, 90, 100]
With pandas.cut, the bin is null if the value is outside the defined edges:
import pandas as pd

df = pd.DataFrame({'score': scores})
df['bin'] = pd.cut(df['score'], breaks)
#    score            bin
# 0   1111            NaN  <- null in pandas
# 1     65   (60.0, 70.0]
# 2     88   (80.0, 90.0]
# 3  -1111            NaN  <- null in pandas
# 4     92  (90.0, 100.0]
But with polars.Series.cut, it seems we’re forced to include the inf and -inf bins:
import polars as pl

df = pl.DataFrame({'score': scores})
df = df.with_columns(bin=pl.col('score').cut(breaks))
# shape: (5, 2)
# ┌───────┬────────────┐
# │ score ┆ bin        │
# │ ---   ┆ ---        │
# │ i64   ┆ cat        │
# ╞═══════╪════════════╡
# │ 1111  ┆ (100, inf] │  <- not null in polars
# │ 65    ┆ (60, 70]   │
# │ 88    ┆ (80, 90]   │
# │ -1111 ┆ (-inf, 0]  │  <- not null in polars
# │ 92    ┆ (90, 100]  │
# └───────┴────────────┘
I wasn’t able to set them to np.nan manually either:
df.with_columns(
    bin=pl.when(pl.col('bin').cast(pl.Utf8).str.contains('inf'))
    .then(np.nan)
    .otherwise(pl.col('bin'))
)
# ComputeError: failed to determine supertype of f64 and cat
How can we replicate the pandas.cut bins in Polars?