I want to update a specific column at a specific row index range.
Here’s what I want to achieve in pandas:
df = pd.DataFrame({ "foo": [0,0,0,0] })
df["foo"].iloc[0:3] = 1
# or
df.iloc[0:3, df.columns.get_loc("foo")] = 1
How can I achieve this seemingly simple operation in polars? It seems to be possible to update a single row with the following:
df = pl.DataFrame({ "foo": [0,0,0,0] })
df[0, "foo"] = 1
but trying to update a range fails:
df[0:3, "foo"] = 1
# TypeError: cannot use "slice(0, 3, None)" for indexing
df[0:3]["foo"] = 1
# TypeError: DataFrame object does not support `Series` assignment by index
The recommended answer of using pl.when(pl.col("row_number").between(...)).then(...)
adds a significant overhead that shouldn’t be needed considering that the row number is sequential, ordered, and starting at 0. On a dataset with a million row, I’m seeing a 20x difference in performance between pandas df.iloc[...] = x
and the current polars solution. Is there really no alternative ?
3
After some testing, the fastest solution by far is to build the new column ourselves, by stiching together the old values and the ones we want to update:
start, stop, value = 0, 3, -1
df = df.with_columns(
foo=data[0 : start, "foo"]
.append(pl.repeat(value, stop - start, eager=True, dtype=data["foo"].dtype))
.append(data[stop : len(data), "foo"]),
)
However, it results in a series with 3 chunks, which might make further computations slower. All the other following solutions result in a uniquely chunked series, but are 10x slower.
Using scatter
.
df = df.with_columns(df["foo"].scatter(range(start, stop), value))
Using extend
instead of append
:
df = df.with_columns(
foo=data[0 : start, "foo"]
.extend(pl.repeat(value, stop - start, eager=True, dtype=data["foo"].dtype))
.extend(data[stop : len(data), "foo"]),
)
Using when
df = df.with_row_index().with_columns(
pl.when(pl.col("index").is_between(start, stop))
.then(value)
.otherwise(pl.col("foo")),
)
Using update
.
df = df.with_row_index().update(
pl.select(
pl.int_range(start, stop, dtype=pl.UInt32).alias("index"),
pl.lit(value),
),
on="index",
)
6
If your dataframe has an index (as it can be added with pl.DataFrame.with_row_index
), you can use a pl.when().otherwise()
construct to selectively overwrite the values of a certain column.
(
df
.with_row_index()
.with_columns(
pl.when(pl.col("index").is_between(1, 3)).then(1).otherwise("foo").name.keep()
)
)
shape: (4, 2)
┌───────┬─────┐
│ index ┆ foo │
│ --- ┆ --- │
│ u32 ┆ i64 │
╞═══════╪═════╡
│ 0 ┆ 0 │
│ 1 ┆ 1 │
│ 2 ┆ 1 │
│ 3 ┆ 0 │
└───────┴─────┘
If you don’t want to add an index to your dataframe, you can create it on-the-fly by using pl.int_range
.
df.with_columns(
pl.when(
pl.int_range(pl.len()).is_between(1, 2)
).then(
1
).otherwise(
"foo"
).name.keep()
)
shape: (4, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 0 │
│ 1 │
│ 1 │
│ 0 │
└─────┘
3