I’m working with a Polars DataFrame and trying to clean up a column by chaining several string operations. The first step is a str.replace() that fixes an inconsistency in the strings (" Liter" versus "L"), after which I want to extract several values into new columns.
My current approach:
import polars as pl

df = pl.DataFrame(
    {
        "engine": [
            "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",
            "3.0 Liter Twin Turbo",
            "429.0HP 5.0L 8 Cylinder Engine Gasoline Fuel",
        ],
    }
)

(
    df
    # Normalize " Liter" to "L" so the displacement pattern matches every row.
    .with_columns(
        pl.col("engine").str.replace(r'\sLiter', "L")
    )
    # Extract each value from the cleaned string into its own column.
    .with_columns(
        pl.col("engine").str.extract(r'(\d+)\.\d+HP', 1).alias("HP"),
        pl.col("engine").str.extract(r'(\d+\.\d+)L', 1).alias("Displacement"),
        pl.col("engine").str.extract(r'(\d+)\sCylinder', 1).alias("Cylinder"),
    )
)
Since I’m going to apply several operations to the main DataFrame, I want to wrap them in a function to make the code cleaner and more reusable. This is the function-based approach I’ve come up with:
def get_engine(engine_col: pl.Expr) -> tuple[pl.Expr, ...]:
    # Build the expressions from the argument rather than a hard-coded column.
    return (
        engine_col.str.extract(r'(\d+)\.\d+HP', 1).alias("HP"),
        engine_col.str.extract(r'(\d+\.\d+)L', 1).alias("Displacement"),
        engine_col.str.extract(r'(\d+)\sCylinder', 1).alias("Cylinder"),
        engine_col.str.contains("Electric").alias("Electric"),
    )

(
    df
    .with_columns(
        pl.col("engine").str.replace(r'\sLiter', "L")
    )
    .with_columns(
        get_engine(engine_col=pl.col("engine"))
    )
)
Is there a better or more efficient way to combine these operations while keeping the code clean?