I need to apply a compute-intensive function, consisting mainly of numpy calls, to a column of a polars DataFrame with the pl.Array(...) datatype. I can do it using iter_slices, creating DataFrames, and concatenating them. But this really feels like what map_batches is meant for, except that it doesn't handle multiple return values. Is there something else I should look at, or should I call this good enough until I implement a Rust UDF?
<code>import polars as pl
import numpy as np

filter1 = np.array([-1, 1, -1, 1, 1])
filter2 = np.array([0.3, 0.3, -0.3, 0.3, 0.3])

def compute(x):
    # x is a 2D numpy array of shape (n_rows, 5)
    a = np.dot(x, filter1)
    b = np.dot(x, filter2)
    return np.sin(a - b), np.cos(a + b)

df = pl.DataFrame(
    {"a": [np.arange(5.0) + i for i in range(5)]},
    schema={"a": pl.Array(pl.Float64, 5)},
)

# Process in chunks, build a DataFrame per chunk, concatenate at the end
dfs = []
for df_iter in df.iter_slices(10000):
    peak_x, peak_y = compute(df_iter["a"].to_numpy())
    dfs.append(pl.DataFrame({"c": peak_x, "d": peak_y}))
df_out = pl.concat(dfs)
</code>