I wish to convert this snippet of pandas code into Polars code, both to learn Polars and to see whether I can gain any speed:
df_list = []
for datum in data:
    df = pd.DataFrame()
    temp_data = datum.data  # list of tuples of numpy ndarray
    df["A"] = temp_data[1].ravel().astype(np.float32)
    df["B"] = temp_data[2].ravel().astype(np.float32)
    df["C"] = datum.analDate  # datetime.datetime
    df["D"] = datum.validDate.replace(hour=int(datum.validityTime/100))
    for name in names:
        df[name] = temp_data[0].ravel().astype(np.float32)
    df_list.append(df)
df = pd.concat(df_list)
df['E'] = df['C'].dt.hour
The idea is that I have many files in a custom binary format. After reading them with a custom reader, I can access some of those fields which are either numpy ndarray or datetime.datetime. I wish to iterate the files and save them into a list, and at the end concatenate them together lazily.
Being new to Polars, I wish to understand:
- The best way to create a Polars dataframe from that memory, without copying data if possible.
- How to deal with "broadcasting". For instance, consider columns C and D. Do I have to create a list with a single datetime.datetime instance and multiply it by the length of the other columns?
- How to deal with duplicate column names. For instance, name may be a duplicate in names. How does Polars manage it?
You can start by looking at https://docs.pola.rs/py-polars/html/reference/api/polars.from_numpy.html and https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.lazy.html.
For your example I can give you a little snippet of what you can try.
import polars as pl
import numpy as np

df_list = []
for datum in data:
    temp_data = datum.data
    # NumPy arrays can be passed to the DataFrame constructor directly
    df = pl.DataFrame(
        {
            "A": temp_data[1].ravel().astype(np.float32),
            "B": temp_data[2].ravel().astype(np.float32),
        }
    ).with_columns(
        # pl.lit broadcasts a single datetime.datetime to every row
        pl.lit(datum.analDate).alias("C"),
        pl.lit(datum.validDate.replace(hour=int(datum.validityTime / 100))).alias("D"),
    )
    for name in names:
        # with_columns replaces an existing column that has the same name
        df = df.with_columns(pl.Series(name, temp_data[0].ravel().astype(np.float32)))
    df_list.append(df)
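Then you can concatenate the data using df = pl.concat(df_list). If you want the concatenation to be lazy, convert each frame with .lazy() before concatenating and call .collect() at the end.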
And you can derive column E by extracting the hour from C with the dt.hour method: df = df.with_columns(pl.col("C").dt.hour().alias("E")).