I’m using Polars to process a dataset where I need to create unique labels from two columns and then perform joins to get indices for those labels. However, I noticed that if I perform the joins directly on the LazyFrame, the indices seem to be incorrect. When I collect the LazyFrame into a DataFrame before performing the joins, the indices are correct.
Here are the relevant code snippets:
- Creating the labels LazyFrame:
import polars as pl
# Assume data is a LazyFrame
data = pl.scan_csv(
    source=source_filepath,
    separator='\t',  # tab-separated input
    has_header=False,
)
# Concatenate col_1 and col_2, and get unique labels with index
labels = (
    pl.concat([
        data.select(pl.col("col_1").alias("label")),
        data.select(pl.col("col_2").alias("label")),
    ])
    .unique(keep="first")
    .with_row_count(name="label_index")
)
- Joining without collecting (This gives incorrect indices):
# Join to get index_1
data = data.join(
    labels,
    left_on="col_1",
    right_on="label",
).rename({"label_index": "index_1"})
# Join to get index_2
data = data.join(
    labels,
    left_on="col_2",
    right_on="label",
).rename({"label_index": "index_2"})
result = data.select(["index_1", "index_2", "col_3"])
result_df = result.collect()
- Joining after collecting labels (This gives correct index_1 and index_2 values):
# Collect the labels and data LazyFrames into DataFrames
labels_df = labels.collect()
data_df = data.collect()
# Join to get index_1
data_df = data_df.join(
    labels_df,
    left_on="col_1",
    right_on="label",
    how="left",
).rename({"label_index": "index_1"})
# Join to get index_2
data_df = data_df.join(
    labels_df,
    left_on="col_2",
    right_on="label",
    how="left",
).rename({"label_index": "index_2"})
result_df = data_df.select(["index_1", "index_2", "col_3"])
Why do the results differ between joining the LazyFrames directly and collecting them before the joins? How can I get correct indices without collecting the LazyFrame prematurely?
Any insights into why this happens and how to resolve it would be greatly appreciated!