Modin DataFrame Merge Issue After dropna on Categorical Column:
I’m encountering an issue when using Modin to merge DataFrames that contain categorical columns. The issue arose after I performed a dropna operation on one of the categorical columns. Specifically, when I used the following code:
X = X.dropna(subset=[key])
instead of
X = X.loc[pd.notnull(X[key])]
The error traceback I’m getting when I try to merge the DataFrames is:
AttributeError: 'numpy.ndarray' object has no attribute 'categories'
This error occurs when Modin attempts to merge the DataFrames containing categorical columns. The stack trace suggests that something went wrong with handling the categorical type after the dropna operation.
Expected Behavior:
I expect the merge to work as intended, preserving the categorical dtype for the category_column.
Actual Behavior:
Instead, I get the following error during the merge:
AttributeError: 'numpy.ndarray' object has no attribute 'categories'
The merge is performed on two key columns, both of which have the category dtype.
The error appears to be caused by how Modin handles categorical data after dropna is used to filter rows.
This behaviour is not seen in pandas.
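For comparison, here is a minimal pandas-only sketch of the scenario. The column names (key1, key2, x, y) and the data are made up for illustration and are not from my real workload. Swapping the import for modin.pandas is how one would try the same steps under Modin, although I have not confirmed that a frame this small triggers the error.

```python
import pandas as pd  # assumption: replace with `import modin.pandas as pd` to try this under Modin
from pandas import CategoricalDtype

# Hypothetical data: two categorical key columns sharing the same categories.
cats = CategoricalDtype(["a", "b", "c"])
left = pd.DataFrame({
    "key1": pd.Series(["a", "b", None, "c"], dtype=cats),  # contains one missing value
    "key2": pd.Series(["a", "b", "c", "c"], dtype=cats),
    "x": [1, 2, 3, 4],
})
right = pd.DataFrame({
    "key1": pd.Series(["a", "b", "c"], dtype=cats),
    "key2": pd.Series(["a", "b", "c"], dtype=cats),
    "y": [10.0, 20.0, 30.0],
})

# The two filtering approaches from the report.
via_dropna = left.dropna(subset=["key1"])
via_loc = left.loc[pd.notnull(left["key1"])]

# Merge on both categorical columns.
merged_dropna = via_dropna.merge(right, on=["key1", "key2"], how="left")
merged_loc = via_loc.merge(right, on=["key1", "key2"], how="left")

print(merged_dropna.dtypes)
print(merged_dropna.equals(merged_loc))
```

In pandas both paths give the same merged frame and the key columns stay categorical, which is the behaviour I expect from Modin as well.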
Environment:
- Modin version: 0.30.0
- Pandas version: 2.2.2
- Ray version: 2.5.0
- Python version: 3.9.13
Additional Context:
- The issue is specific to how Modin (with Ray) handles categorical columns after applying dropna. After filtering with dropna(subset=[key]), it seems Modin loses the categorical type, which leads to issues when merging with other DataFrames that have categorical columns. I have confirmed that using the alternative approach, X = X.loc[pd.notnull(X[key])], avoids the error and works as expected.
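To narrow down where the dtype changes, a small check like the one below can be run after each filtering step. This is only a sketch: categorical_flags is a hypothetical helper name, and the frame is illustrative data, not my real data.

```python
import pandas as pd  # assumption: replace with `import modin.pandas as pd` to inspect Modin frames
from pandas import CategoricalDtype


def categorical_flags(df, columns):
    """Return {column: bool} telling whether each listed column is categorical."""
    return {col: isinstance(df[col].dtype, CategoricalDtype) for col in columns}


# Illustrative frame with one categorical column containing a missing value.
frame = pd.DataFrame({
    "key": pd.Series(["a", None, "b"], dtype="category"),
    "v": [1, 2, 3],
})

before = categorical_flags(frame, ["key"])
after_dropna = categorical_flags(frame.dropna(subset=["key"]), ["key"])
after_loc = categorical_flags(frame.loc[pd.notnull(frame["key"])], ["key"])

print(before, after_dropna, after_loc)
```

If the flag for the dropna path differs from the loc path under Modin, that would support the suspicion that the categorical dtype is lost during dropna rather than during the merge itself.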
What I’ve Tried:
Switching to loc[pd.notnull(…)]: I confirmed that using X = X.loc[pd.notnull(X[key])] avoids the issue and the merge works correctly.
Testing with other filtering methods: Other ways of filtering (like dropna with different columns or without categorical types) do not trigger the error.