Note to potential editors: please leave both “Python” and “Polars” in the question title, because:
- there are many questions about looking up values in another dataframe in the pandas context;
- not everyone (e.g. search engines, or beginners) knows how to use the [python-polars] tag to drill down to polars-specific questions.
“Python – Pandas x Polars – Values mapping (Lookup value)” suggests the following solution:
import numpy as np
import polars as pl

# Dimension Table
letters_ids = pl.DataFrame({
    'Letters': ['A', 'B', 'C'],
    'Letters_id': [1, 2, 3]
})

# Fact Table
many_letters = pl.DataFrame({
    'Letters': np.random.choice(['A', 'B', 'C'], 10)
})

# Convert the two columns DataFrame into a Python's dictionary.
letters_dict = dict(letters_ids.iter_rows())

# Maps the dictionary
many_letters = many_letters.with_columns(
    pl.col('Letters').map_dict(letters_dict).alias('letters_mapped')
)
But as far as I can see, with the latest version of polars for Python, map_dict is not an option:
--------Version info---------
Polars: 1.1.0
Index type: UInt32
Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.35
Python: 3.12.3 (main, Apr 15 2024, 18:25:56) [Clang 17.0.6 ]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: <not installed>
gevent: <not installed>
great_tables: <not installed>
hvplot: 0.10.0
matplotlib: 3.9.0
nest_asyncio: 1.6.0
numpy: 2.0.0
openpyxl: <not installed>
pandas: 2.2.2
pyarrow: 16.1.0
pydantic: <not installed>
pyiceberg: <not installed>
sqlalchemy: 2.0.31
torch: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[1], line 21
17 letters_dict = dict(letters_ids.iter_rows())
19 # Maps the dictionary
20 many_letters = many_letters.with_columns(
---> 21 pl.col('Letters').map_dict(letters_dict).alias('letters_mapped')
22 )
AttributeError: 'Expr' object has no attribute 'map_dict'
The other option I know of is to perform a join. This unnecessarily duplicates a lot of data (imagine that left is a lot larger than right, so that right is effectively “broadcast” over left), but is this the price one must pay in order to perform a lookup operation efficiently in terms of speed?