I am getting data from DBnomics, which returns it as a pandas DataFrame. I prefer polars, so I call pl.from_pandas(df),
which sometimes fails with TypeError: 'float' object cannot be converted to 'PyString'.
Here is the code:
import pandas as pd
from dbnomics import fetch_series
import polars as pl
df = fetch_series("WB/WDI/A-NY.GDP.MKTP.CD-RUS")
df = pl.from_pandas(df)
I could fix it in this specific case, but is there a way to robustly convert pandas to polars?
I’m not sure if this is a “bug” or not?
The problem in this specific example is that the original_value column contains both the string “NA” and float values.
>>> df["original_value"]
0 NA
1 NA
2 NA
3 NA
4 NA
...
59 1693115002708.320068
60 1493075894362.139893
61 1843392293734.379883
62 2266029240645.339844
63 2021421476035.419922
Name: original_value, Length: 64, dtype: object
Polars columns (Series) can only have a single type, and Polars is trying to convert this column to str, which fails due to the float values.
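To see which columns are likely to trigger this error before converting, one option is to list the object columns that hold more than one Python type. This is only a sketch: mixed_type_columns is a hypothetical helper name, not part of pandas or polars, and the sample frame is made up to mimic the original_value column.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> dict[str, set[str]]:
    """Return object-dtype columns that hold more than one Python type."""
    mixed = {}
    for name in df.columns:
        col = df[name]
        if col.dtype != object:
            continue  # typed columns (float64, int64, ...) are already uniform
        kinds = {type(value).__name__ for value in col}
        if len(kinds) > 1:
            mixed[name] = kinds
    return mixed


# Made-up sample that mimics the mixed "NA" / float column from the question.
sample = pd.DataFrame(
    {
        "original_value": ["NA", 1.5, 2.5],
        "country": ["RUS", "RUS", "RUS"],
    }
)
report = mixed_type_columns(sample)
print(report)
```

Any column that shows up in the report mixes types and probably needs cleaning before pl.from_pandas is called.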
Replacing “NA” with an actual NaN value allows it to succeed:
df = pl.from_pandas(
    df.assign(original_value=df["original_value"].replace("NA", float("nan")))
)
shape: (64, 16)
┌────────────┬───────────────┬──────────────┬──────────────────────────────┬───┬─────────┬───────────────────┬───────────────────┬────────────────────┐
│ @frequency ┆ provider_code ┆ dataset_code ┆ dataset_name ┆ … ┆ country ┆ frequency (label) ┆ indicator (label) ┆ country (label) │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │
╞════════════╪═══════════════╪══════════════╪══════════════════════════════╪═══╪═════════╪═══════════════════╪═══════════════════╪════════════════════╡
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
│ annual ┆ WB ┆ WDI ┆ World Development Indicators ┆ … ┆ RUS ┆ Annual ┆ GDP (current US$) ┆ Russian Federation │
└────────────┴───────────────┴──────────────┴──────────────────────────────┴───┴─────────┴───────────────────┴───────────────────┴────────────────────┘
>>> df["original_value"]
shape: (64,)
Series: 'original_value' [f64]
[
null
null
null
null
null
…
1.6931e12
1.4931e12
1.8434e12
2.2660e12
2.0214e12
]
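The same fix can be generalised to every object column, which is one possible answer to the "robust conversion" part of the question. The sketch below is an assumption-laden illustration, not an official polars recipe: normalize_object_columns and to_polars are hypothetical helper names, and "NA" is assumed to be the only placeholder for missing data. Columns whose remaining values all parse as numbers become float64; anything else is cast to pandas' string dtype, so each column ends up with a single type.

```python
import pandas as pd


def normalize_object_columns(df: pd.DataFrame, na_tokens=("NA",)) -> pd.DataFrame:
    """Give every object column a single type: float64 if possible, else string."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue  # already a single, well-defined dtype
        # Treat placeholder strings such as "NA" as missing values.
        cleaned = col.where(~col.isin(list(na_tokens)))
        numeric = pd.to_numeric(cleaned, errors="coerce")
        if numeric.notna().sum() == cleaned.notna().sum():
            # Every non-missing value parsed as a number (numeric strings too).
            out[name] = numeric.astype("float64")
        else:
            # Fall back to text; missing values stay missing.
            out[name] = cleaned.astype("string")
    return out


def to_polars(df: pd.DataFrame):
    """Hypothetical wrapper: clean the frame, then hand it to polars."""
    import polars as pl  # imported here so the cleaning step works without polars

    return pl.from_pandas(normalize_object_columns(df))


# Made-up sample: one numeric column with "NA" markers, one genuinely mixed column.
sample = pd.DataFrame(
    {
        "original_value": ["NA", 1.5, 2.5],
        "mixed": ["a", 2.0, "NA"],
    }
)
fixed = normalize_object_columns(sample)
print(fixed.dtypes)
```

Note that this also turns numeric-looking strings such as "1.5" into floats, which may or may not be what you want. Casting everything to string instead would be closer to the DuckDB behaviour.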
DuckDB, for example, interprets the column as type String.
>>> duckdb.sql("from df").pl()["original_value"]
shape: (64,)
Series: 'original_value' [str]
[
"NA"
"NA"
"NA"
"NA"
"NA"
…
"1693115002708.32"
"1493075894362.14"
"1843392293734.38"
"2266029240645.34"
"2021421476035.42"
]
It’s not clear to me what the correct behaviour is here.