I have a piece of code that parses some data into series, applies some modifications (including potentially turning a few values into NaN), and eventually builds these series into a dataframe (see the simplified code below):
sers = []
for item in items:
    type_name = item.name
    values = {pd.to_datetime(value.date_string): value.doc_count for value in item.values}
    series = pd.Series(values, name=type_name, dtype=pd.Int64Dtype())
    reindexed_series = series.reindex(date_range, fill_value=0)
    exclude_inapplicable_days(reindexed_series, type_name)
    apply_offset(reindexed_series, type_name)
    sers.append(reindexed_series)
df = pd.DataFrame(sers)
In exclude_inapplicable_days and apply_offset I change some values to NaN in these series; for us, NaN means something different from the 0 we used as the fill_value.
Everything is fine while the data is still individual series: they keep their integer dtype because I specify it explicitly. But df turns it all back into floats, despite every series in sers having a NaN-compatible integer dtype.
Why is this happening? Is there a way around it without iterating over df again and converting everything back?
Reproducible example:
import pandas as pd

date_range = pd.date_range(start="2023-01-01", end="2023-01-05")

items = [
    {"name": "Type1", "values": [{"date_string": "2023-01-01", "doc_count": 1},
                                 {"date_string": "2023-01-02", "doc_count": 2},
                                 {"date_string": "2023-01-03", "doc_count": 3},
                                 {"date_string": "2023-01-04", "doc_count": 4},
                                 {"date_string": "2023-01-05", "doc_count": 5}]},
    {"name": "Type2", "values": [{"date_string": "2023-01-01", "doc_count": 6},
                                 {"date_string": "2023-01-02", "doc_count": 7},
                                 {"date_string": "2023-01-03", "doc_count": 8},
                                 {"date_string": "2023-01-04", "doc_count": 9},
                                 {"date_string": "2023-01-05", "doc_count": 10}]}
]

def exclude_inapplicable_days(series, type_name):
    # Mark the first day as NA (inapplicable), which is distinct from 0
    series[series.index[0]] = pd.NA
    print(f"After exclusion in {type_name}: {series.dtype}")

def apply_offset(series, type_name):
    offset = {
        "Type1": 2,
        "Type2": 1
    }[type_name]
    if offset > 0:
        # Blank out the trailing `offset` days
        series.iloc[-offset:] = pd.NA
    print(f"After offset in {type_name}: {series.dtype}")

sers = []
for item in items:
    type_name = item['name']
    values = {pd.to_datetime(val['date_string']): val['doc_count'] for val in item['values']}
    series = pd.Series(values, name=type_name, dtype=pd.Int64Dtype())
    reindexed_series = series.reindex(date_range, fill_value=0)
    exclude_inapplicable_days(reindexed_series, type_name)
    apply_offset(reindexed_series, type_name)
    sers.append(reindexed_series)

df = pd.DataFrame(sers)
print("DataFrame dtypes:")
print(df.dtypes)
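For reference, running this should print something along these lines (the series dtypes per the prints above, and float columns per the behaviour described; exact formatting may vary with the pandas version):

After exclusion in Type1: Int64
After offset in Type1: Int64
After exclusion in Type2: Int64
After offset in Type2: Int64
DataFrame dtypes:
2023-01-01    float64
2023-01-02    float64
2023-01-03    float64
2023-01-04    float64
2023-01-05    float64
Freq: D, dtype: object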
dtypes in a DataFrame are defined column-wise.
The DataFrame constructor keeps the dtypes if you pass the input Series as columns. Here you're passing a list of Series (= records), so the output is a transposition of the inputs: you end up with 5 columns/Series, each with a newly inferred dtype. Imagine one of your two Series had been strings; the output dtypes would then have been upcast to object:
# inputs: one Int64 Series, one string[python] Series
pd.DataFrame([sers[0], sers[1].astype('string')]).dtypes
2023-01-01 object
2023-01-02 object
2023-01-03 object
2023-01-04 object
2023-01-05 object
Freq: D, dtype: object
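The same column-wise inference is what turns your all-Int64 input into float columns. As a quick sketch (reusing the sers from your example), you can see it directly:

print(pd.DataFrame(sers).dtypes)  # float64 for every date column, per the behaviour in the question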
One option would be to concat and transpose:

df = pd.concat(sers, axis=1).T
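This builds the frame column-wise first, so each column keeps its Int64 dtype; in recent pandas versions the transpose also preserves the extension dtype as long as the frame is homogeneous (all Int64 here). A quick check, as a sketch:

print(pd.concat(sers, axis=1).T.dtypes)  # expect Int64 for every column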
Another would be to convert_dtypes:

df = pd.DataFrame.from_records(sers).convert_dtypes()
Output:
2023-01-01 2023-01-02 2023-01-03 2023-01-04 2023-01-05
Type1 <NA> 2 3 <NA> <NA>
Type2 <NA> 7 8 9 <NA>
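With the second option, from_records first goes through the same row-wise construction (and thus the float upcast), and convert_dtypes then re-infers the nullable Int64 dtype from the values. A sketch of the check:

print(pd.DataFrame.from_records(sers).convert_dtypes().dtypes)  # expect Int64 for every column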
If I construct the dataframe from a dict and then transpose it, it works, because this basically copies each series into a column, forcing pandas to keep the dtypes:
df = pd.DataFrame({s.name: s for s in sers}).T
That still feels super verbose, and it's more operations than I would like, which might end up causing problems depending on the amount of data.