Here is some code that reproduces the issue.
# create df
import pandas as pd
df2 = pd.DataFrame({'var1': ['1_0', '1_0', '1_0', '1_0', '1_0'],
                    'var2': ['X', 'y', 'a', 'a', 'a']})
df2.to_json('df2.json')
# import df
df2 = pd.read_json('df2.json')
df2
This is the expected output:
var1 var2
0 1_0 X
1 1_0 y
2 1_0 a
3 1_0 a
4 1_0 a
However, it generates:
var1 var2
0 10 X
1 10 y
2 10 a
3 10 a
4 10 a
If I modify an entry inside ['var1'] to a non-numeric string, then the output when the df is imported is correct.
Here is an example to illustrate it; I replaced one of the entries with 'hello':
df2 = pd.DataFrame({'var1': ['1_0', 'hello', '1_0', '1_0', '1_0'],
                    'var2': ['X', 'y', 'a', 'a', 'a']})
df2.to_json('df2.json')
# import df
df2 = pd.read_json('df2.json')
df2
This generates:
var1 var2
0 1_0 X
1 hello y
2 1_0 a
3 1_0 a
4 1_0 a
The same problem is observed if the file is saved in CSV format and then imported, as in the sketch below.
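For reference, this is a minimal sketch of the CSV round-trip (the file name df2.csv is just for illustration):
# save and reload the same data as CSV
df2 = pd.DataFrame({'var1': ['1_0', '1_0', '1_0', '1_0', '1_0'],
                    'var2': ['X', 'y', 'a', 'a', 'a']})
df2.to_csv('df2.csv', index=False)
df2 = pd.read_csv('df2.csv')
df2  # var1 comes back converted in the same way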
Has anyone encountered the same issue?
This is due to the fact that underscores are valid separators in Python (they are often used as a thousands separator: 1_000 is 1000), so a value like '1_0' is read back as the number 10.
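You can see the parsing rule in plain Python, independent of pandas:
int('1_0')     # 10: underscores are accepted as digit separators (PEP 515)
1_000 == 1000  # True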
You could force the dtype upon import (or pass dtype=False to disable dtype detection entirely):
df2 = pd.read_json('df2.json', dtype='str')
If you want to keep dtype detection for the other columns:
df2 = pd.read_json('df2.json', dtype={'var1': 'str'})
Output:
var1 var2
0 1_0 X
1 1_0 y
2 1_0 a
3 1_0 a
4 1_0 a
When the JSON contains a value such as 'hello' that cannot be interpreted as a number, there is no ambiguity: the column is clearly not numeric, so the conversion is not performed.
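As a side note, here is a minimal sketch of the dtype=False variant mentioned above; it switches off dtype detection for every column, so both columns should come back as plain strings:
df2 = pd.read_json('df2.json', dtype=False)  # no dtype inference at all
df2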