I have a column in a pandas DataFrame that contains some null values. I want to convert the entire column to the object type.
When I use astype, the null values stop being null and come back as non-null values. I don’t want that: I want them to stay null, so that the number of nulls is the same before and after changing the column type. I’ve already tried apply with lambda x: np.nan if x is np.nan else str(x), but it doesn’t work. Can anyone help?
Example:
column_original = [1, 2, np.nan, 6, 8, np.nan]
This column has two NaNs. After applying astype to the column, the number of NaNs changes to 0. I have tried
df['column_original'].astype(str)
and
df['column_original'].apply(lambda x: np.nan if x is np.nan else x)
and others.
Code example:
import pandas as pd
import numpy as np
column_original = [1, 2, np.nan, 6, 8, np.nan]
df = pd.DataFrame(column_original, columns=['column_original'])
df:
column_original
0 1.0
1 2.0
2 NaN
3 6.0
4 8.0
5 NaN
Convert the values of the column to int and then to str, without converting the null values:
cond = df['column_original'].isna()
s = df['column_original'].astype('Int64').astype('str')
df['column_original'] = df['column_original'].where(cond, s)
df:
column_original
0 1
1 2
2 NaN
3 6
4 8
5 NaN
Checking the result:
for i in df['column_original']:
    print(i, type(i), pd.isna(i))
Output:
1 <class 'str'> False
2 <class 'str'> False
nan <class 'float'> True
6 <class 'str'> False
8 <class 'str'> False
nan <class 'float'> True
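The where-based steps above can be wrapped in a small helper so that the null count is easy to check before and after. This is only a sketch: the function name stringify_non_null is hypothetical, and it assumes every non-null value is a whole number, as in the example data.

```python
import numpy as np
import pandas as pd


def stringify_non_null(series: pd.Series) -> pd.Series:
    """Hypothetical helper: turn non-null values into strings, keep nulls as nulls."""
    is_null = series.isna()
    # Assumption: non-null values are whole numbers (1.0, 2.0, ...).
    # The Int64 cast raises for values such as 1.5.
    as_text = series.astype("Int64").astype(str)
    # Where is_null is True the original (null) entry is kept,
    # everywhere else the string version is used.
    return series.astype(object).where(is_null, as_text)


df = pd.DataFrame({"column_original": [1, 2, np.nan, 6, 8, np.nan]})

nulls_before = int(df["column_original"].isna().sum())
df["column_original"] = stringify_non_null(df["column_original"])
nulls_after = int(df["column_original"].isna().sum())

print(nulls_before, nulls_after)  # 2 2
```

Comparing nulls_before and nulls_after is a quick way to confirm the conversion did not turn any null into a string.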
If your starting point is the data itself (i.e. the list column_original), rather than a df already containing it, you can accomplish the desired result using pd.DataFrame and passing a pd.Series with dtype=str:
import pandas as pd
import numpy as np
column_original = [1, 2, np.nan, 6, 8, np.nan]
data = {'float': column_original,
'object': pd.Series(column_original, dtype=str)}
df = pd.DataFrame(data)
Output:
df
float object
0 1.0 1
1 2.0 2
2 NaN NaN
3 6.0 6
4 8.0 8
5 NaN NaN
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 float 4 non-null float64
1 object 4 non-null object
dtypes: float64(1), object(1)
memory usage: 224.0+ bytes
for col in df:
    print(f"{col}, values: {df[col].tolist()}")
float, values: [1.0, 2.0, nan, 6.0, 8.0, nan]
object, values: ['1', '2', nan, '6', '8', nan]
Note: you can also set dtype=str inside pd.DataFrame, but that will affect all columns:
pd.DataFrame(data, dtype=str).dtypes
float object
object object
dtype: object
Of course, if you start with the df, but still have the list as well, you can also overwrite the data in a similar way:
column_original = [1, 2, np.nan, 6, 8, np.nan]
df = pd.DataFrame(column_original, columns=['column_original'])
df = df.assign(column_original=pd.Series(column_original, dtype=str))
df['column_original'].tolist()
# ['1', '2', nan, '6', '8', nan]
Explanation: when you convert the column to str, NaN becomes the string 'nan'. So in the apply function, instead of the lambda expression lambda x: np.nan if x is np.nan else x, use lambda x: np.nan if x == 'nan' else x.
IPython code:
import pandas as pd
import numpy as np
column_original = [1.0, 2.0, np.nan, 6.0, 8.0, np.nan]
df = pd.DataFrame(column_original, columns=['column_original'])
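# Select the column first: DataFrame.apply would pass whole columns (Series) to the lambda
str_df = df['column_original'].astype('str').apply(lambda x: np.nan if x == 'nan' else x)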
# You can verify it with .isna() function.
str_df.isna()
Output:
0 False
1 False
2 True
3 False
4 False
5 True
Name: column_original, dtype: bool
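A possible alternative, offered as a sketch rather than as part of the answer above, is to skip the nulls during the conversion instead of restoring them afterwards. Series.map accepts na_action="ignore", which leaves null entries alone so that str is never called on them. Note the assumption that float formatting is acceptable: the values here are floats, so the resulting strings look like '1.0' rather than '1'.

```python
import numpy as np
import pandas as pd

column_original = [1.0, 2.0, np.nan, 6.0, 8.0, np.nan]
series = pd.Series(column_original, name="column_original")

# na_action="ignore" propagates null entries without passing them to str(),
# so they never become the string 'nan'.
str_series = series.map(str, na_action="ignore")

# The null count is unchanged by the conversion.
null_count = int(str_series.isna().sum())
print(null_count)  # 2
```

If strings without the trailing '.0' are needed, the values would have to be converted to an integer type first.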
Also, an apt quoted statement regarding when to use is and ==:
Use is when you want to check against an object’s identity (e.g. checking to see if var is None). Use == when you want to check equality (e.g. is var equal to 3?).
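For NaN specifically, neither check is dependable, which a few lines of plain Python can illustrate. The sketch below uses only the standard library; np.nan is an ordinary Python float, so the same behaviour applies to it. The variable names are illustrative.

```python
import math

nan_a = float("nan")
nan_b = float("nan")

# Identity: two separately created NaN objects are normally distinct objects
# (this is the CPython behaviour), so an "is" test can miss a NaN.
same_object = nan_a is nan_b

# Equality: NaN never compares equal to anything, not even to itself.
equal_to_itself = nan_a == nan_a

# A dedicated check detects both values reliably.
both_detected = math.isnan(nan_a) and math.isnan(nan_b)

print(equal_to_itself, both_detected)  # False True
```

For pandas data, pd.isna plays the role that math.isnan plays here, as in the checking loop shown in the earlier answer.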
Edit:
After reading @ouroboros1's answer, I want to simplify my answer to just two cases (the explanation remains the same):
- If you want a pandas Series object or a Python list:
column_original = [1, 2, np.nan, 6, 8, np.nan]
series = pd.Series(column_original, dtype=str)
pylist = pd.Series(column_original, dtype=str).tolist()
- If you want a pandas DataFrame object:
column_original = [1, 2, np.nan, 6, 8, np.nan]
df = pd.DataFrame(column_original, columns=['column_original'], dtype='str')
Hope it helps.