Can anyone explain why the same function gives two different outputs for the same inputs?
One environment was tested with pandas==1.5.3 and the other with pandas==2.2.0, using Python 3.10 both times.
I’m trying to use pandas.to_datetime() to convert a string column to a datetime type. The column contains sale dates as a full date and time including timezone info, with both PST and PDT offsets (GMT-0800 and GMT-0700 respectively). To make it manageable for further analysis, I pass the utc=True argument to pd.to_datetime().
The code I’m using:
import pandas as pd

t1 = "Tue Dec 16 2014 12:30:00 GMT-0800"
t2 = "Fri Apr 03 2015 01:29:00 GMT-0700"
t3 = "Fri Apr 03 2015 02:00:00 GMT-0700"
df = pd.DataFrame({'saledate': [t1, t2, t3]})
df['saledate'] = pd.to_datetime(df['saledate'], errors="coerce", utc=True)
print(df)
Sample of input dataframe:
saledate
0 Tue Dec 16 2014 12:30:00 GMT-0800
1 Fri Apr 03 2015 01:29:00 GMT-0700
2 Fri Apr 03 2015 02:00:00 GMT-0700
Output with pandas==1.5.3:
saledate
0 2014-12-16 04:30:00+00:00
1 2015-04-02 18:29:00+00:00
2 2015-04-02 19:00:00+00:00
Output with pandas==2.2.0:
saledate
0 2014-12-16 20:30:00+00:00
1 2015-04-03 08:29:00+00:00
2 2015-04-03 09:00:00+00:00
As far as I understand, the 2.2.0 output is correct: it matches the equivalent times in UTC (GMT+00:00).
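To double-check the expected UTC values independently of pandas, here is a minimal sketch using only the standard library. It assumes every string follows the layout "%a %b %d %Y %H:%M:%S GMT%z" (my reading of the sample rows, not something the dataset documents), and the names FMT, samples, and to_utc are just illustrative:

```python
from datetime import datetime, timezone

# Assumed layout of the saledate strings: "GMT" is matched literally,
# and %z reads the numeric offset that follows it (e.g. -0800).
FMT = "%a %b %d %Y %H:%M:%S GMT%z"

samples = [
    "Tue Dec 16 2014 12:30:00 GMT-0800",
    "Fri Apr 03 2015 01:29:00 GMT-0700",
    "Fri Apr 03 2015 02:00:00 GMT-0700",
]

def to_utc(text):
    """Parse one saledate string and convert it to UTC."""
    return datetime.strptime(text, FMT).astimezone(timezone.utc)

utc_values = [to_utc(s) for s in samples]
for original, converted in zip(samples, utc_values):
    print(original, "->", converted.isoformat())
```

This treats GMT-0800 as eight hours behind UTC, so 12:30 becomes 20:30 UTC, which agrees with the 2.2.0 output. The 1.5.3 output looks as if the sign of the offset had been flipped (12:30 becoming 04:30), though that is only my guess from the numbers.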
Why do the two versions interpret the same offsets differently?
For reference, this is part of a Kaggle dataset I’m using for a personal data engineering project.