I am trying to convert date and time strings to POSIXct, however it gives me the wrong output. I have the following code:
df <- data.frame(date = c("2020-10-25", "2020-10-25", "2020-10-25", "2020-10-25", "2020-10-25", "2020-10-25", "2020-10-25"),
hour = c("00:00", "01:00", "02:00", "02:00", "03:00", "04:00", "05:00"))
df$date_time <- as.POSIXct(paste0(df$date," ", df$hour)
, format="%Y-%m-%d %H:%M",
tz = "Europe/Paris")
If I try to compare the output with an hourly sequence generation with the same datetime aforementioned, the daylight saving time (DST) is not the same:
df2 <- seq(as.POSIXct("2020-10-25 00:00:00", tz = "Europe/Paris"),
as.POSIXct("2020-10-25 05:00:00", tz = "Europe/Paris"),
by = "hour")
df$date_time == df2
[1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE
How may I convert the date and time from string with the right DST?
4
The problem is that the limits of your sequence are on different timezones:
> as.POSIXct("2020-10-25 00:00:00", tz = "Europe/Paris"),
[1] "2020-10-25 CEST"
> as.POSIXct("2020-10-25 05:00:00", tz = "Europe/Paris")
[1] "2020-10-25 05:00:00 CET"
So when you use seq
, you get three times on CEST and four on CET:
> df2 <- seq(as.POSIXct("2020-10-25 00:00:00", tz = "Europe/Paris"),
as.POSIXct("2020-10-25 05:00:00", tz = "Europe/Paris"),
by = "hour")
> strftime(df2, "%F %R %Z")
[1] "2020-10-25 00:00 CEST" "2020-10-25 01:00 CEST" "2020-10-25 02:00 CEST"
[4] "2020-10-25 02:00 CET" "2020-10-25 03:00 CET" "2020-10-25 04:00 CET"
[7] "2020-10-25 05:00 CET"
On the other hand, when creating the initial data.frame
you get four times on CEST and three on CET, because the third and fourth elements are both input as “2020-10-25 02:00”.
> strftime(df$date_time, "%F %R %Z", tz = "Europe/Paris")
[1] "2020-10-25 00:00 CEST" "2020-10-25 01:00 CEST" "2020-10-25 02:00 CEST"
[4] "2020-10-25 02:00 CEST" "2020-10-25 03:00 CET" "2020-10-25 04:00 CET"
[7] "2020-10-25 05:00 CET"
If looking at both vectors side by side, it is clear that df
has a repeated value, while seq
creates a true sequence:
> cbind(df$date_time, df2)
df2
[1,] 1603576800 1603576800
[2,] 1603580400 1603580400
[3,] 1603584000 1603584000
[4,] 1603584000 1603587600
[5,] 1603591200 1603591200
[6,] 1603594800 1603594800
[7,] 1603598400 1603598400
So when comparing both vectors, on the fourth element you are comparing a CEST time with a CET time, and it doesn’t match.
The crucial part is that your input is not an hourly sequence, because (I’m guessing) you want the first 02:00 to be in CEST (summer time) and the second 02:00 to be in CET (winter time). However, R has no way of knowing this from your input…
How is as.POSIXct
supposed to know that "2020-10-25 02:00:00"
and "2020-10-25 02:00:00"
are different? Usually, it is recommended to record datetimes either in UTC or to also record a CET/CEST indicator (or even better the timezone offset).
You will need to solve this by explicitly supplying the datetime of the DST switch or relying on values being duplicated.
df <- data.frame(date = c("2020-10-25", "2020-10-25", "2020-10-25", "2020-10-25", "2020-10-25", "2020-10-25", "2020-10-25"),
hour = c("00:00", "01:00", "02:00", "02:00", "03:00", "04:00", "05:00"))
df$date_time_str <- paste0(df$date," ", df$hour)
df$offset <- ifelse(cumsum(duplicated(df$date_time_str)) > 0, -1, -2)
df$date_time <- as.POSIXct(df$date_time_str, tz = "UTC") + df$offset * 3600
as.POSIXct(df$date_time, tz = "Europe/Paris")
#[1] "2020-10-25 00:00:00 CEST" "2020-10-25 01:00:00 CEST" "2020-10-25 02:00:00 CEST" "2020-10-25 02:00:00 CET"
#[5] "2020-10-25 03:00:00 CET" "2020-10-25 04:00:00 CET" "2020-10-25 05:00:00 CET"
Some additional work is needed if you have several DST switches in your data.