I have a CDC data feed from SQL Server that I ingest into our Delta lakehouse using Azure Databricks. Some of the fields are stored as a datetimeoffset in SQL Server, and when I load them with PySpark I have to ingest them as a StringType rather than a TimestampType (otherwise they parse as null). If I then try to clean up the data using the to_timestamp() function, I also get null. The problem seems to be with specifying the offset: if I do not specify an offset in the format string, the data parses fine, but the offset is ignored, and I need it because these timestamps are keyed to the customer’s timezone, so every transaction is in a different timezone depending on the customer. I’ve posted an example below.
dfData.withColumn('DateCreated', to_timestamp(col('DateCreated'), 'M/d/yyyy h:mm:ss a ZZZZZ'))
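To make this easier to reproduce end to end, here is a minimal, self-contained version of the same call; the single hardcoded row is illustrative and stands in for the actual CDC feed, and the comments describe what I observe on my cluster:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.getOrCreate()

# One hardcoded sample row standing in for the actual CDC feed
dfData = spark.createDataFrame(
    [('4/15/2020 3:09:50 PM -07:00',)], ['DateCreated']
)

# Without an offset specifier this parses on my cluster,
# but the -07:00 is dropped
dfData.withColumn(
    'DateCreated', to_timestamp(col('DateCreated'), 'M/d/yyyy h:mm:ss a')
).show(truncate=False)

# With the offset specifier it returns null
dfData.withColumn(
    'DateCreated', to_timestamp(col('DateCreated'), 'M/d/yyyy h:mm:ss a ZZZZZ')
).show(truncate=False)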
A sample date looks like: 4/15/2020 3:09:50 PM -07:00. If I remove the 'ZZZZZ' from the format string, I get a timestamp that looks like 2020-04-15T15:09:50.000+00:00 (note the missing offset information). With the 'ZZZZZ' in the format string, it returns null.
. I have tried every iteration of the various format options for offset strings listed in the pyspark documentation but none of them work (and a few seem to be no longer supported as I get IllegalArgumentException: Illegal pattern character
errors if I use x
or O
: only X
and Z
seem to be supported as they do not throw exceptions—however regardless of the number of characters I specify (e.g. 'xxx'
or 'Z'
or 'ZZZZ'
or 'xx'
), the data always parses as null). Has anyone else run into this issue? How do I resolve it? For now, I’m leaving the data as StringType
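For completeness, this is roughly how I cycled through the candidate offset specifiers, reusing the dfData repro from above (the list is illustrative rather than exhaustive, and the x and O variants are left out because they throw before anything is parsed):

from pyspark.sql.functions import col, to_timestamp

# None of these raise an exception on my cluster, and every one of
# them parses the sample value to null.
for offset in ['X', 'XX', 'XXX', 'Z', 'ZZZZ', 'ZZZZZ']:
    fmt = 'M/d/yyyy h:mm:ss a ' + offset
    parsed = dfData.withColumn('Parsed', to_timestamp(col('DateCreated'), fmt))
    print(fmt, '->', parsed.select('Parsed').first()[0])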
Has anyone else run into this issue? How do I resolve it? For now I’m leaving the data as a StringType, but that will make any datetime analysis difficult for downstream workloads if I cannot resolve this.
In case it’s needed, my cluster is running Databricks Runtime 15.1, which includes Apache Spark 3.5.0, and the data is coming from SQL Server 2016 (v13.0.6435.1).
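I don’t know whether it matters, but since a couple of the documented pattern letters throw Illegal pattern character, here is how I checked which datetime parser policy the cluster is using (I have not changed it from the cluster default):

# Spark 3 replaced the pre-3.0 datetime parser; this setting controls
# whether the legacy parser is still in use. Mine is whatever the
# cluster default is.
print(spark.conf.get('spark.sql.legacy.timeParserPolicy', 'not set'))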