I am working with a DataFrame in PySpark that contains a column named datdoc, which has multiple date formats as shown below:
datdoc
07-SEP-24
07-SEP-2024
07-SEP-2024
07-SEP-2024
07-SEP-24
07-SEP-24
07-SEP-2024
07-SEP-2024
07-SEP-2024
07-SEP-2024
07-SEP-2024
I need to parse these dates into a default format. I’ve tried the following approaches, but I’m running into issues.
- First Attempt: Using CASE WHEN
I used the following payload to handle multiple date formats:
columns = {'field_name': 'datdoc', 'current_format': ['dd-MMM-yy', 'dd-MMM-yyyy'], 'data_type': 'Date'}
dateexpression = Column<'CASE WHEN (to_date(datdoc, dd-MMM-yy) IS NOT NULL) THEN to_date(datdoc, dd-MMM-yy) WHEN (to_date(datdoc, dd-MMM-yyyy) IS NOT NULL) THEN to_date(datdoc, dd-MMM-yyyy) ELSE NULL END AS datdoc'>
- Second Attempt: Single Format Parsing
I also tried simplifying to a single format:
columns = {'field_name': 'datdoc', 'current_format': ['dd-MMM-yy'], 'data_type': 'Date'}
date_expression = Column<'to_date(datdoc, dd-MMM-yy) AS datdoc'>
Python Function
from pyspark.sql import functions as F

def change_date_format(self, columns) -> None:
    def _convert_date_format(field_name: str, current_format: list, is_timestamp: bool) -> F.Column:
        base_function = F.to_timestamp if is_timestamp else F.to_date
        expression = None
        if len(current_format) == 1:
            # Single format: parse directly.
            return base_function(F.col(field_name), current_format[0]).alias(field_name)
        else:
            # Multiple formats: chain one WHEN branch per format, keep the first that parses.
            for fmt in current_format:
                current_expr = base_function(F.col(field_name), fmt)
                if expression is None:
                    expression = F.when(current_expr.isNotNull(), current_expr)
                else:
                    expression = expression.when(current_expr.isNotNull(), current_expr)
            return expression.otherwise(F.lit(None)).alias(field_name)

    cols = {col["field_name"] for col in columns}
    date_expressions = []
    for col in columns:
        if col["data_type"] in ["DateTime", "Time"]:
            date_expressions.append(_convert_date_format(col["field_name"], col["current_format"], True))
        elif col["data_type"] == "Date":
            date_expressions.append(_convert_date_format(col["field_name"], col["current_format"], False))

    # Keep the untouched columns and replace the converted ones.
    expression = [F.col(i) for i in self.df.columns if i not in cols]
    self.df = self.df.select(*date_expressions, *expression)
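For reference, the columns argument is a list of dicts shaped like the payloads above, one per column to convert. The wrapper class and sample DataFrame below are simplified stand-ins for my actual pipeline, just to show the call:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class Converter:
    def __init__(self, df):
        self.df = df

Converter.change_date_format = change_date_format  # attach the method shown above

df = spark.createDataFrame([("07-SEP-24",), ("07-SEP-2024",)], ["datdoc"])
conv = Converter(df)
conv.change_date_format([
    {"field_name": "datdoc", "current_format": ["dd-MMM-yy", "dd-MMM-yyyy"], "data_type": "Date"}
])
conv.df.show()  # the SparkUpgradeException below surfaces when this action runs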
In both cases, I encountered the following error when trying to parse 07-SEP-2024 using dd-MMM-yy:
24/09/25 21:10:18 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 7) (rhy-4 executor driver): org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '07-SEP-2024' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
Question
Is there a way to ensure that invalid date strings are returned as NULL instead of being parsed incorrectly? One approach I considered is using CASE WHEN with a RegEx pattern in PySpark. However, I would like to explore fixing my current approach first. Any guidance on how to achieve this would be greatly appreciated!
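For what it's worth, the error message itself points at the other knob: setting spark.sql.legacy.timeParserPolicy to CORRECTED, so that strings the new parser cannot handle (like 07-SEP-2024 under dd-MMM-yy) are treated as invalid and come back as NULL rather than raising. A rough sketch of how I understand that route, which I have not fully verified against my pipeline:

from pyspark.sql import functions as F

# treat strings the new parser cannot handle as invalid instead of raising
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
spark.conf.set("spark.sql.ansi.enabled", "false")  # keep invalid dates from erroring under ANSI mode

df = spark.createDataFrame([("07-SEP-24",), ("07-SEP-2024",)], ["datdoc"])
# expectation: the 4-digit-year row comes back NULL here, so the multi-format
# WHEN chain above could then fall through to its dd-MMM-yyyy branch for it
df.select("datdoc", F.to_date("datdoc", "dd-MMM-yy").alias("datdoc_yy")).show()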
As mentioned in the logs, Spark >= 3.0 does not let you cast to a date when the column has inconsistent date formats.
So we need to set the property spark.sql.legacy.timeParserPolicy to LEGACY:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
and then do the processing with the branching approach, using when():
data = [("07-SEP-24",), ("07-SEP-2024",), ("07-SEP-2024",), ("07-SEP-2024",),
("07-SEP-24",), ("07-SEP-24",), ("07-SEP-2024",), ("07-SEP-2024",),
("07-SEP-2024",), ("07-SEP-2024",), ("07-SEP-4",)]
df = spark.createDataFrame(data, ["date_string"])
#match using regex
df = df.withColumn("formatted_date",
when(
col("date_string").rlike(r"d{2}-[A-Z]{3}-d{2}"),
to_date(col("date_string"), "dd-MMM-yy")
).when(
col("date_string").rlike(r"d{2}-[A-Z]{3}-d{4}"),
to_date(col("date_string"), "dd-MMM-yyyy")
).otherwise(None))
df.show()
The second scenario standardizes the format using a UDF and then parses the data:
from pyspark.sql.functions import udf, lit

@udf
def standardize_format(date_string):
    # assumes 21st century for 2-digit years
    day = date_string.split("-")[0]
    month = date_string.split("-")[1]
    year = date_string.split("-")[2]
    return f"{day}-{month}-20{year}"

df = df.withColumn("standardized_date",
                   when(
                       col("date_string").rlike(r"\d{2}-[A-Z]{3}-\d{2}$"),
                       standardize_format(col("date_string")))
                   .otherwise(
                       when(
                           col("date_string").rlike(r"\d{2}-[A-Z]{3}-\d{4}$"),
                           col("date_string")).otherwise(lit(None))))
df = df.withColumn("date", to_date(col("standardized_date"), "dd-MMM-yyyy"))
In the second case there is no need to set spark.sql.legacy.timeParserPolicy to LEGACY, since every value is standardized to dd-MMM-yyyy before parsing.
If you are sure you only get those two patterns, and you are assuming the 21st century, you can convert both with a single pattern:
from pyspark.sql.functions import to_date
spark.conf.set("spark.sql.ansi.enabled","false") # this is needed if you get invalid dates
df = spark.createDataFrame(
[
(1, "07-SEP-2024"),
(2, "07-SEP-24"),
(3, "foo")
],
["id", "NotADate"]
)
df.withColumn("maybedate", to_date("NotADate","dd-MMM-yy")).show()
Which will return
+---+-----------+----------+
| id| NotADate| maybedate|
+---+-----------+----------+
| 1|07-SEP-2024|2024-09-07|
| 2| 07-SEP-24|2024-09-07|
| 3| foo| null|
+---+-----------+----------+