I am importing the PySpark libraries as follows:
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql import functions as F
from pyspark.sql import types as T
conf = SparkConf().set("spark.sql.catalogImplementation","hive")
spark = SparkSession.builder.appName("SparkApp").config(conf=conf).getOrCreate()
sc = spark.sparkContext
I have a Spark DataFrame as follows:
df_delay.show(5, truncate=False)
+-----------+----------------------+----------------------+
|Call Number|Received DtTm |Dispatch DtTm |
+-----------+----------------------+----------------------+
|1030101 |04/12/2000 09:00:29 PM|04/12/2000 09:02:00 PM|
|1030104 |04/12/2000 09:09:02 PM|04/12/2000 09:10:29 PM|
|1030106 |04/12/2000 09:09:44 PM|04/12/2000 09:11:47 PM|
|1030107 |04/12/2000 09:13:47 PM|04/12/2000 09:14:13 PM|
|1030108 |04/12/2000 09:14:43 PM|04/12/2000 09:16:24 PM|
+-----------+----------------------+----------------------+
whose schema is
root
 |-- Call Number: integer (nullable = true)
 |-- Received DtTm: string (nullable = true)
 |-- Dispatch DtTm: string (nullable = true)
I want to cast both datetime columns to timestamps and find the difference between them.
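To make the goal concrete, here is a minimal sketch of the difference I am after, assuming the casting has already happened (ReceivedTS and DispatchTS are placeholder names for the casted columns):

# Sketch only: assumes ReceivedTS and DispatchTS already hold timestamps.
# Casting a timestamp to long yields epoch seconds, so subtracting the two
# gives the dispatch delay in seconds.
df_delay.withColumn(
    "delay_seconds",
    F.col("DispatchTS").cast("long") - F.col("ReceivedTS").cast("long"),
)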
To do the casting, I wrote a UDF that converts the string into a timestamp:
convert_to_datetime = F.udf(
    lambda l: F.unix_timestamp(l, "MM/dd/yyyy hh:mm:ss a").cast(T.TimestampType()),
    T.TimestampType()
)
I then tried applying it to the DataFrame:
df_delay.withColumn(
    "DispatchTS",
    convert_to_datetime('Dispatch DtTm')
).show(5)
But I get the following error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "C:\Users\ajink\AppData\Local\Temp\ipykernel_90923273235857.py", line 1, in <lambda>
  File "C:\software\programming\spark-3.4.0-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\utils.py", line 159, in wrapped
    return f(*args, **kwargs)
  File "C:\software\programming\spark-3.4.0-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\functions.py", line 5254, in unix_timestamp
    return _invoke_function("unix_timestamp", _to_java_column(timestamp), format)
  File "C:\software\programming\spark-3.4.0-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\column.py", line 63, in _to_java_column
    jcol = _create_column_from_name(col)
  File "C:\software\programming\spark-3.4.0-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\column.py", line 55, in _create_column_from_name
    sc = get_active_spark_context()
  File "C:\software\programming\spark-3.4.0-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\utils.py", line 201, in get_active_spark_context
    raise RuntimeError("SparkContext or SparkSession should be created first.")
RuntimeError: SparkContext or SparkSession should be created first.
I have no idea what I am doing wrong or how to fix it. I would greatly appreciate any advice or a solution.
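Update: looking at the pyspark source files in the traceback, unix_timestamp (like the other pyspark.sql.functions helpers) builds a Column expression and calls get_active_spark_context(), but a UDF runs inside a Python worker where no SparkContext exists, which would explain the RuntimeError. Is the fix simply to drop the UDF and apply the function to the columns directly? A minimal sketch of what I plan to try (fmt and the *TS column names are my own placeholders):

fmt = "MM/dd/yyyy hh:mm:ss a"  # 12-hour clock with an AM/PM marker

df_ts = (
    df_delay
    # to_timestamp parses the strings into proper timestamps on the JVM,
    # so no Python UDF is involved
    .withColumn("ReceivedTS", F.to_timestamp("Received DtTm", fmt))
    .withColumn("DispatchTS", F.to_timestamp("Dispatch DtTm", fmt))
)
df_ts.printSchema()

The delay itself would then be computed as in the sketch above.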