I’m importing data from a SQL table into an Azure Databricks dataframe by exporting it to a CSV file, saving it in Azure Blob Storage, and then loading it into the dataframe. Most records are imported correctly, but for a few, the RecordInsertedDate shows 2023-12-31 in both SQL and the CSV, yet appears as NULL in the dataframe.
Could someone help me understand why this happens? Also, is there a simpler way to import data directly from a SQL table into a dataframe? Here’s the code I’m using:
from pyspark.sql.functions import month, year, current_date, col

g5_df1 = (spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("dbfs:/mnt/userspace/test/output/a.csv"))
g5_df1.createOrReplaceTempView("g5_df1")
g5_df1.display()
You can read the data directly from your SQL table into a DataFrame over JDBC and create a temporary view from it, with no CSV export in between.
I have tried the below approach:
# `spark` and `display` are available as built-ins in a Databricks notebook
jdbcHostname = "<your SQL SERVER>"
jdbcPort = 1433
jdbcDatabase = "db02"
jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase}"
connectionProperties = {
"user": "admin02",
"password": "Welcome@1"
}
# Read the table straight into a DataFrame over JDBC, no CSV step needed
dilip_df = spark.read.jdbc(url=jdbcUrl, table="SampleTable", properties=connectionProperties)
display(dilip_df)
Results:
+---+-------------+------------------+
| ID|         Name|RecordInsertedDate|
+---+-------------+------------------+
|  1|Sample Name 1|        2023-12-31|
|  2|Sample Name 2|        2023-12-31|
+---+-------------+------------------+
Creating a temporary view:
dilip_df.createOrReplaceTempView("temp_sample_table")
result_df = spark.sql("SELECT * FROM temp_sample_table")
display(result_df)
+---+-------------+------------------+
| ID|         Name|RecordInsertedDate|
+---+-------------+------------------+
|  1|Sample Name 1|        2023-12-31|
|  2|Sample Name 2|        2023-12-31|
+---+-------------+------------------+
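As a side note on the JDBC route: if you only need a subset of the table, the reader also accepts a subquery in place of a table name, so the filter is executed by SQL Server rather than by Spark. A minimal sketch reusing the connection details above; the WHERE clause is just illustrative:

# Wrap the query as a derived table; the alias ("AS t") is required
pushdown_query = "(SELECT ID, Name, RecordInsertedDate FROM SampleTable WHERE RecordInsertedDate >= '2023-01-01') AS t"
filtered_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(filtered_df)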
As you mentioned, you are seeing NULLs when reading the .csv file from ADLS. This usually happens because inferSchema guesses a date or timestamp type for RecordInsertedDate, and any value that does not match the pattern Spark expects is silently read as NULL.
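A quick way to confirm this is to read the file once with every column left as a string and once with inferSchema, then compare the NULL counts in the date column. A minimal sketch, using the path from your question:

# Raw read: leave every column as a string so nothing is lost
raw_df = (spark.read.format("csv")
          .option("header", "true")
          .load("dbfs:/mnt/userspace/test/output/a.csv"))

# Inferred read: Spark guesses column types and may mis-parse the dates
inferred_df = (spark.read.format("csv")
               .option("header", "true")
               .option("inferSchema", "true")
               .load("dbfs:/mnt/userspace/test/output/a.csv"))

# If the first count is higher than the second, inference is mis-parsing dates
print(inferred_df.filter(inferred_df.RecordInsertedDate.isNull()).count(),
      "NULL dates after inferSchema")
print(raw_df.filter(raw_df.RecordInsertedDate.isNull()).count(),
      "genuinely empty dates in the raw file")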
To work around that, I have tried the below approach:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import pyspark.sql.functions as F
# Define the schema explicitly, keeping RecordInsertedDate as a string for now
schema = StructType([
StructField("ID", IntegerType(), True),
StructField("Name", StringType(), True),
StructField("RecordInsertedDate", StringType(), True)
])
csv_file_path = "abfss://folder02@<Your STORAGE Account>.dfs.core.windows.net/new02/Query 1 (3).csv"
dilip_df2 = spark.read.csv(csv_file_path, header=True, schema=schema)

# Convert the string column to a real date; adjust the pattern to match your file
dilip_df2 = dilip_df2.withColumn(
    "RecordInsertedDate",
    F.to_date(F.col("RecordInsertedDate"), "MM/dd/yyyy")
)
display(dilip_df2)
Results:
+---+-------------+------------------+
| ID|         Name|RecordInsertedDate|
+---+-------------+------------------+
|  1|Sample Name 1|        2023-12-31|
|  2|Sample Name 2|        2023-12-31|
+---+-------------+------------------+
dilip_df2.createOrReplaceTempView("temp_sample_table2")
spark.sql("SELECT * FROM temp_sample_table2").show()
Results:
+---+-------------+------------------+
| ID| Name|RecordInsertedDate|
+---+-------------+------------------+
| 1|Sample Name 1| 2023-12-31|
| 2|Sample Name 2| 2023-12-31|
+---+-------------+------------------+
- In the code above, the schema for the CSV file is defined explicitly.
- The file is read with that schema, with RecordInsertedDate temporarily set to StringType so no value is lost during parsing.
- The RecordInsertedDate column is then converted to a date after reading; change the format pattern according to your data.
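Alternatively, if you would rather keep inferSchema than define the schema by hand, the CSV reader's dateFormat option tells Spark which pattern the file actually uses, so matching values are parsed instead of becoming NULL. A minimal sketch, assuming your dates are written as MM/dd/yyyy (depending on your Spark version, the column may be inferred as a timestamp and need timestampFormat instead):

# Keep schema inference, but declare the date pattern used in the file
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("dateFormat", "MM/dd/yyyy")
      .load("dbfs:/mnt/userspace/test/output/a.csv"))
display(df)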