I’m trying to read data from a database and then save it to a Parquet file using Kotlin and Apache Spark.
The JDBC driver I use: com.mysql.cj.jdbc.Driver
val customerDf = spark
    .read()
    .jdbc(
        "jdbc:mysql://$host:$port/$database",
        "t_customer",
        props)
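For context, `props` here is a plain java.util.Properties holding the connection settings; a minimal sketch, assuming user/password authentication (the credential values are placeholders):

import java.util.Properties

// Connection properties passed to spark.read().jdbc(...); the values are placeholders.
val props = Properties().apply {
    setProperty("driver", "com.mysql.cj.jdbc.Driver")
    setProperty("user", "db_user")
    setProperty("password", "db_password")
}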
After loading the data and checking the schema, it turned out that all fields allow null values, even though the table definition clearly imposes NOT NULL constraints. I fixed it by adding the following line:
val schemaCustomerDf = spark.createDataFrame(customerDf.toJavaRDD(), Customer.SCHEMA)
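For reference, a minimal sketch of what Customer.SCHEMA could look like and how the resulting frame is then written out, assuming the nullability mirrors the table definition and the output path is a placeholder:

import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.StructType

// Hypothetical sketch of Customer.SCHEMA: nullable = false for the NOT NULL columns,
// nullable = true for the optional ones, mirroring the Parquet DDL below.
val SCHEMA: StructType = DataTypes.createStructType(listOf(
    DataTypes.createStructField("id", DataTypes.IntegerType, false),
    DataTypes.createStructField("name", DataTypes.StringType, false),
    DataTypes.createStructField("name_2", DataTypes.StringType, true),
    DataTypes.createStructField("surname", DataTypes.StringType, false),
    DataTypes.createStructField("gender", DataTypes.StringType, false),
    DataTypes.createStructField("email_address", DataTypes.StringType, false),
    DataTypes.createStructField("phone_number", DataTypes.StringType, true),
    DataTypes.createStructField("date_birth", DataTypes.DateType, false),
    DataTypes.createStructField("date_join", DataTypes.DateType, false),
    DataTypes.createStructField("address_street", DataTypes.StringType, false),
    DataTypes.createStructField("address_street2", DataTypes.StringType, true),
    DataTypes.createStructField("address_city", DataTypes.StringType, false),
    DataTypes.createStructField("address_state", DataTypes.StringType, false),
    DataTypes.createStructField("address_postal_code", DataTypes.StringType, false),
    DataTypes.createStructField("address_country_iso_2", DataTypes.StringType, false)
))

// Writing the frame with the enforced schema to Parquet; the path is a placeholder.
schemaCustomerDf.write().parquet("output/customer.parquet")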
Parquet DDL:
required int32 id
required binary name (STRING)
optional binary name_2 (STRING)
required binary surname (STRING)
required binary gender (STRING)
required binary email_address (STRING)
optional binary phone_number (STRING)
required int32 date_birth (DATE)
required int32 date_join (DATE)
required binary address_street (STRING)
optional binary address_street2 (STRING)
required binary address_city (STRING)
required binary address_state (STRING)
required binary address_postal_code (STRING)
required binary address_country_iso_2 (STRING)
However, after reading the data back from the previously saved Parquet file, all fields are nullable again.
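A minimal sketch of the read-back where this shows up, assuming the same placeholder path as above (printSchema() is just one way to inspect nullability):

// Reading the previously written file; printSchema() reports every column as nullable.
val customerFromParquet = spark.read().parquet("output/customer.parquet")
customerFromParquet.printSchema()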
Is this some kind of bug or am I doing something wrong?
I tried to force the schema defined in the Customer class, but I don’t think that’s the way to go, because Parquet files already store schema metadata themselves.
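For completeness, this is roughly what forcing the schema on read looks like; a sketch assuming the same placeholder path:

// Supplying the schema explicitly instead of relying on the Parquet metadata.
val forcedCustomerDf = spark
    .read()
    .schema(Customer.SCHEMA)
    .parquet("output/customer.parquet")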