I need to read a JSON file whose field names are in French and convert them to English column names.
For example, the schema looks like this:
|-- unites: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- score: integer (nullable = true)
| | |-- statutDiffusionUnite: string (nullable = true)
| | |-- unitePurgeeUnite: boolean (nullable = true)
| | |-- dateCreationUnite: date (nullable = true)
| | |-- sigleUnite: string (nullable = true)
| | |-- sexeUnite: string (nullable = true)
| | |-- periodesUnite: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- dateFin: date (nullable = true)
| | | | |-- dateDebut: date (nullable = true)
| | | | |-- etatAdministratifUnite: string (nullable = true)
I can rename top-level columns with withColumn and alias, or with withColumnRenamed. However, I am having trouble renaming the fields inside the array of structs (e.g. periodesUnite.dateFin above).
My plan is to first create a new English-named column that maps the periodesUnite array (e.g. a new column named ‘UnitsPeriods’), and then create another column that maps the unites array (e.g. named ‘Units’). After that, I would replace the new ‘Units.periodesUnite’ with the new column ‘UnitsPeriods’.
The DataFrame would then look like this:
|-- unites: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- score: integer (nullable = true)
| | |-- statutDiffusionUnite: string (nullable = true)
| | |-- unitePurgeeUnite: boolean (nullable = true)
| | |-- dateCreationUnite: date (nullable = true)
| | |-- sigleUnite: string (nullable = true)
| | |-- sexeUnite: string (nullable = true)
| | |-- periodesUnite: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- dateFin: date (nullable = true)
| | | | |-- dateDebut: date (nullable = true)
| | | | |-- etatAdministratifUnite: string (nullable = true)
|-- Units: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- score: integer (nullable = true)
| | |-- broadcast_status: string (nullable = true)
| | |-- is_purged: boolean (nullable = true)
| | |-- creation_date: date (nullable = true)
| | |-- symbol: string (nullable = true)
| | |-- sex: string (nullable = true)
| | |-- UnitPeriods: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- dateFin: date (nullable = true)
| | | | |-- dateDebut: date (nullable = true)
| | | | |-- etatAdministratifUnite: string (nullable = true)
|-- UnitsPeriods: array (nullable = false)
| |-- element: array (containsNull = true)
| | |-- element: array (containsNull = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- end_date: date (nullable = true)
| | | | |-- start_date: date (nullable = true)
| | | | |-- administrative_state: string (nullable = true)
I use the following code to create the new column, but I can’t find a way to fill it with the values from ‘unites.periodesUnite’:
from pyspark.sql import functions as F

newdf = df.withColumn(
    'UnitsPeriods',
    F.array(F.struct(*[
        # Null placeholder per field, renamed to English when a mapping exists
        F.lit(None).cast(f.dataType).alias(fr_eng_name_mappings.get(f.name, f.name))
        for f in unit_period_schema
    ])),
)
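One possible alternative to building a null-filled column, sketched here under assumptions: rename the fields in the schema itself and cast the original column to the renamed type, so the values come along. The helper below works on the plain dict that df.schema.jsonValue() returns, so it runs without a Spark session. The function name rename_fields, the trimmed sample schema, and the entries in fr_eng_name_mappings are illustrative and not taken from the real data.

```python
def rename_fields(schema_json, mapping):
    """Recursively rename struct fields in a Spark schema given as a dict
    (the shape produced by df.schema.jsonValue())."""
    if isinstance(schema_json, str):
        return schema_json  # primitive type such as "string" or "date"
    kind = schema_json.get("type")
    if kind == "struct":
        return {
            "type": "struct",
            "fields": [
                {
                    **field,
                    "name": mapping.get(field["name"], field["name"]),
                    "type": rename_fields(field["type"], mapping),
                }
                for field in schema_json["fields"]
            ],
        }
    if kind == "array":
        return {
            **schema_json,
            "elementType": rename_fields(schema_json["elementType"], mapping),
        }
    return schema_json  # other complex types are left unchanged in this sketch


def field(name, dtype):
    """Build one field entry in the schema-dict format (helper for the sample)."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def array_of(element):
    """Build an array type entry in the schema-dict format (helper for the sample)."""
    return {"type": "array", "elementType": element, "containsNull": True}


# Trimmed, hand-written stand-in for df.schema.jsonValue() (assumption).
sample_schema = {
    "type": "struct",
    "fields": [
        field("unites", array_of({
            "type": "struct",
            "fields": [
                field("id", "string"),
                field("periodesUnite", array_of({
                    "type": "struct",
                    "fields": [field("dateFin", "date"), field("dateDebut", "date")],
                })),
            ],
        })),
    ],
}

# Illustrative subset of the French-to-English mapping (assumption).
fr_eng_name_mappings = {
    "unites": "Units",
    "periodesUnite": "UnitPeriods",
    "dateFin": "end_date",
    "dateDebut": "start_date",
}

renamed = rename_fields(sample_schema, fr_eng_name_mappings)
```

With Spark available, new_schema = StructType.fromJson(renamed) should turn the dict back into a schema, and df.select(F.col('unites').cast(new_schema['Units'].dataType).alias('Units')) should then carry the values over, because Spark casts struct to struct by field position. That last step is an untested sketch.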
I also noticed that the original column has nullable set to true, but the new one has it set to false. Is there a way to set the nullability of the column?
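A possible workaround for the nullable flag, again only as a sketch: Spark derives the flag from the expression (F.array(...) itself never returns null, hence nullable = false), so one option is to relax the flags in the schema dict and rebuild the DataFrame with that schema. The helper name make_nullable and the sample schema are hypothetical; the helper runs on the dict form of the schema and needs no Spark session.

```python
def make_nullable(schema_json):
    """Return a copy of a Spark schema dict (df.schema.jsonValue() shape)
    with every nullable / containsNull flag set to True."""
    if isinstance(schema_json, str):
        return schema_json  # primitive type name, nothing to relax
    kind = schema_json.get("type")
    if kind == "struct":
        return {
            "type": "struct",
            "fields": [
                {**f, "nullable": True, "type": make_nullable(f["type"])}
                for f in schema_json["fields"]
            ],
        }
    if kind == "array":
        return {
            **schema_json,
            "containsNull": True,
            "elementType": make_nullable(schema_json["elementType"]),
        }
    return schema_json  # other complex types are left unchanged in this sketch


# Hand-written stand-in for newdf.schema.jsonValue() (assumption):
# a non-nullable array column like the 'UnitsPeriods' column above.
strict_schema = {
    "type": "struct",
    "fields": [
        {
            "name": "UnitsPeriods",
            "nullable": False,
            "metadata": {},
            "type": {
                "type": "array",
                "containsNull": False,
                "elementType": {
                    "type": "struct",
                    "fields": [
                        {"name": "end_date", "type": "date",
                         "nullable": False, "metadata": {}},
                    ],
                },
            },
        }
    ],
}

relaxed = make_nullable(strict_schema)
```

With Spark available, spark.createDataFrame(newdf.rdd, StructType.fromJson(relaxed)) should apply the relaxed schema. Whether the round trip through the RDD is acceptable for a large data set is an open assumption.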