When I’m reading some json payloads PySpark is changing the data even if I read it as a StringType and I want this as a String because I don’t want to have each field as a column at this step. I just want to get this payload as String as it is in payload/source file
Locally I’m using Spark 3.3 in Jupiter Notebook with Glue 4 image
PySpark version: 3.3.0+amzn.1.dev0
Here my payload/source (test.txt):
{"payload":{"points":1220000000}}
{"payload":{"count":1550554545.0}}
{"payload":{"points":125888002540.0, "count":1550554545.0}}
{"payload":{"name": "Roger", "count":55154111.0}}
Here my code:
path = "/home/glue_user/workspace/jupyter_workspace/test/test.txt"
schema = StructType([StructField('payload', StringType(), True)])
my_df = spark.read.schema(schema).option("inferSchema", "false").json(path)
my_df.show(truncate=False)
Here the result where PySpark is setting the float number in scientific notation, even when I read it as String.
+------------------------------------------------+
|payload |
+------------------------------------------------+
|{"points":1220000000} |
|{"count":1.550554545E9} |
|{"points":1.2588800254E11,"count":1.550554545E9}|
|{"name":"Roger","count":5.5154111E7} |
+------------------------------------------------+
Why I can’t simply have my data as it is?
Why the final result is changed into my string field and receive this scientific notation?
i.e:
“count”:1550554545.0
“count”:1.550554545E9
2
I did a research little bit and asked my colleagues about it and I got to know, spark JSON reader automatically infers schema and you can’t override it.
What you can do is read it as a text first and then try to covert values.
I got same issue while reading data from SQL DB and I used cast expression in query itself.
For your use this is working for me. I tried to use from_json, MapType, get_json_object all three but I think whenever we convert to json, spark will convert to scientific notation.
df = spark.read.text(path)
df = df.withColumn('value', regexp_replace('value', r'^{"payload":', ''))
df = df.withColumn('value', regexp_replace('value', r'}$', ''))
3