I have a dataframe with a column called ‘geometry’ that contains MultiPolygon and Polygon values. I want to use PySpark or Python to find and return the rows whose coordinates contain [X, Y, Z], and I would also like a code block that removes the Z value if it exists. How can I do this? In the example below, I’d like to return the first coordinate value and none of the others.
I want to do something like the code below, but I can’t figure out how to append a new column to the dataframe, or how to write the code that looks for and returns only the rows with X, Y, Z geometry:
Code I’m thinking of using to find the Z value, which does not work:
for row in source_df.collect():
    z_val = len(row.geometry["coordinates"]) =3 for x in row.geometry["coordinates"]]
Example data:
{
  "type": "MultiPolygon",
  "coordinates": [
    [-120.92484404138442,35.54577502278743,0.0],
    [-120.92484170835023,35.545764670080004],
    [-120.92470946198651,35.54517811398435],
    [-120.92373579577058,35.54476080459215],
    [-120.92224560209857,35.544644824151],
    [-120.91471743922112,35.54405891151482],
    [-120.9137131887035,35.541405607829184],
    [-120.91370267246779,35.54138005556737],
    [-120.91368022915093,35.54133577314701],
    [-120.91365314934913,35.54129325687539],
    [-120.91364620938849,35.541283659095036],
    [-120.91019544280519,35.53661949063082],
    [-120.91016692865233,35.536584105321104],
    [-120.91013516362523,35.53655061941634],
    [-120.9101046793985,35.53652289241281],
    [-120.90545581970368,35.53257237955164],
    [-120.90540343303125,35.53253236763702]
  ]
}
You can use a udf in Spark that parses the JSON string, checks whether the geometry type is ‘Polygon’ or ‘MultiPolygon’, and iterates through the coordinates to identify any point containing three values (X, Y, Z).
Finally, add a Boolean column (z_column) and filter the df to retain only the rows where this column is True.
import json

from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# create df (assumes an existing SparkSession `spark`, and `data` holding rows
# whose "geometry" field is a GeoJSON string)
df = spark.createDataFrame(data)

# define a UDF to check for the z coords
def contains_z_coordinate(geometry):
    try:
        geometry_object = json.loads(geometry)
        # Check if it's a Polygon or MultiPolygon
        if geometry_object["type"] in ["Polygon", "MultiPolygon"]:
            coords = geometry_object["coordinates"]
            if geometry_object["type"] == "Polygon":
                coords = [coords]  # Wrap Polygon so it nests like a MultiPolygon
            # Check if any point has a z-coordinate
            for polygon_coords in coords:
                for ring_coordinates in polygon_coords:
                    for point in ring_coordinates:
                        if len(point) == 3:
                            return True
    except Exception as e:
        # Malformed or unexpectedly nested geometry: log it and report no Z
        print(f"{e}")
        return False
    return False

# Register the UDF
z_udf = udf(contains_z_coordinate, BooleanType())

# Add a column with UDF and filter
result_df_using_filter = df.withColumn("z_column", z_udf(col("geometry"))).filter(
    col("z_column")
)

# results
result_df_using_filter.show(truncate=False)
Output:
+--------------------------------------------------------------------------------------------------------------------------------+--------+
|geometry |z_column|
+--------------------------------------------------------------------------------------------------------------------------------+--------+
|{"type":"MultiPolygon","coordinates":[[[[-120.92484404138442,35.54577502278743,0.0],[-120.92484170835023,35.545764670080004]]]]}|true |
+--------------------------------------------------------------------------------------------------------------------------------+--------+
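The question also asks how to remove the Z value where it exists. Below is a minimal sketch in plain Python, assuming the geometry is a GeoJSON string like the one in the output above. The helper names (is_position, has_z, drop_z, strip_z_from_geojson) are made up for illustration. It walks the coordinate list recursively, so it does not depend on how deeply the points are nested.

```python
import json


def is_position(value):
    """True for a single coordinate such as [x, y] or [x, y, z]."""
    return (
        isinstance(value, list)
        and len(value) >= 2
        and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in value
        )
    )


def has_z(coords):
    """Return True if any position in the nested list has a third value."""
    if not isinstance(coords, list):
        return False
    if is_position(coords):
        return len(coords) >= 3
    return any(has_z(part) for part in coords)


def drop_z(coords):
    """Return a copy of the nested list with every position cut to [x, y]."""
    if not isinstance(coords, list):
        return coords
    if is_position(coords):
        return coords[:2]
    return [drop_z(part) for part in coords]


def strip_z_from_geojson(geometry_json):
    """Parse a GeoJSON geometry string and return it with Z values removed."""
    geometry = json.loads(geometry_json)
    geometry["coordinates"] = drop_z(geometry["coordinates"])
    return json.dumps(geometry)


# Hypothetical sample: one Polygon ring where some points carry a Z value.
sample = (
    '{"type": "Polygon", "coordinates": '
    "[[[1.0, 2.0, 0.0], [3.0, 4.0], [5.0, 6.0, 7.0], [1.0, 2.0, 0.0]]]}"
)
cleaned = strip_z_from_geojson(sample)
print(has_z(json.loads(sample)["coordinates"]))
print(cleaned)
```

In PySpark, strip_z_from_geojson could probably be wrapped the same way as the check above, with udf(strip_z_from_geojson, StringType()) and withColumn. That wiring is an assumption and is not shown here.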