I have downloaded the NYC Taxi Zones dataset (downloaded from SODA Api and saved as json file – Not GeoJson or Shapefile). The dataset is rather small, thus I am using the whole information included. For the convenience of the post I am presenting the first 2 rows of the dataset:
- original (with struct type value of
the_geom
). - dataset after unpacking the struct type with unnest() command in polars.
The original dataset:
Below the dataset after applying unnest()
and selecting some columns and the first 2 rows.
You may import the data using polars with the following command
poc = pl.read_json("./data.json"))
I am interested in the multipolygons. I am actually trying to re-calculate the shape_area
by using the multipolygon and wkt (Well-Known-Text representation) – method used by shapely
module.
What I have done so far is to use the column coordinates
and transform it to a MultiPolygon() object – readable by the Shapely module.
def flatten_list_of_lists(data):
# return [subitem4 for subitem1 in data for subitem2 in subitem1 for subitem3 in subitem2 for subitem4 in subitem3]
return [subitem3 for subitem1 in data for subitem2 in subitem1 for subitem3 in subitem2]
The function takes as input a list[list[list[list[f64]]]]
object and transforms to a list[list[f64]]
object.
flattened_lists = [flatten_list_of_lists(row) for row in poc["coordinates"].to_list()]
print(flattened_lists)
[[[-74.18445299999996, 40.694995999999904], [-74.18448899999999, 40.69509499999987], [-74.18449799999996, 40.69518499999987], [-74.18438099999997, 40.69587799999989], [-74.18428199999994, 40.6962109999999], [-74.18402099999997, 40.69707499999989]…
Then I use the function below that applies string concatenation and basically:
- Transforms the
list[list[f64]]
object to String. - Adds the keyword MultiPolygon in front of the string.
- Replaces ‘[‘ and ‘]’ with ‘(‘., ‘)’ respectively.
def polygon_to_wkt(polygon):
# Convert each coordinate pair to a string formatted as "lon lat"
coordinates_str = ", ".join(f"{lon} {lat}" for lon, lat in polygon)
# Format into a WKT Multipolygon string (each polygon as a single polygon multipolygon)
return f"""MULTIPOLYGON ((({coordinates_str})))"""
formatted_wkt = [polygon_to_wkt(polygon) for polygon in flattened_lists]
poc = poc.with_columns(pl.Series("WKT", formatted_wkt))
Finally, I use the method wkt.loads("MultiPolygon ((()))").area
to compute the shape area of the Multipolygon object
def convert_to_shapely_area(wkt_str):
try:
return wkt.loads(wkt_str).area
except Exception as e:
print("Error converting to WKT:", e)
return None
poc = poc.with_columns(
pl.col("WKT").map_elements(convert_to_shapely_area).alias("shapely_geometry")
)
print(poc.head())
Even though for the first shape the WKT correctly returns the area of the object, while for the second MultiPolygon the methods returns the following error:
IllegalArgumentException: Points of LinearRing do not form a closed linestring
What I have noticed between the two rows, is that the multipolygon of Newark Airport is a continues object of list[list[f64]]] coordinates. Whereas, the Jamaika Bay has multiple sublists [list[list[f64]]] elements (check column coordinates to verify this). Also the screenshot below verify this statement.
Thus, is there any way to unify the multipolygons of Jamaica Bay into one single GEOmetric object before applying WKT?
P.S: Many solutions on GitHub use the shape file, but I would like to regularly re-download the NYC zones automatically with the SODA api.