I have parquet files stored in a google cloud bucket, I want to view and process that data using polars in python
I am able to download the parquet file into a polars dataframe by doing the following:
- I get the files using the google cloud storage client:
from google.cloud import storage

files_iterator = storage.Client().get_bucket("bucketname").list_blobs()
files = []
for file in files_iterator:
    files.append(file)
- I get the gsutil uri (thanks stack overflow):
uri = 'gs://' + file.id[:-(len(str(file.generation)) + 1)]
- I read in the file:
import polars as pl
pl.read_parquet(uri)
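For reference, the URI derivation from step 2 can be checked on its own without touching the bucket (the blob id and generation below are made-up stand-ins for what the storage client returns):

```python
# Hypothetical values mirroring blob.id and blob.generation from the GCS client:
blob_id = "bucketname/long_path/a_key=a_value/0.parquet/1699999999999999"
generation = 1699999999999999

# blob.id ends with "/<generation>", so strip that suffix to get the object path:
uri = "gs://" + blob_id[: -(len(str(generation)) + 1)]
# → "gs://bucketname/long_path/a_key=a_value/0.parquet"
```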
This works perfectly, except for one file whose path contains a space. My code returns the same gsutil URI that I can see in the GCP console, which encodes the space as %20:
gs://bucketname/long_path/a_key=a_value/another_key=something_prespace%20something_afterspace/0.parquet
pl.read_parquet gives a 404 for this. I’ve tried other encodings (+, backslash, %2520) to no avail.
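For completeness, the one decoding not in that list is turning the %20 back into a literal space via the standard library (the path below is a hypothetical example; whether read_parquet accepts the decoded form is exactly what I'm unsure about):

```python
from urllib.parse import unquote

# Hypothetical percent-encoded gsutil URI, as shown in the GCP console:
encoded = "gs://bucketname/long_path/a_key=a_value/another_key=before%20after/0.parquet"

# unquote replaces percent-escapes with the characters they encode:
decoded = unquote(encoded)
# decoded now contains a literal space where the %20 was
```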
The only workaround I have found is to download the blob as bytes and pass those into read_parquet, but then the dataframe has none of the metadata I need.
Appreciate any help, thank you!!