Problem:
I’m facing a significant performance issue when executing a BigQuery query in a Vertex AI pipeline. Here’s the example code:
from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT * FROM `project.dataset.table`"  # returns 5,000,000 rows
query_job = client.query(query)
df = query_job.result().to_dataframe()
The line df = query_job.result().to_dataframe()
takes around 50 minutes to execute on Vertex AI, whereas it takes less than 5 minutes on my local machine.
The pipeline step already has generous resources allocated:
.set_cpu_limit("60")
.set_memory_limit("128G")
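For reference, these limits are set on the pipeline task roughly like this (a simplified sketch using the KFP v2 SDK; load_table stands in for my real component):

from kfp import dsl

@dsl.component
def load_table():
    # Placeholder body; the real component runs the BigQuery query above.
    pass

@dsl.pipeline(name="bq-load-pipeline")
def bq_pipeline():
    task = load_table()
    task.set_cpu_limit("60")
    task.set_memory_limit("128G")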
I’m using the to_dataframe()
method to load the data into a Pandas DataFrame; it returns around 5 million rows.
Questions:
- Why is this operation much slower on Vertex AI compared to a local machine?
- What adjustments can I make to the Vertex configuration or the code to improve the performance of this operation?
Any advice would be greatly appreciated!
To load data from Google BigQuery you can also use pandas_gbq.read_gbq
as an alternative, which may improve performance; refer to the pandas-gbq documentation.
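A minimal sketch of that alternative against the table from the question; setting use_bqstorage_api=True (an optional pandas-gbq parameter) typically speeds up multi-million-row downloads, provided google-cloud-bigquery-storage is installed:

import pandas_gbq

# Read the query results straight into a DataFrame via pandas-gbq.
df = pandas_gbq.read_gbq(
    "SELECT * FROM `project.dataset.table`",
    project_id="project",      # replace with your GCP project ID
    use_bqstorage_api=True,    # requires google-cloud-bigquery-storage
)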
You can also try to make your query run faster on BigQuery using query optimization:
- avoid SELECT *; specify only the columns (and filter only the rows) you actually need,
- loading the entire table introduces delay; you can use LIMIT to cap the rows returned (see the sketch after this list).
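For instance, a hedged sketch of both suggestions; col_a and col_b are hypothetical stand-ins for the columns you actually need:

from google.cloud import bigquery

client = bigquery.Client()

# Select only the needed columns and cap the row count instead of SELECT *.
query = """
SELECT col_a, col_b
FROM `project.dataset.table`
LIMIT 100000
"""
df = client.query(query).result().to_dataframe()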
You can also check the docs Optimize query computation and Convert BigQuery results to a Pandas DataFrame.
You can find some suggestions here: Convert BigQuery results to Pandas DataFrame. Let us know if any of them work for you.
- Can you try installing all the extras? They pull in google-cloud-bigquery-storage, which lets to_dataframe() download results through the much faster BigQuery Storage API:
pip install 'google-cloud-bigquery[all]'
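Once the extras are installed, you can ask to_dataframe() to use the Storage API explicitly; a small sketch against the query from the question, assuming a reasonably recent google-cloud-bigquery:

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query("SELECT * FROM `project.dataset.table`")

# With google-cloud-bigquery[all] installed, this downloads the result
# via the BigQuery Storage API rather than the slower REST pagination.
df = query_job.result().to_dataframe(create_bqstorage_client=True)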
- Can you create the Notebook instance in the same location/region as your BigQuery dataset? Cross-region reads add network latency.
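One quick way to check where the data lives, so you can match the notebook region to it (a small sketch; "project.dataset" is a placeholder for your dataset ID):

from google.cloud import bigquery

client = bigquery.Client()

# Compare this location with the region of your Vertex AI notebook/pipeline.
dataset = client.get_dataset("project.dataset")  # placeholder dataset ID
print(dataset.location)  # e.g. "US" or "europe-west1"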