I am working with BigQuery, Dataproc, Workflows, and Cloud Storage in Google Cloud using Python.
I have two GCP projects:
- gcp-project1: contains the BigQuery dataset gcp-project1.my_dataset.my_table
- gcp-project2: contains myscript.py and my files stored in Cloud Storage
In myscript.py, I am trying to read a SQL query from a file stored in Cloud Storage (query1=gs://path/bq.sql) and query data from the BigQuery dataset in gcp-project1.
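For context, this is roughly how query1 gets loaded (a minimal sketch; the bucket and object names below are placeholders for my real gs://path/bq.sql):

from google.cloud import storage

# Placeholder names; substitute the real bucket/object behind gs://path/bq.sql
bucket_name = "my-bucket"
blob_name = "path/bq.sql"

client = storage.Client()  # runs as the Dataproc job's service account
query1 = client.bucket(bucket_name).blob(blob_name).download_as_text()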
According to the spark-bigquery-connector documentation, reading from BigQuery with a SQL query requires setting the properties viewsEnabled=true and materializationDataset=<dataset>.
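Putting that together, the documented pattern I am following looks like this (a sketch only; on Dataproc the SparkSession already exists, it is created here just to make the snippet self-contained):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()

# Required by the connector for query-based reads
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "my_dataset")

df = (spark.read.format('bigquery')
      .option('query', query1)
      .load())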
Here are the approaches I tried:
Test 1:
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "my_dataset")
This fails because the connector looks for the materialization dataset in gcp-project2 (the project the script runs in), while my dataset lives in gcp-project1. The error is: Not found: Dataset gcp-project2:my_dataset was not found in location…
Test 2:
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "gcp-project1.my_dataset")
This fails with the error: Dataset IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. So materializationDataset evidently does not accept a project-qualified name.
Test 3:
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "my_dataset")
try:
df = spark.read.format('bigquery')
.option('project', 'gcp-project1') #Adding gcp-project1 contains dataset
.option('query', query1)
.load()
df.printSchema()
df.show(10)
except Exception as e:
logger.error(f"Failed to read data from BigQuery: {e}")
sys.exit(1)
This also fails with the same error: Not found: Dataset gcp-project2:my_dataset was not found in location…
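One variation I have not tried yet: the connector documentation also lists a materializationProject property. If it applies to query-based reads, something like the following might point the materialization at gcp-project1 (untested sketch; whether materializationProject works this way is my assumption):

spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationProject", "gcp-project1")  # assumption: places the temp table in gcp-project1
spark.conf.set("materializationDataset", "my_dataset")

df = (spark.read.format('bigquery')
      .option('query', query1)
      .load())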
Question:
How can I configure my PySpark script to read data from a BigQuery dataset in gcp-project1 while running the script in gcp-project2?
Any suggestions for interacting with datasets across different GCP projects would be appreciated.
Thanks in advance!