I’m trying to copy data between two data sources. If you’re curious, I’m copying data from BigQuery to Cloud SQL, but that’s not really relevant here; this question should be transposable to other types of databases.
I can dump data from my first data source (BigQuery) into files, either JSON or CSV. I can then read from these files (either in batch or streaming), and I want to write this data into a Postgres database (Cloud SQL) as quickly and efficiently as possible. I’m using a Cloud Run job to read the data from the files. It looks like this:
# global_pool is a connection pool to the Cloud SQL (Postgres) instance
with open("data/super_data.csv", "r") as f:
    with global_pool.connection() as conn:
        with conn.cursor() as cur:
            with cur.copy("COPY my_table FROM STDIN") as copy:
                # stream the raw CSV into COPY in 1024-character chunks,
                # without parsing it client-side
                while data := f.read(1024):
                    copy.write(data)
I use psycopg3.
I have a few questions regarding this approach:
- I’m voluntarily trying not to parse the CSV file or to use write_row. Is there a more efficient way to do this? Maybe tweaking the 1024 parameter (the last snippet at the bottom of this post shows the kind of change I have in mind)? Maybe using pipeline mode?
- I tried reading from a JSON file, but it doesn’t work: it looks like the text from the file is read as-is, and the symbols (coming from the JSON) crash the COPY. Is there a way to modify the snippet above to read from JSON? I can do something like this with psql:
create table temp (data jsonb);
copy temp (data) FROM 'data/super_data_ndjson.json';
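In psycopg, I assume the equivalent would be something like the sketch below (the temp table and the NDJSON file name are just taken from the psql example, and I reuse the same global_pool as above). This is essentially what I tried, and it’s where the JSON text seems to break the COPY:

with open("data/super_data_ndjson.json", "r") as f:
    with global_pool.connection() as conn:
        with conn.cursor() as cur:
            # one jsonb column, one JSON document per line
            with cur.copy("COPY temp (data) FROM STDIN") as copy:
                while data := f.read(1024):
                    copy.write(data)

My guess is that COPY’s default text format interprets some of the characters in the JSON (backslashes in particular), but I’m not sure how to work around that from psycopg.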
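And regarding the first question, this is the kind of variation I have in mind for the chunk size: reading the file in binary mode with a much larger buffer (the 1 MB value is arbitrary):

with open("data/super_data.csv", "rb") as f:
    with global_pool.connection() as conn:
        with conn.cursor() as cur:
            with cur.copy("COPY my_table FROM STDIN") as copy:
                # same loop as above, but pushing bytes in 1 MB chunks
                while data := f.read(1024 * 1024):
                    copy.write(data)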