I have the Python script below, which is fed arguments to read and execute an SQL file and write the results to a txt file. I am currently trying to optimise it for large datasets – for example, the SQL query it reads selects a large number of fields (over 100) on around 4 million records.
My code shows my attempt at splitting the query into chunks (using OFFSET/LIMIT) to keep memory usage down, as it would crash without this. The code works but is very slow (around 30 minutes) compared to running the full query without OFFSET/LIMIT in an SQL query browser such as MySQL Workbench (around 4 minutes).
def mysql_query():
    chunk_size = 1000000
    offset = 0
    ndate = date()
    db = mysql_dbconnection()

    print("Executing SQL query")
    cur = db.cursor()

    print("Writing output to file")
    ### Create File & Write Records To File
    with open(args["dpath"] + args["dfile"] + "." + args["ext"], "w", encoding='utf8') as feed_file:
        # Open SQL File for Reading
        with open(args["sqlsfile"], 'r') as file:
            query = " ".join(file.readlines())
            print(query)

        cur.execute(f"{query} LIMIT 1")
        cur.fetchone()

        ### Get Field Names
        field_tuple = [i[0] for i in cur.description]
        field_names = '\t'.join(field_tuple)
        #print(field_names)

        # Write main header
        feed_file.write("EDI_Test_" + ndate + '\n')
        # Write field headers
        feed_file.write(field_names + '\n')

        while True:
            cur.execute(f"{query} LIMIT {chunk_size} OFFSET {offset}")
            #print(f"{query} LIMIT {chunk_size} OFFSET {offset}")

            ### Get Records
            records = cur.fetchall()
            if not records:
                break

            for record in records:
                # Replace None with an empty string, then tab-join the fields
                record = ['' if x is None else x for x in record]
                r = '\t'.join(str(x) for x in record)
                #print(r)
                feed_file.write(r + '\n')

            offset += chunk_size
            print(offset)

        # Write footer
        feed_file.write("EDI_ENDOFFILE")

    # Close the cursor and connection
    cur.close()
    db.close()
Another way I found to split the query into chunks would be to add a WHERE clause on the primary key (WHERE pk > n LIMIT chunk_size) and keep increasing n until no more records are returned. I am not sure how practical this is, though, because when reading the SQL file I would then have to detect whether a WHERE clause is already present and work out where to insert the new condition.
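For illustration, what I had in mind is roughly this untested sketch. It assumes the query's base table has an integer primary key column (which I am calling id here, and which I assume is the first selected column), and it naively appends the WHERE clause, which would break if the query already contains one:

last_id = 0
while True:
    # Keyset pagination: seek straight to the next chunk via the indexed primary key.
    # NOTE: naively appending WHERE breaks if the query already has a WHERE clause.
    cur.execute(f"{query} WHERE id > %s ORDER BY id LIMIT %s", (last_id, chunk_size))
    records = cur.fetchall()
    if not records:
        break
    for record in records:
        record = ['' if x is None else x for x in record]
        feed_file.write('\t'.join(str(x) for x in record) + '\n')
    # Remember the key of the last row written (assumes id is the first selected column)
    last_id = records[-1][0]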
Is there a more efficient (faster) way to split the large SQL query into chunks and write the results to file in Python/MySQL?
LIMIT with a large OFFSET is known to have huge efficiency issues. Your WHERE solution may be cumbersome, but it will probably be the way to go: with OFFSET, the server basically has to evaluate the query from the start on every iteration just to find where to begin taking records and where to end. A WHERE on an indexed key avoids that.
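To make the difference concrete, here is a schematic comparison (the table and key names are only placeholders, not your schema):

# OFFSET pagination: on every iteration MySQL has to generate and discard the first
# `offset` rows before it can return the chunk, so later chunks get slower and slower.
cursor.execute(f"{query} LIMIT {chunk_size} OFFSET {offset}")

# Keyset (WHERE) pagination: with an index on the key, MySQL seeks straight to the
# start of the chunk and reads only the rows it actually returns.
cursor.execute("SELECT col1, col2 FROM your_table WHERE your_pk > %s ORDER BY your_pk LIMIT %s",
               (last_id, chunk_size))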
Now, I don’t know if you have enough space to make a temporary table, but if you do, you might draw inspiration from the following (untested) example:
# Create the temporary table, duplicating the query results
cursor.execute(f"CREATE TEMPORARY TABLE temp_table AS ({query})")

# Add an auto-increment primary key to the temporary table and fill it
cursor.execute("ALTER TABLE temp_table ADD COLUMN temp_id INT AUTO_INCREMENT PRIMARY KEY FIRST")
connection.commit()
# Query the temporary table in chunks of chunk_size
last_id = 0
while True:
    # Query the temporary table using a WHERE clause on the surrogate key
    cursor.execute(
        """
        SELECT * FROM temp_table
        WHERE temp_id > %s AND temp_id <= %s
        ORDER BY temp_id
        """,
        (last_id, last_id + chunk_size)
    )

    # Fetch the results, including the `temp_id` column
    rows = cursor.fetchall()

    # Break the loop if no more rows are returned
    if not rows:
        break

    # PLUG IN YOUR OWN EXPORT ROUTINES export_start, export_row, export_close
    export_start(last_id)
    for row in rows:
        export_row(row[1:])  # Exclude the `temp_id` column
    export_close(last_id)

    # Update the last processed id
    last_id += chunk_size

# Drop the temporary table, then close the cursor and connection
cursor.execute("DROP TEMPORARY TABLE IF EXISTS temp_table")
cursor.close()
connection.close()
Basically you turn the query into a temporary table (it will be deleted automatically when the connection closes, or you can drop it yourself at the end of the routine). By adding an AUTO_INCREMENT primary key you get a perfectly numbered table (no gaps in the keys) from 1 to the number of records in the query, so a WHERE on temp_id can quickly reference exactly the chunk you want.
The downside is a big temporary table.
More experienced programmers can probably add some optimizations.