The first three grouped dataframes insert into MySQL perfectly. The fourth dataframe (the last in the series) appears to insert no data. For example, running `SELECT COUNT(*) FROM TEMP_2024_05_18;` in MySQL returns 0, whereas every other temp table shows the correct number of rows!
NOTE: there is an interesting finding under THINGS I’VE TRIED AND INTERESTING FINDINGS below.
In this example I have one master dataframe that is split into four grouped dataframes based on a column called groupBy, which contains a date string separated with underscores.
Grouped_Dataframe_Name | Row_Counts |
---|---|
2024_05_15 | 1612 |
2024_05_16 | 1332 |
2024_05_17 | 96 |
2024_05_18 | 83 |
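
The split and the per-group row counts can be sketched as follows. This is a minimal illustration, not my real data: only the groupBy column name comes from my script, while the `value` column and the five toy rows are made up.

```python
import pandas as pd

# Toy stand-in for the master dataframe. Only the 'groupBy' column name
# matches the real data; the 'value' column and the rows are hypothetical.
df = pd.DataFrame({
    "groupBy": ["2024_05_15", "2024_05_15", "2024_05_16", "2024_05_17", "2024_05_18"],
    "value": [1, 2, 3, 4, 5],
})

# One grouped dataframe per distinct 'groupBy' value.
grouped = df.groupby("groupBy")

# Row count per group, keyed by the group name.
row_counts = grouped.size().to_dict()

for group_name, group_data in grouped:
    # group_name is the date string, group_data is that group's dataframe.
    print(group_name, len(group_data))
```
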
The script then creates a temporary table in MySQL for each grouped dataframe and inserts that dataframe's rows into it. All temporary tables are created correctly, and all of them are populated correctly EXCEPT the last table in the series: it is created correctly (all columns and types match the others) but contains no data in MySQL.
My Python script performs the below steps:
- Splits a dataframe into grouped dataframes using a segmentation column called groupBy (contains a date string, e.g. 2024_05_17).
- FOR EACH GROUPED DATAFRAME: creates a temporary table in MySQL with a dynamic name such as tmp_lookup_2024_05_17, where the date portion reflects the groupBy value of the grouped dataframe.
- FOR EACH GROUPED DATAFRAME: inserts the rows from the grouped dataframe into its corresponding table.
CODE SNIPPET
```python
from sqlalchemy import text

# Group the DataFrame by the 'groupBy' column
grouped = df.groupby('groupBy')

for group_name, group_data in grouped:
    # Create an empty temporary table for this group with the columns of staging_lookup
    create_temp_table_query = conn_prod.execute(text(f"""CREATE TABLE tmp_lookup_{group_name} AS SELECT * FROM staging_lookup LIMIT 0;"""))
    # Insert data from the grouped dataframe into its temporary table
    group_data.to_sql(f"tmp_lookup_{group_name}", con=conn_prod, if_exists='replace', index=False)
```
THINGS I’VE TRIED AND INTERESTING FINDINGS
- Printing the master dataframe shows ALL data as expected.
- Printing the contents of each grouped dataframe shows that each group ONLY contains its own data, as expected:
```python
for group_name, group_data in grouped:
    print(group_name)
    print(group_data)
    print()
```
- Counting the number of rows in each grouped dataframe shows the correct number of rows.
- Counting the number of rows in each temporary table FROM WITHIN THE PYTHON SCRIPT and printing the result on screen shows the CORRECT number of rows. This includes the FINAL table! Snippet below:
```python
# (inside the for loop shown above)
# Insert data from the grouped dataframe into its temporary table
group_data.to_sql(f"tmp_lookup_{group_name}", con=conn_prod, if_exists='replace', index=False)
# Count the rows just inserted with SELECT COUNT(*) on the same connection
inserted_count = conn_prod.execute(text(f"""SELECT COUNT(*) FROM tmp_lookup_{group_name};"""))
print('Rows inserted: ' + str(inserted_count.first()[0]))
```
Python reports 83 rows in the TEMP_2024_05_18 table, but when I immediately switch to MySQL and run a count on the same table, it shows 0 rows.
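One situation that would produce exactly this symptom (the inserting session sees the rows, a separate client sees none) is a transaction that has not been committed yet. The sketch below reproduces that behaviour in isolation. It is only an illustration under assumptions: it uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere, the table name tmp_lookup_demo and the 83 dummy rows are made up, and whether a missing commit is actually the cause in my script is unconfirmed.

```python
import os
import sqlite3
import tempfile

# Stand-in database file; sqlite3 is used only so the sketch is runnable
# without a MySQL server. Table name and rows are hypothetical.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(db_path)  # plays the role of the Python script
reader = sqlite3.connect(db_path)  # plays the role of a separate SQL client


def count_rows(conn):
    # fetchall() exhausts the cursor so no read lock stays open afterwards.
    return conn.execute("SELECT COUNT(*) FROM tmp_lookup_demo").fetchall()[0][0]


# Create the empty table and commit so that both connections can see it.
writer.execute("CREATE TABLE tmp_lookup_demo (id INTEGER)")
writer.commit()

# The INSERT opens a transaction implicitly; nothing is committed yet.
writer.executemany(
    "INSERT INTO tmp_lookup_demo VALUES (?)", [(i,) for i in range(83)]
)

seen_by_writer_before = count_rows(writer)  # inserting session sees its own rows
seen_by_reader_before = count_rows(reader)  # other session sees committed data only

writer.commit()
seen_by_reader_after = count_rows(reader)

print("writer before commit:", seen_by_writer_before)
print("reader before commit:", seen_by_reader_before)
print("reader after commit: ", seen_by_reader_after)

writer.close()
reader.close()
```

If something similar is happening in my script, an explicit commit on the connection after the loop, or running the loop inside a `with engine.begin() as conn:` block, would be the obvious thing to test. It might also explain why only the last table is affected: in MySQL, a statement such as CREATE TABLE implicitly commits the open transaction, so each loop iteration could be committing the previous iteration's inserts while the final iteration's inserts are never committed. This is a hypothesis, not a confirmed diagnosis.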
I’ve also tried switching from engine.connect with the to_sql method to mysql.connector with a cursor. Both give the same result.
Any help, observations or alternative methods are greatly appreciated.