I have a database table, lets call it t_customer
, that has millions of records. I need to process each records and I decide to process them in batch of 500, ordered by created_date
. After processing the record, I will insert the result of the process to another table, lets call this one t_summary_log
. The cardinality of their relationship is optional one is to one. After inserting into t_summary_log
, the particular t_customer
will be queued again at the end of the list while waiting for the remainder of the t_customer
table to be processed.
After processing the last record, it will reprocess again from the oldest record, updating the t_summary_log.last_updated_date
every iteration. The only time a t_customer
will be removed from the loop is when it’s matching entry in t_summary_log
has a status_flag=1
. Here is how I fetch the batch of records I will about to process.
SELECT
a.id,
a.username,
IFNULL(b.last_updated_date, c.created_date) AS last_updated_date
FROM
t_customer a
LEFT JOIN
t_summary_log b ON b.customer_id = a.id
WHERE
NOT EXISTS( SELECT
1
FROM
t_summary_log b0
WHERE
b0.customer_id = a.id
-- 0=on-going, 1=final, no modification needed
AND b0.status_flag = 1)
ORDER BY 3 ASC
LIMIT 500;
I am trying to avoid accessing the t_summary_log
table twice. Is there a type of JOIN
where it wont include the rows of the left table if it satisfy a certain condition on the right table? I think its a waste to perform the EXISTS
condition when it could easily check the status_flag
column right away in the join clause. Assuming that the t_summary_log
will also have millions of record and will have to scan it TWICE.