I have a table with 36 million records and I need to update fields based on some conditions, but each query takes about 10 minutes and I have many queries to run:
MERGE INTO account b
USING ( SELECT id, COUNT(Account) as numAcc
        FROM account WHERE Status = 'ACTIVATED' GROUP BY id) ba
ON (ba.id = b.id ) WHEN MATCHED THEN UPDATE SET b.numAcc = ba.numAcc;
The above is a sample query; the other queries look the same but with different conditions. The following is the explain plan:
-------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | 36M| 892M| | 610K (1)| 00:00:24 |
| 1 | MERGE | ACCOUNT | | | | | |
| 2 | VIEW | | | | | | |
|* 3 | HASH JOIN | | 36M| 19G| 511M| 610K (1)| 00:00:24 |
| 4 | VIEW | | 998K| 500M| | 365K (1)| 00:00:15 |
| 5 | SORT GROUP BY | | 998K| 502M| | 365K (1)| 00:00:15 |
| 6 | VIEW | VW_DAG_1 | 25M| 12G| | 365K (1)| 00:00:15 |
| 7 | SORT GROUP BY | | 25M| 1025M| 1346M| 365K (1)| 00:00:15 |
|* 8 | TABLE ACCESS FULL| ACCOUNT | 25M| 1025M| | 91408 (1)| 00:00:04 |
| 9 | TABLE ACCESS FULL | ACCOUNT | 36M| 2162M| | 91501 (1)| 00:00:04 |
-------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("BA"."ID"="B"."ID")
8 - filter("Status "='ACTIVATED')
Sample Data:
id account status month
3141243131 1162278310 CLOSED 1
3141243131 1014532510 ACTIVATED 1
3141243131 1094289210 ACTIVATED 1
3141243131 1162278310 CLOSED 2
3141243131 1094289210 ACTIVATED 2
3141243131 1014532510 ACTIVATED 2
3141243131 1094289210 ACTIVATED 3
3141243131 1162278310 CLOSED 3
3141243131 1014532510 ACTIVATED 3
5432523522 1111231231 CLOSED 1
5432523522 1014532510 ACTIVATED 1
5432523522 1094589210 ACTIVATED 1
5432523522 1111231231 CLOSED 2
5432523522 1094289210 ACTIVATED 2
5432523522 1094589210 ACTIVATED 2
5432523522 1094289210 CLOSED 3
5432523522 1111231231 CLOSED 3
5432523522 1094589210 ACTIVATED 3
The slowness is almost certainly not due to row retrieval in this case, so using ROWID or some other nested-loops join is not going to help (it will actually make it worse – a full scan with hash join is in fact the best option here, as you aren’t limiting the rows significantly with a high cardinality predicate). Other than the fact that you aren’t using any parallelism, your execution plan looks good for what you’re doing.
The problem with updates
Your real problem is the # of rows you are updating: updates require a lot of work.
- The blocks have to be read in current mode rather than in consistent or direct-path mode, which means reading current blocks from disk, buffering them in the cache, and going through the various concurrency mechanisms surrounding cache buffers.
- It has to lock the rows (and wait on any blocking locks held by other users).
- It has to reformat the block.
- It has to generate undo records.
- It has to generate redo records and wait for those to flush to disk.
- If the volume is high enough, your cache small enough and there are too few DBWR processes, DBWR can fall behind flushing those dirty buffers; you then run out of available buffers in the cache and the update stalls, with DBWR as your bottleneck.
- Lastly, if you are updating any indexed columns, you have to modify the index blocks as well.
At large volumes, updates (and deletes) are your enemy, to be avoided as much as possible.
#1 Reduce Row Updates
So, your first priority is to reduce the # of rows updated. As written, your query will update every row of every id that has the given status, whether or not the value actually needs to change:
MERGE INTO account b
USING ( SELECT id, COUNT( Account) as numAcc
FROM account
WHERE Status = 'ACTIVATED'
GROUP BY id) ba
ON (ba.id = b.id )
WHEN MATCHED THEN UPDATE SET b.numAcc = ba.numAcc;
1. Don’t update at all
If 100% of those rows have the wrong or a NULL numAcc and you are populating it initially, then you must update them all. But it would be far better to rearchitect your loading process so that the correct numAcc value is set in the initial INSERT, so you never have to update it. If you have to, use a work table, then join it to account and use that to load (INSERT SELECT) the final table. With large row volumes, this approach is always better than a large-scale update operation. The best kind of update is no update at all.
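As a rough sketch of that idea (account_stage and numacc_work below are hypothetical names, and the column list is taken from your sample data):

CREATE TABLE numacc_work AS                 -- work table with the pre-computed counts
  SELECT id, COUNT(account) AS numAcc
  FROM account_stage
  WHERE status = 'ACTIVATED'
  GROUP BY id;

INSERT /*+ APPEND */ INTO account (id, account, status, month, numAcc)
  SELECT s.id, s.account, s.status, s.month, w.numAcc   -- direct-path load with numAcc already set
  FROM account_stage s
  LEFT JOIN numacc_work w ON w.id = s.id;

That way the value is correct from the start and the 36-million-row update never has to happen.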
2. Update fewer rows
If, however, only some (a small percentage) of those rows need changing, and most already have the correct value, then you need to weed out the rows that don't need changing within the USING clause (don't try to do it simply with the WHERE sub-clause underneath UPDATE). Usually that requires pre-joining your source and target tables within the USING clause, but in your case it's the same table, so we don't need a join; just compare the current value with the new one using an analytic function and an outer query block (with DISTINCT so the MERGE sees only one source row per id):
MERGE INTO account b
USING (SELECT DISTINCT id, numAcc   -- one row per id, avoids ORA-30926 from duplicate source rows
FROM (SELECT id, numAcc old_numAcc, COUNT(*) OVER (PARTITION BY id) as numAcc
FROM account
WHERE Status = 'ACTIVATED')
WHERE NVL(old_numAcc,-1) != numAcc) ba
ON (ba.id = b.id )
WHEN MATCHED THEN UPDATE SET b.numAcc = ba.numAcc;
#2 Eliminate any dependent objects impacting updates
Once you have reduced the # of rows you are actually updating, the second priority is to eliminate any constraint or sub-object slowing it down. This would be any index on the numAcc column (there really shouldn't be an index on a column like that, but check to be sure), as well as any foreign key (FK) on the column. Also look for triggers that fire on update. All these will significantly slow down an update operation. Disable them if possible.
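The data dictionary will tell you what is there; something along these lines (the object names in the ALTER statements are made up for illustration):

SELECT index_name                            -- any index covering numAcc
FROM user_ind_columns
WHERE table_name = 'ACCOUNT' AND column_name = 'NUMACC';

SELECT trigger_name                          -- any trigger that fires on UPDATE
FROM user_triggers
WHERE table_name = 'ACCOUNT' AND triggering_event LIKE '%UPDATE%';

ALTER INDEX account_numacc_ix UNUSABLE;                     -- hypothetical index name
ALTER TRIGGER account_upd_trg DISABLE;                      -- hypothetical trigger name
ALTER TABLE account DISABLE CONSTRAINT account_numacc_fk;   -- hypothetical FK name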
#3 Employ parallelism for the update
The third priority is to throw muscle at the operation. Something like this should be parallelized with PDML (parallel-DML) to take advantage of multiple CPUs. This requires enabling parallel DML at the session level as well as hinting it:
ALTER SESSION ENABLE PARALLEL DML;
MERGE /*+ parallel(b,16) */ INTO account b
USING (SELECT DISTINCT id, numAcc   -- one row per id, avoids ORA-30926 from duplicate source rows
FROM (SELECT id, numAcc old_numAcc, COUNT(*) OVER (PARTITION BY id) as numAcc
FROM account
WHERE Status = 'ACTIVATED')
WHERE NVL(old_numAcc,-1) != numAcc) ba
ON (ba.id = b.id )
WHEN MATCHED THEN UPDATE SET b.numAcc = ba.numAcc;
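To sanity-check that the DML was actually parallelized for the session, you can look at v$pq_sesstat afterwards (statistic names can vary slightly by version):

SELECT statistic, last_query, session_total
FROM v$pq_sesstat
WHERE statistic IN ('DML Parallelized', 'Queries Parallelized');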
Finally, I’ll comment that I tried to answer your question about performance. However, it should also be pointed out that you really shouldn’t be doing this in the first place. Stamping a row with the # of neighboring rows is a violation of normalization rules and creates ambiguity. What happens if you add a new row? Are you going to redo the update operation every time so that the value is in sync? We need to avoid situations like this. Aim to never have a column in a table that depends on other rows for its value, especially other rows from the same table. It should be solely determined by the key of its own row (not even a non-key column from its own row) and that alone.
So, you're much better off not having a numAcc column in your table at all. If you need to know the count, compute it with COUNT(*) at query time. Remember you can use analytical functions like COUNT(*) OVER (PARTITION BY id) to easily get this denormalized in a query without having to do a GROUP BY and change the granularity of your results.
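For example, a query can derive the value on the fly instead of reading a stored column (the column list is taken from your sample data):

SELECT id, account, status, month,
       COUNT(CASE WHEN status = 'ACTIVATED' THEN 1 END)
         OVER (PARTITION BY id) AS numAcc    -- computed at query time, always in sync
FROM account;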
Rather than aggregating and correlating on the id column, you can try using an analytic function and correlating on the ROWID pseudo-column (which is effectively a pointer to the row):
MERGE INTO account dst
USING (
SELECT activated_count
FROM (
SELECT numacc,
COUNT(CASE WHEN Status = 'ACTIVATED' THEN account END)
OVER (PARTITION BY id) AS activated_count
FROM account
)
WHERE activated_count > 0 -- Your query effectively includes this filter but I'm not sure
-- whether you intended that behaviour or not.
AND numacc != activated_count
) src
ON (
dst.ROWID = src.ROWID
)
WHEN MATCHED THEN
UPDATE SET dst.numAcc = src.activated_count;
fiddle