I’d like to understand why the following 2 queries that select duplicates from the companies
table have different execution times. The second query with the JOIN
is executed much faster.
The query is executed on:
- a
companies
table with ~200k records - of which there are 195k unique
name
column values (each query returns ~1500 results, namely the duplicates) - there is a
nameindex
index on thename
column
WHERE (0.31142875 s):
SELECT *
FROM `companies`
WHERE `name` IN (
SELECT `name`
FROM `companies`
GROUP BY `name`
HAVING COUNT(`name`) > 1
);
JOIN (0.07034850 s):
SELECT *
FROM `companies`
INNER JOIN (
SELECT `name`
FROM `companies`
GROUP BY `name`
HAVING COUNT(`name`) > 1
) AS `duplicate_names` ON `companies`.`name` = `duplicate_names`.`name`;
Note that the subqueries are exactly the same. Why is it that in this specific setup the second query is faster?
The output from EXPLAIN
is:
WHERE query:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | companies | NULL | ALL | NULL | NULL | NULL | NULL | 195258 | 100.00 | Using where |
2 | SUBQUERY | companies | NULL | index | nameindex | nameindex | 1022 | NULL | 195258 | 100.00 | Using index |
JOIN query:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | NULL | ALL | NULL | NULL | NULL | NULL | 195258 | 100.00 | NULL | |
1 | PRIMARY | companies | NULL | ref | nameindex | nameindex | 1022 | duplicate_names.name | 1 | 100.00 | NULL |
2 | DERIVED | companies | NULL | index | nameindex | nameindex | 1022 | NULL | 195258 | 100.00 | Using index |