
Finding useless indexes

I’ll say up front that the following is not very clean, for a number of reasons; I’ll discuss some issues after the query. The query returns all indexes in a database where the cardinality is lower than roughly 30% of the row count, making it unlikely that the server will ever use that index.

SELECT s.table_name,
       concat(s.index_name,'(',group_concat(s.column_name order by s.seq_in_index),')') as idx,
       GROUP_CONCAT(s.cardinality ORDER BY s.seq_in_index) AS card,
       t.table_rows
  FROM information_schema.tables t
  JOIN information_schema.statistics s USING (table_schema,table_name)
 WHERE t.table_schema='dbname'
   AND t.table_rows > 1000
   AND s.non_unique
 GROUP BY s.table_name,s.index_name
HAVING (card + 0) < (t.table_rows / 3);

Let's discuss...
The table row count used here is accurate for MyISAM; for InnoDB it’s essentially a rough guess.
Cardinality in this context is the number of distinct values in a column. This statistic is updated by ANALYZE TABLE; otherwise it may be unavailable or simply outdated.
Since indexes can be composite (an index on a,b or more columns, rather than a single column), cardinality is reported per column, and the output shows these comma-separated; of course, for anything but the first column, the figure is relative to the preceding columns. The HAVING clause effectively uses only the first column’s cardinality, which is not statistically correct. If someone has a bright idea on how to grab or calculate a sensible cardinality figure for composite indexes, please do comment.
And yes, some selected columns (like t.table_rows) fall outside the grouping; however, they are constant within each group, so that’s “ok” for this quick hack, and MySQL allows it unless sql_mode=ONLY_FULL_GROUP_BY is enabled.

The fact that the calculated cardinality figure (for composite indexes) is dodgy may not actually matter for the purpose of this query. The point is that, generally, the server will not use an index if it would need to look at more than 30% of the rows anyway. That’s not *actually* how it works inside the server (as Peter Zaitsev can explain in great detail 😉), but as an easy rule of thumb it’s close enough to reality. It’s also not the whole story: if all the columns referenced in a query are in the index, you have a so-called covering index, and the server knows that scanning an index is quicker than scanning the table. In that case an index might still make sense, but this query can’t know about that.
Also, for very small tables the server will prefer a table scan anyway; to weed out the worst of this, only tables with more than 1000 rows are shown in the output. Again, a rather crude filter with plenty of flaws, but it may suffice for this purpose.
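Since the query relies on fresh statistics, it’s worth refreshing them first. A minimal sketch, where the table name `mytable` is a hypothetical stand-in for your own tables:

```sql
-- Refresh index statistics so information_schema.statistics reflects
-- current cardinality figures ('mytable' is a made-up name for illustration).
ANALYZE TABLE dbname.mytable;

-- Verify what the server now thinks of the indexes:
SHOW INDEX FROM dbname.mytable;
```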

Normally, what you'd do is pick out the obvious indexes on yes/no, male/female, active/deleted style columns. They're easy to spot. But if you run the above query, much more may show up. Do unused indexes hurt? Well, they slow down INSERT/UPDATE/DELETE and ALTER TABLE statements. So this query can be useful as a rough filter, then human brains can be used to decide which indexes to keep and which ones to get rid of.


15 thoughts on “Finding useless indexes”

  1. “the server will not use an index if it needs to look at >30% of the rows anyway”, so that means if an index has 4 distinct values, the server will on average only scan 25% of the rows and will therefore use this index.

  2. You will find that generally, yes it will indeed do that.
    Prove me wrong: use MyISAM, and make sure the table has a sufficient number of rows for indexes to be used at all.

  3. Hi!

    mysql> create table foo (i int, key (i)) engine = myisam;
    Query OK, 0 rows affected (0.02 sec)

    mysql> insert into foo values (1), (2);
    Query OK, 2 rows affected (0.10 sec)
    Records: 2 Duplicates: 0 Warnings: 0

    mysql> show session status like 'handler_read%';
    +-----------------------+-------+
    | Variable_name         | Value |
    +-----------------------+-------+
    | Handler_read_first    | 0     |
    | Handler_read_key      | 0     |
    | Handler_read_next     | 0     |
    | Handler_read_prev     | 0     |
    | Handler_read_rnd      | 0     |
    | Handler_read_rnd_next | 29    |
    +-----------------------+-------+
    6 rows in set (0.00 sec)

    mysql> select * from foo where i = 1;
    +------+
    | i    |
    +------+
    |    1 |
    +------+
    1 row in set (0.15 sec)

    mysql> show session status like 'handler_read%';
    +-----------------------+-------+
    | Variable_name         | Value |
    +-----------------------+-------+
    | Handler_read_first    | 0     |
    | Handler_read_key      | 1     |
    | Handler_read_next     | 1     |
    | Handler_read_prev     | 0     |
    | Handler_read_rnd      | 0     |
    | Handler_read_rnd_next | 36    |
    +-----------------------+-------+
    6 rows in set (0.00 sec)

    mysql> explain select * from foo where i = 2;
    +----+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra                    |
    +----+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
    |  1 | SIMPLE      | foo   | ref  | i             | i    | 5       | const |    1 | Using where; Using index |
    +----+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
    1 row in set (0.00 sec)

    mysql> alter table foo add j int;
    Query OK, 2 rows affected (0.08 sec)
    Records: 2 Duplicates: 0 Warnings: 0

    mysql> update foo set j = i;
    Query OK, 2 rows affected (0.26 sec)
    Rows matched: 2 Changed: 2 Warnings: 0

    mysql> explain select j from foo where i = 2;
    +----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra       |
    +----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
    |  1 | SIMPLE      | foo   | ref  | i             | i    | 5       | const |    1 | Using where |
    +----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
    1 row in set (0.00 sec)

    mysql> show session status like 'handler_read%';
    +-----------------------+-------+
    | Variable_name         | Value |
    +-----------------------+-------+
    | Handler_read_first    | 0     |
    | Handler_read_key      | 2     |
    | Handler_read_next     | 2     |
    | Handler_read_prev     | 0     |
    | Handler_read_rnd      | 0     |
    | Handler_read_rnd_next | 56    |
    +-----------------------+-------+
    6 rows in set (0.00 sec)

    mysql> select j from foo where i = 2;
    +------+
    | j    |
    +------+
    |    2 |
    +------+
    1 row in set (0.00 sec)

    mysql> show session status like 'handler_read%';
    +-----------------------+-------+
    | Variable_name         | Value |
    +-----------------------+-------+
    | Handler_read_first    | 0     |
    | Handler_read_key      | 3     |
    | Handler_read_next     | 3     |
    | Handler_read_prev     | 0     |
    | Handler_read_rnd      | 0     |
    | Handler_read_rnd_next | 63    |
    +-----------------------+-------+
    6 rows in set (0.00 sec)

    I make that 50% 🙂

    Really, the 30% rule is nonsense, there is nothing like that in the code at all. 😉

  4. An index on a low cardinality column can still be of use with InnoDB when there is a lot of skew in the column. Index selectivity is computed by sampling, so InnoDB can detect the skew and will know when the index is useful. Assume there is a table Jobs with a State column (State = {pending, ready, failed}) and that few Jobs have State = failed. Then `select * from Jobs where State = 'failed'` can use the index on State.

    Do any other storage engines use histograms to handle this case?
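The skew scenario described in this comment might look like the following; the Jobs table and State values come from the comment above, while the column definitions are assumed for illustration:

```sql
-- Hypothetical schema matching the comment's example: almost all rows
-- are 'pending' or 'ready', very few are 'failed'.
CREATE TABLE Jobs (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  State ENUM('pending','ready','failed') NOT NULL,
  KEY (State)
) ENGINE=InnoDB;

-- Despite the low cardinality of State (3 distinct values), sampling can
-- let InnoDB see the skew, so this query may still profit from the index:
SELECT * FROM Jobs WHERE State = 'failed';
```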

  5. I know it’s not in the code like that.

    And I’m sorry, but your example does not prove anything, except show very common and simple behavioural patterns of the MySQL optimiser.
    In the first instance, an index scan is used and the rows are not accessed because all accessed columns are contained in the index (The “Using index” in the EXPLAIN).

    Try the second example with more rows, please, and see how you fare then. You’re trying to make (or break 😉) a statistical point based on a sample size of 2. Come on now… not saying you’re wrong, but you really do need to provide effective proof if you want to have a case. I already noted in my initial post that this only makes sense for a table of more significant size.

  6. Indeed, I wrote that InnoDB is different. I wonder if the behaviour you describe is perhaps more of an accidental side-effect of the sampling method, rather than a designed feature. I say this because it can also go badly wrong, in particular with larger datasets. The sampling uses a fixed number of index dives (since InnoDB doesn’t know the size of the dataset). I’ve dealt with customers where the optimiser, based on the info passed back from InnoDB, made some desperately bad choices. Run ANALYZE TABLE again on the same table, get different stats, get a nice result. Run ANALYZE again (or restart the server) and it can be wrong again… that’s luck, not design 😉

    I’m not aware of other engines using sampling. None of the main/common ones anyway, as far as I know. Happy to be corrected though.

  7. Hi Arjen!

    Remember that for GROUP BYs on columns with few values, MyISAM can use the index effectively, even with few distinct values. Sparse indexes can be effective for GROUP BY work and can be combined into multi-column indexes. Whether the sparse column prefixes or suffixes the index depends on the type of queries you’re running, too…

    Also, a similar I_S query I wrote a while ago shows poor index choices based on a selectivity %:

    http://forge.mysql.com/tools/tool.php?id=85

    Cheers,

    Jay

  8. InnoDB does two types of sampling. 8 index leaf blocks are sampled to compute index cardinality stats and resampled after enough of the table has changed. The index cardinality stats in MySQL cannot represent skew, so join orders need to be hinted now and then. InnoDB also samples to compute index selectivity and that is the case where indexes on InnoDB tables can work for low cardinality columns when there is skew.

  9. The point is that it’s somewhat unpredictable. With what you’re saying, EXPLAIN doesn’t necessarily reflect what’s going to happen (next time).
    And I’ve observed reasonably simple queries taking way too long, seemingly through the wrong choice of index and such.

  10. I didn’t say you were wrong. Just that your condition `HAVING (card + 0) < (t.table_rows / 3)` is too loose, so you will get too many indexes that may actually be perfectly fine.

    CREATE TABLE `u` (
      `id` int(11) NOT NULL auto_increment,
      `name` varchar(20) default NULL,
      `email` varchar(50) default NULL,
      `address` varchar(50) default NULL,
      `age` int(11) default NULL,
      `regist_time` datetime default NULL,
      PRIMARY KEY  (`id`),
      KEY `name` (`name`)
    ) ENGINE=MyISAM AUTO_INCREMENT=5168827 DEFAULT CHARSET=utf8;
    
    select index_name, cardinality from information_schema.statistics where table_name = 'u';
    +------------+-------------+
    | index_name | cardinality |
    +------------+-------------+
    | PRIMARY    |     5068825 |
    | name       |      112640 |
    +------------+-------------+
    2 rows in set (0.01 sec)

    explain select * from u where name = 'mathias';
    +----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra       |
    +----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
    |  1 | SIMPLE      | u     | ref  | name          | name | 63      | const |   34 | Using where |
    +----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
    
  11. How many rows does that table have? Doesn’t the formula work there?
    With the cardinality showing as it is, I’d expect the formula to evaluate to true.

    Also, do you have any suggestions for improving the query?

  12. It has 5068825 rows. Obviously the cardinality of index `name` is less than 30% of the row count. But I don’t have any way to improve the query other than adding this index.

  13. It’s not accidental; you don’t actually care about the cardinality of the column, but rather the selectivity of the column as it relates to the value you’re looking for.

    For instance, let’s take a column that has 2 distinct values (say 0 and 1), where most of the values are 1; say 85%. If you’re always searching on 1 the index will never be used; if you’re always searching on 0 the index will always be used. Your formula gets indexes that have a *cardinality* that is less than 1/3 of the # of rows.

    But whether or not the index is useless really depends on what you’re searching on. Sure, it’s not the best schema design to have boolean flags anyway, but if this is a sales table and 85% of the orders are fulfilled and you’re only ever searching on the unfulfilled orders, the index is very very useful.

  14. I understand from High Performance SQL that a common mistake is to make an explicit index for a column that has an implicit index, e.g., a foreign key constraint. I would love to see a query that would uncover this type of mistake. — Charles
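A rough, untested sketch of such a query, using only information_schema.statistics: any two indexes on the same table that share the same leftmost column are candidates for review, since one of them (often an explicit index duplicating the one a foreign key constraint needed) may be redundant. The schema name 'dbname' is a placeholder:

```sql
-- Find pairs of indexes on the same table whose first column is identical;
-- one of each pair may be a redundant duplicate worth reviewing by hand.
SELECT a.table_schema, a.table_name,
       a.index_name AS index_a, b.index_name AS index_b,
       a.column_name AS first_column
  FROM information_schema.statistics a
  JOIN information_schema.statistics b
    ON a.table_schema = b.table_schema
   AND a.table_name   = b.table_name
   AND a.column_name  = b.column_name
   AND a.index_name   < b.index_name
 WHERE a.seq_in_index = 1
   AND b.seq_in_index = 1
   AND a.table_schema = 'dbname';
```

As with the main query, this is only a filter; a prefix index and a longer composite index sharing a first column can both be legitimate, so a human still needs to decide.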

Comments are closed.