Posted on 2 Comments

TEXT and VARCHAR inefficiencies in your db schema

The TEXT and VARCHAR definitions in many db schemas are based on old information – that is, they appear to be presuming restrictions and behaviour from MySQL versions long ago. This has consequences for performance. To us, use of for instance VARCHAR(255) is a key indicator for this. Yep, an anti-pattern.

VARCHAR

In MySQL 4.0, VARCHAR used to be restricted to 255 max. In MySQL 4.1 character sets such as UTF8 were introduced and MySQL 5.1 supports VARCHARs up to 64K-1 in byte length. Thus, any occurrence of VARCHAR(255) indicates some old style logic that needs to be reviewed.

Why not just set the maximum length possible? Well…

A VARCHAR is subject to the character set it’s in, for UTF8 this means either 3 or 4 (utf8mb4) bytes per character can be used. So if one specifies VARCHAR(50) CHARSET utf8mb4, the actual byte length of the stored string can be up to 200 bytes. In stored row format, MySQL uses 1 byte for VARCHAR length when possible (depending on the column definition), and up to 2 bytes if necessary. So, specifying VARCHAR(255) unnecessarily means that the server has to use a 2 byte length in the stored row.

This may be viewed as nitpicking, however storage efficiency affects the number of rows that can fit on a data page and thus the amount of I/O required to manage a certain amount of rows. It all adds up, so having little unnecessary inefficiencies will cost – particularly for larger sites.

VARCHAR best practice

Best practice is to set VARCHAR to the maximum necessary, not the maximum possible – otherwise, as per the above, the maximum possible is about 16000 for utf8mb4, not 255 – and nobody would propose setting it to 16000, would they? But it’s not much different, in stored row space a VARCHAR(255) requires a 2 byte length indicator just like VARCHAR(16000) would.

So please review VARCHAR columns and set their definition to the maximum actually necessary, this is very unlikely to come out as 255. If 255, why not 300? Or rather 200? Or 60? Setting a proper number indicates that thought and data analysis has gone into the design. 255 looks sloppy.

TEXT

TEXT (and LONGTEXT) columns are handled different in MySQL/MariaDB. First, a recap of some facts related to TEXT columns.

The db server often needs to create a temporary table while processing a query. MEMORY tables cannot contain TEXT type columns, thus the temporary table created will be a disk-based one. Admittedly this will likely remain in the disk cache and never actually touch a disk, however it goes through file I/O functions and thus causes overhead – unnecessarily. Queries will be slower.

InnoDB can store a TEXT column on a separate page, and only retrieve it when necessary (this also means that using SELECT * is needlessly inefficient – it’s almost always better to specify only the columns that are required – this also makes code maintenance easier: you can scan the source code for referenced column names and actually find all relevant code and queries).

TEXT best practice

A TEXT column can contain up to 64k-1 in byte length (4G for LONGTEXT). So essentially a TEXT column can store the same amount of data as a VARCHAR column (since MySQL 5.0), and we know that VARCHAR offers us benefits in terms of server behaviour. Thus, any instance of TEXT should be carefully reviewed and generally the outcome is to change to an appropriate VARCHAR.

Using LONGTEXT is ok, if necessary. If the amount of data is not going to exceed say 16KB character length, using LONGTEXT is not warranted and again VARCHAR (not TEXT) is the most suitable column type.

Summary

Particularly when combined with the best practice of not using SELECT *, using appropriately defined VARCHAR columns (rather than VARCHAR(255) or TEXT) can have a measurable and even significant performance impact on application environments.

Applications don’t need to care, so the db definition can be altered without any application impact.

It is a worthwhile effort.

Posted on 2 Comments

2 thoughts on “TEXT and VARCHAR inefficiencies in your db schema

  1. I think it’s worth mentioning that the number of bytes allocated to VARCHAR in a table is shared across all VARCHAR columns in that table. So if you do create a 16k utf8mb4 VARCHAR column in place of TEXT, you cannot create any more VARCHAR columns in that table.

    I also think a little more science needs to be offered around VARCHAR(255). Your statement reads like 255 is an inherently inefficient choice. And whilst it does indicate a grip on legacy choices of days past, (without real consideration for what length is actually needed,) it is no less efficient than a VARCHAR(200) for any given character set.

    As I understand it, it’s only if you can keep the VARCHAR column under 255 bytes will you get a the slight benefit in efficiency. So for utf8mb4 character set, this would mean an effort to choose a length 60 characters or less. Above that, the next step is the limit on indexing of VARCHARs — for utf8mb4 this would be a motivation to choose 190 characters or below.

    This is my own understanding, and would appreciate being corrected if I’m wrong.

    1. Hi Courtney

      Yep there is a maximum rows size (excluding *TEXT/*BLOB type columns), however in many/most schemas there is generally only one really long item, and perhaps a few of some K. This is even the case with WordPress, where you have the main post content (long), and excerpt (shorter), and then topics and other items (quite short). Sometimes it’s feasible to have it all within the row size, other times you’d want to have the long item as TEXT but everything else inside the row. This still is more efficient, also since most lookups would be on meta-data.

      You are also correct on the VARCHAR(255), it’s indeed not about 255 being inherently inefficient, it simply indicates that no consideration has gone into the matter of “how much do we need”, and that’s the key.
      It’s important to look at data, also on an ongoing basis, and adjust the schema when and where needed. It’s not a static aspect of the infrastructure.
      And yes, using prefix indexing is also very useful.

Comments are closed.