Posted on 19 Comments

SQL evil: the NATURAL join keyword

Apart from lots of interactivity, I tend to use anecdotes and humour when teaching a MySQL training course, as it really helps people to remember.

Sometimes I call something “evil” (no I’m not religious), and attach a little story to it. One example the NATURAL join. It’s not a join type (like inner, left outer, right outer), it’s a modifier keyword that can be used with any join type, and it directs the way the join is resolved.

Instead of specifying a join condition through ON, USING or a WHERE clause, the NATURAL keyword tells the server to match up any column names between the two tables, and automatically use those columns to resolve the join. For example, if two tables that you’re doing a natural join on have a column foo_id, the server would automatically use that column. But of course a server has no idea about context, so it’s indiscriminate. There can be multiple columns that have a name which is also used in the other table, they will ALL be used for the join (like a.id=b.id AND a.name=b.name)

I like explicit join syntax, I try to steer people away from comma/WHERE constructs. But NATURAL is a whole other level of evil, because it’s dangerously implicit: you can’t tell by looking at the query what columns are going to be used! If, at any point, a column name is changed, or a column added in one table, or a column is removed (or normalised away to another table again), the join can (equally magical) break by either including fewer, or more, columns.

Thus I reckon that NATURAL joins are just nasty beyond belief, and I’ve been putting this to my students: if you ever find a dev using this, a) fire them on the spot and b) tell me about it! Of course there’s an aspect of humour in there, but it gets the point across. Have you ever seen NATURAL join used anywhere? Please do tell in a comment!

Posted on 19 Comments

19 thoughts on “SQL evil: the NATURAL join keyword

  1. I’ve seen it used, but only in environments where there were very strict rules about database and field names. ie, there would never be a field called id, it would be a_id, so that you’d have something like user_id and company_id. Then again, every table had a field ending with _desc, one with _created and one with _updated as well. So not the world’s best schema, but it worked really well because everyone followed the rules.

    It’s actually quite handy in those environments actually (akin to how the USING clause is handier than the ON clause for fields with the same name). But that was back when you had to submit forms in triplicate to get a table created; nowadays it’s sadly very common to have people with no DBA experience creating tables and queries.

    By the way, changing a field name and removing a field will cause *any* join using that field to break.

  2. > By the way, changing a field name and removing a field will cause *any* join using that field to break.

    Except that with a natural join, the query may technically still work, but return wrong results (wrong in the sense of “that was not what was intended”).

  3. *nod* but either way, that’s what testing is for. If you put natural joins in your code you should make sure to have regression tests so that they’re doing what you want.

  4. I asked Krow if maybe it was worth dropping natural join from Drizzle, and he wants it to stay in because 1) he uses it all the time, 2) he wants it to implement some magic about inheritable tables, and 3) a bunch of stuff in Perl uses it.

  5. I remember you ranting to me against both comma joins and NATURAL joins.

  6. The purpose of NATURAL joins is when the DBMS properly implements referential integrity (foreign key/refs) and so if a relationship exists between two tables, a natural join would naturally use the relationship to perform a join.

    It is only “evil” in MySQL because MySQL does not implement any complete support for referential integrity… The bit that InnoDB has is not enough.

    Everyone is better off not using NATURAL and InnoDB’s partial implementation of FK rather than trying to pretend that it sometimes kinda-works.

  7. bad krow.

    And stuff in Perl uses it? Eek! Got specifics?

  8. Are you saying that another server would only pick columns that have a foreign key constraint on them?

  9. A NATURAL join would/should automatically pick an index where a foreign key constraint satisfies… The optimizer would not need to be consulted. The optimizer would typically be forced to use the index specified by the foreign key constraint.
    When a FK is not found linking the two tables in some way, a warning should be issued before attempting to do a stupid blind by-name join.

    Fun/powerful things can be done when NATURAL is done properly and hidden in the background may be reference tables, not used in the query, which are pulled in to satisfy the join: Imagine if you had a table ‘C’ which contains the PK of both tables ‘A’ and ‘B’. So we have “A <-- C --> B”. When the user requests “A NATURAL JOIN B”, he would get an many-to-many join of A and B using C.

    I haven’t done enough playing around to see what DBMS implement this fully… but I expect a number of them already do since this is the design intent “relational” databases.

  10. Indexes are an implementation issue, SQL standard does not care for them.

    For instance, most database implementations choose to use an index to keep an eye on unique constraints, i.e. they actually add an index when you specify a column as UNIQUE; but they don’t have to, it’s a choice. As long as they get the job done.

    The A-B-C trick is interesting, but frameworks like Rails already do this through convention as well.

  11. I am unsure of which particular form of evil you are referring to.

    Fortunately I have yet to come across a natural join outside of SQL documentation.

  12. The comma/WHERE construct is a separate issue from the natural join. You will find both in the documentation for pretty much any SQL database product though.

    You asked for an example for comma/where:

    SELECT ... FROM a,b WHERE ...

    Now imagine there’s there many more than just 2 tables, and lots of other stuff in the WHERE clause. Just regular real world stuff. Then, how can you easily tell from that huge query which join condition is used for a table, and even IF a join condition is used?
    A join condition is not obligatory, if you leave it out (or accidentally delete it) you just get a cartesian product (each row of one combined with each row of the other).

    If, however, you write the above as
    SELECT ... FROM a [INNER] JOIN b ON (a.id=b.id)
    then it’s much clearer. I put the INNER keyword in square brackets because it’s optional to specify: unless a join is left/right (outer), it has to be inner.
    Instead of an ON (condition) clause you can also use USING (id) if you want, provided the column names match up between the two tables (but contrary to NATURAL, the USING construct is explicit: *you* specify which columns are used).

    All this is just about good habits, coding standards, and maintainability.

  13. “that’s what testing is for”
    I disagree. Testing should not be trouble with total mayhem anytime someone renames a column. I think it is an expectation that cannot be fulfilled.

    I believe it is good for code (or good SQL) to break when conditions change in such manner that affects its results. Getting compiler errors or SQL errors is the most direct way of knowing something went wrong.

    Regards,
    Shlomi Noach

  14. Is there a specific DB implementation you know of that does so?

    I personally still prefer explicitly stating what you’re joining on. It’s clearer SQL when you specify everything (though it means *more* SQL to code). There are no ambiguities and everyone knows exactly what you mean.

    Regards,
    Shlomi Noach

  15. Hi!

    “the NATURAL keyword tells the server to match up any column names between the two tables, and automatically use those columns to resolve the join.”

    Well, that, plus that multiple ‘duplicate’ join columns are coerced into one column. So:

    SELECT * FROM a NATURAL JOIN b

    would be be resolved as

    SELECT * FROM a INNER JOIN b ON a.film_id = b.film_id

    and then the column pair {a.film_id, b.film_id} will appear as something equivalent to:

    COALESCE(a.film_id, b.film_id) AS film_id

    A similar thing happens in the named columns join (the one with USING)

    @Anthony: Hi! Sorry to point out you’re wrong here. At least there is nothing in the SQL standard that implies NATURAL join has anything to do with referential integrity. This is the text:

    ISO/IEC 9075-2:2003 (E)
    7.7

    a) If NATURAL is specified, then let common column name be a [field name] that is equivalent to the
    [field name] of exactly one field of RT1 and the of exactly one field of RT2. RT1 shall
    not have any duplicate common column names and RT2 shall not have any duplicate common column
    names. Let corresponding join columns refer to all fields of RT1 and RT2 that have common column
    names, if any.
    b) ….
    c) Let C1 and C2 be a pair of corresponding join columns of RT1 and RT2, respectively. C1 and C2 shall
    be comparable. C1 and C2 are operands of an equality operation, and the Syntax Rules of Subclause 9.9,
    “Equality operations”, apply.

    So, in short – it says columns with the same names are compared for equality.

    However, I do see what you are on about. I have written a proposal for a “foreign key join” a while ago here:

    http://rpbouman.blogspot.com/2006/04/intelligent-sql-join-syntax_09.html

  16. Hi!

    wondering why you didn’t approve my earlier comment on this post….

  17. Euh don’t see one here – and “my” is a bit vague in this context since you’re doing anonymous posts.

  18. Can anybody tell the syntax of natural join without keyword NATURAL JOIN????

    1. Abhijit, a natural join is only used if it’s indicated with the keywords NATURAL JOIN. Thus, there is no syntax without.

Comments are closed.