Horde’s schema II

Following up on my post from last week… https://openquery.com.au/blog/hordes-schema

Looking at the comments, I think people got the point. It’s not a nice design.
And yes, the large varchar id fields are indeed used as foreign keys throughout the schema, so the result is very inefficient in terms of storage, disk access, memory buffer use and query performance.
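To make that concrete, here’s a minimal sketch of the pattern in question (the table and column names are hypothetical, not Horde’s actual DDL):

    -- Hypothetical illustration of the wide-key pattern, not Horde's actual DDL.
    CREATE TABLE prefs (
        pref_uid    VARCHAR(255) NOT NULL,   -- user id: a wide varchar key...
        pref_scope  VARCHAR(16)  NOT NULL,
        pref_name   VARCHAR(32)  NOT NULL,
        pref_value  TEXT,
        PRIMARY KEY (pref_uid, pref_scope, pref_name)
    );
    -- ...so every row and every secondary index repeats up to 255 bytes
    -- of key, where a 4-byte INT surrogate would do the same job.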

One person wonders what the problem is, if performance is not an issue.
But who decides that it isn’t? The Horde family of tools can be used to provide web mail, file management and other functionality to a company or a web host. If it’s a bigger company, or a web host serving many virtual domains, I’m pretty sure that performance matters. You don’t want to waste any resources.

Interestingly, this person is also the one with whom I sometimes have spirited debates about portability. And that’s probably where the core of our difference of opinion lies.
I reckon that for serious performance, you need to ditch portability. In a highly scaled environment (involving MySQL as well as other components), you will very likely want to use vendor-specific extensions and tricks, abandoning portability. You can of course just tune things as best you can within a generic environment, but chances are you already considered the higher-level components when choosing the hardware, operating system, RAM, disk storage and other lower-level components. If you keep those generic as well, you tend to waste money – and most businesses don’t have money to waste.
Businesses want portability (like the option to use another RDBMS) as a kind of insurance. But unless they pay the very high premium for a generic underlying infrastructure, and would be willing and able to bear the great expense of a migration, it’s just never going to happen. So why would you, at the very core of your decision making process, plan for something you will never do?

But the thing is that the basic smart ideas about designing relational schemas hold true even when maintaining portability. The Horde schema can be used with a variety of db servers, and uses the old PHP PEAR DB toolkit. Perhaps that’s where things went awry. From what I can tell, the schema simply presumes that there’s no way it can have short primary keys. That could be either a PEAR DB quirk or a Horde implementation issue.

Certainly, sequences aren’t standardized: most RDBMSs implement them, but each in a slightly different way, while MySQL has auto-increment instead (in 5.0 you can also mimic a sequence with a stored function and a trigger: I blogged on this earlier). When building an abstraction layer (the key tool for db portability) you should handle all of that, so that the application doesn’t need to worry about it. In the end, you can get neat short (integer) primary keys and have efficient resource use and joins. JDBC can do it. Not a problem.
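For reference, a minimal sketch of that 5.0 trick, assuming a one-row helper table (the names are mine for illustration, not from the earlier post):

    -- Illustrative sequence emulation for MySQL 5.0; all names are hypothetical.
    CREATE TABLE my_seq (id BIGINT UNSIGNED NOT NULL);
    INSERT INTO my_seq VALUES (0);

    DELIMITER //
    CREATE FUNCTION my_nextval() RETURNS BIGINT UNSIGNED
        MODIFIES SQL DATA
    BEGIN
        -- LAST_INSERT_ID(expr) stores expr per connection, so concurrent
        -- clients each get their own value back.
        UPDATE my_seq SET id = LAST_INSERT_ID(id + 1);
        RETURN LAST_INSERT_ID();
    END//
    DELIMITER ;

    -- A single-statement trigger fills the key on insert
    -- (assuming a table items with an item_id column):
    CREATE TRIGGER items_bi BEFORE INSERT ON items
    FOR EACH ROW SET NEW.item_id = my_nextval();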

At least in this particular case, decent design apparently went out the window completely – and the likely causes are that the developers thought this was the only (or best) way to keep it portable, and/or that it wouldn’t matter for performance. That’s a pity. Both those things are innocent and unnecessary victims of the process!

9 thoughts on “Horde’s schema II”

  1. Are you only talking about the user id? Or are you also talking about replacing the full primary key with a surrogate key?

    Regarding the performance argument: Who says that they do not care more about having an easily maintainable schema, where a single error does not turn half of your data into junk because all your surrogate keys are meaningless in the grand scheme of things?

    You are right .. Horde is mostly a one-size-fits-all application, which means they need to balance things.

    Anyways, PEAR::MDB2 can emulate sequences, emulate auto-increment, or use either one depending on what the RDBMS supports. So yeah, PEAR::DB is old 🙂

    regards,
    Lukas

  2. The schema could do with a complete redesign. No sense picking on a few details 😉

    I really don’t see which things have been balanced in this case.
    Sorry, I just don’t buy your argument that they perhaps wanted to prevent errors rather than have an efficient and easily maintained schema and dataset. If a db server or app code is so dodgy that it mucks up data like that, there’s bigger trouble even with a schema like this one.
    So that very much feels like a bogus argument.

  3. Like I said, I do not think that every table should get an auto-increment column, and I also do not think that every referenced key needs to be an integer. That leads to harder-to-maintain schemas and needless joins.

    In some cases it might lead to higher storage requirements, in others it might reduce them. In some cases it might lead to better performance, and in others it might not. Benchmark your usage patterns with real-world data.

    But like you said on IRC, you were just showing a tiny snapshot of their schema, and we have not looked at the data and usage patterns inside the application at all. So there might be things seriously wrong, but I was just hesitant to agree with you and your commenters based on the little bit I saw.

  4. I guess your point was just to teach people what to look for in their keys? So what are some good considerations when deciding on your keys (aside from what normalization teaches you already)?

    – fixed-length columns are good (note that on many RDBMSs, sort memory buffers are sized by a column’s maximum possible length, not by the actual values, so make sure you do not just use a large varchar for no good reason)
    – short is good
    – integers are good
    – never (or only really rarely) changes

    These recommendations push toward ascending surrogate integer keys (see the sketch after this list). However, we must be wary of all-too-simple recommendations like these:

    – it will not be as clear what you are looking at when simply “browsing” your schema, since a reference on a sequential integer surrogate key is likely to tell you much less than a natural key would
    – you might end up needing all sorts of joins you did not previously need
    – in multi user shops that mainly work on the latest data you might run into hot spots: http://pooteeweet.org/blog/211 (though this is possibly a fairly esoteric scenario)
    – etc ..
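    A common middle ground (a sketch of mine, not from the list above): keep a short surrogate key for joins, and keep the natural key under a unique constraint for integrity and direct lookups:

        -- Illustrative only: surrogate key for joins, natural key kept unique.
        CREATE TABLE users (
            user_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- short, fixed-length, never changes
            user_uid VARCHAR(255) NOT NULL,                 -- natural id, e.g. an LDAP DN
            PRIMARY KEY (user_id),
            UNIQUE KEY uk_user_uid (user_uid)
        );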

  5. > you might end up needing all sorts of joins you did not previously need

    You mean, when you decently normalise your schema (up to, say, 3NF)?
    “need” is a funny thing. True, if you first had a spreadsheet and now have a relational design, then yes, the process of normalization will mean more joins. So what? That’s what RDBMSs were made for. The “denormalize for speed” story is old dogma. Unless you’re doing some extreme data warehousing thing, it really isn’t an issue.

    Secondly, don’t underestimate the point of data integrity and maintainability. Proper normalization gets rid of duplicate data, and that helps keep your dataset intact. I would at least presume that people care about their dataset… yet with some decisions, one wonders 😉

  6. take the user id. the horde guys are using a varchar here. i presume this is because they also want to support LDAP-style ids.

    now if they want to fetch the preferences for a given user based on their LDAP id, they can just issue a query on a single table. but if you introduce a surrogate key as the user id, you would have to do a join.
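    in sketch form (the table and column names here are made up):

        -- natural key as user id: one-table lookup
        SELECT pref_name, pref_value
        FROM prefs
        WHERE pref_uid = 'uid=jdoe,ou=people,dc=example,dc=com';

        -- integer surrogate key: the same lookup needs a join
        SELECT p.pref_name, p.pref_value
        FROM prefs p
        JOIN users u ON u.user_id = p.user_id
        WHERE u.user_uid = 'uid=jdoe,ou=people,dc=example,dc=com';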

    i really recommend the following blog post on this topic:
    http://blogs.ittoolbox.com/database/soup/archives/007327.asp

    ah well .. we obviously disagree on some finer points here and that’s fine 🙂

  7. > now if you introduce a surrogate key as the user id, you would have to do a join.

    So what?

  8. The duplication or redundancy itself is not really a problem – not if it’s properly controlled by a foreign key constraint.

    However, I find these wide keys extremely impractical. It’s error-prone too – when you need to add another column to a key, or when you decide to rename a key column, you have to sift through all the code to add or modify the corresponding join conditions. How painless and easy these types of changes are when you use surrogate keys! Not to mention the fact that you will probably never have to rename a surrogate key column – ever. Something you can’t say when using these broad natural keys.
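    For instance (an invented example, not Horde’s schema): every join on a composite natural key spells the whole key out, and any key change touches them all, whereas a surrogate join condition never changes:

        -- composite natural key: every query repeats all key columns;
        -- adding or renaming a key column means editing each such join
        SELECT o.order_date, c.cust_address
        FROM orders o
        JOIN customers c ON c.country   = o.country
                        AND c.region    = o.region
                        AND c.cust_name = o.cust_name;

        -- surrogate key: this join condition never has to change
        SELECT o.order_date, c.cust_address
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id;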

    Roland

  9. I’ll be the first to admit that parts of Horde’s db schema are less than ideal; heck, you should have seen the _old_ preferences SQL table. And while there are reasons or compromises behind most things, some of it is just flat-out suboptimal. That’s where the community (ideally) comes in. If you want to bring some ideas – or better yet, patches! – on how to improve various parts of the SQL schemas to the dev list (http://horde.org/mail/), I’d be happy to work with you on them. Or you can create enhancement tickets on http://bugs.horde.org/.

    For example, some of the groupware apps currently have long primary keys in their tables because we thought we’d need to be able to support UIDs for syncing. Turns out we need a separate UID field entirely, so really the primary keys should go back to ints (which they were originally). I haven’t done it because, well, it’s slipped under my radar and there are always other things to do. But that doesn’t mean that we think it shouldn’t be done or that it’s not important.
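    Roughly like this, to illustrate (invented names, not our actual schema):

        -- short integer primary key for joins, with the sync UID
        -- demoted to its own unique column
        CREATE TABLE events (
            event_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
            event_uid VARCHAR(255) NOT NULL,  -- e.g. an iCalendar UID for syncing
            PRIMARY KEY (event_id),
            UNIQUE KEY uk_event_uid (event_uid)
        );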

    I hope you’ll take my invitation to help out the general Horde community and perhaps get involved in development.
