skip to main bit
a man slumped on his desk, from 'The Sleep of Reason Produces
      Monsters'

Oblomovka

Currently:

notes on: crash course in database design: surviving when you don’t have a dba

Okay, now I’m late arriving at Dirk Elmendorf’s crash course in database design. We join him just after he’s explained why you should care about db design, and where to bury your DBA’s body when you’ve accidentally poisoned his coffee. I think.

So first steps: you have entities, and relationships. Entities are bits of data, relationships are connections between them. You don’t have to have relationships between every entity (he’s really seen databases that are like that).

Easiest example: a one-to-one relationship. A professor and a private office, say. So one professor has one office. Then there’s one-to-many. One classroom may have many classes occur in it. Many to one: a class has many classrooms. Then there’s the complicated one, the many to many: if a class was repeated in a bunch of classes.

SQL doesn’t handle many-to-many. So you have to insert something in the middle, so you have a many-to-one, and then a one-to-many. So in this case, you create a new entity: a class schedule. A class relates to a single time and data, and a classroom relates to a single date and time.

(Editorial confession at this point: I learn everything I know about database design from futzing around with the Access visual designer. Relational DBs give me the heebie-geebiesSo I’m now hitting the extent of my knowledge. I may be mangling what Dirk is saying here- d.)

Building your database: use consistent coding standards (you putzes!). Single and plural table neames, upper case, underscore, camelcaps, special table prefixes — e.g ref_units, where “ref” is a prefix for contant reference values.

Normalisation! Hooray! Normalisation is a set of rules to designing schmea. It helps you reduce redundancy. Redundancy is bad because it causes errors, and wastes resources. (Ed: Also I think there’s a commandment against it in the Bible. Or there was until they refactored them).

It eliminates database inconsistencies -poorly laid out data can provide false statements.

Normalization – first Normal form. The easiest one: all records have to save the same “shape” – the same rows, the same data. One way to cheat is to have a fake array in your table, like “author_name1”, “author_name2”, “author_name3”. What happens when you have more than three authors? What do you do when you delete an author. Author3 sometimes becomes a comma-separated field of the rest of the authors. Yuk. Address1,2,3 is very common – but databases can store newlines now.

Second normal form. Row can’t have partial key dependences. So having a field in the employee table marked “office_location”, then you’re mixing ideas, because if you delete all the employees, your office_location will disappear too. Move it out to its own table.

Third normal form. One non-key field cannot be a fact about another non-key field. So you can’t have a book table, with a publisher *and* a publisher’s address. You need to break that out again – a separate publisher table.

Fourth normal form. A record should not contain two or more multi-valued facts which are independent. So a class table that has “room” and “professor”. You really want to move those out into their own classes, because those facts aren’t related.

(Ed: bascally, databases seem to crave to become triplets. I can see why people turn to RDF religion after this.)

Fifth normal form: information cannnot be represented by several smaller record types. So basically, once you’ve gone for the four normal forms, stop breaking out the bloody tables, because you’re GOING TOO FAR.

Normalization – you can have too much of a good thing. Start at a normalized database and work backwards as you need to optimize.

Practical tips: ideally, put data checking into a central library, and then make sure all data is run through it before it enters the database.

Unfortunately, the usual case is that you don’t have enough control over access to the database – the db is used by multiple applications, languages, and the integrity of the data is more important than performance. So you need to put data-checking inside the db.

A primary key is a non-null unique identifier for a row, a composite key is a collection of fields that give you a non-null unique identifies. A foreign key is a primary key that’s stored in a different table. A foreign key constraint means that a foreign key has to appear as a primary key in another table.

Cascading updates and deletes. Cascade allows you to handle foreign tables that your application may not even be aware of. ON UPDATE CASCADE, handles updating of the foreign key. On delete cascade deletes the row that is referencing a foreign key which is being deleted from the primary table of the foreign key. ON DELETE CASCADE is daaaaangerous. You might want it to delete logs, but you really don’t want all records of this person disappearing when you delete something. ON DELETE CASCADE SET NULL – so you can just make individual record values to disappear.

Column restaraints. Default to NOT NULL, because NULL can cause real problems with sorting, etc. UNIQUE – you can use this for multiple columns.

CHECK – stuff like CHECK( age > 0). You’re looking for data validity, but you’re not putting business logic in your database. You want to leave that to a proper DBA.

Triggers! Advanced data checking, or to scrub data – automatically lowercasing stuff. Can also handle more advanced clean up – so deleting a single table can cause a trigger to clean other tables that are related and log the event to a log table. But triggers are daaangerous for programmers, because they’re hard to maintain in the development cycle, and invisible when debugging.

Dirk’s DBA told him to talk more about indexes. Indexes are cheat-sheets for the database to improve performance. But you need to pay attention to actual queries to figure where they are needed. Index a lot, but remove the ones that are not being used. Don’t bother indexing a column which has a small amount of possible values for a lot of rows. Boolean/value ids that have a short list.

Conclusions: DB design isn’t new or cutting edge, so there’s a ton of literature out there to help you learn more about database.

Just because you don’t have are not a DBA doesn’t mean you should build on top of a poorly designed database.

Comments are closed.