70-442: Normalization (LD, 4)

From Chapter 4 of “Pro SQL Server 2005 Database Design and Optimization”, by Louis Davidson (LD).

Goal is to eliminate data redundancy (duplication) and promote data integrity by breaking entities into simpler forms (LD likes his tables thin – with no more than two primary attributes, see 4th form :0).

Normalization optimizes storage and updates but can make querying more complicated (more tables – but then again simpler more intuitive tables. Probably ideal solution for larger systems is a highly normalised relational database paired with a flattened reporting database (or a load of views, but then you have to maintain those).

 

1st Normal Form

All attributes must be atomic:  so no arrays or delimited data – unless the  attribute  always has the same number of values, and these are never used seperately (like a coordinate – but then who is to say you migth not need to sum all the Y’s for a given X – but then knowing the format is always x,y you could of course work this out). General rule of thumb: break down values only as far as you need to deal with them seperately. 

All instances of an entity must contain the same number of values: of course all will have the same number of columns, but if you see columns like Payment1, Payment2 – or a real world example NoteCustomer, NoteAdmin – then what you really have is an entity (order in this example) that can have notes associated with it, and initially you have identified two types, but you migth find more, like NoteFulfilmentCompany. So Notes is an attribute (identified during the conceptual phase and recorded in plural to indicate that there can be >1) and the order entity should not contain several values for this (so of course split this into a seperate entity).

All Occurences of an Entity Must Be Different: and not just via an added artificial key. Each instance must be unique in it’s own right. I think the actual enforcement of this is one of the main new things to take away from LD so far. So eventually LD will come round to recommending an artificial (surrogate) PK on each table, but with a unique natural (most often compound) key enforced with a unique contraint.

Clues You are Not In 1st Normal Form: delimited text data, attributes with numbers at their end or very similar names.

 

2nd, 3rd Normal Form – Relationships between Attributes

Aim is to minimize funtional dependency between attributes, and the main principle is “Non-key attributes must provide a detail about the key, the whole key and nothing but the key”. 

2nd form says that each non-key attribute must describe the Entire Key: so say you have a OrderId + BookId composite key in your OrderBook table. If this table then had a BookTile attribute we would be in quite obvious violation of the 2nd form. Each attribute must describe the whole key. SoldPrice would be acceptable as it describes the whole key (also it is normally a good idea to record things like this as the actual price a book was sold for (regardless of the recorded RRP) could vary from day to day.

3rd normal form says a non-key attribute cannot describe another non-key attribute: so this would either be because you missed an entity and are trying to roll it into another (LD examle is extended details about the publisher in a book table) or because you are summarising data. The latter can be justified later on for performance optimization reasons, but has no place during the logical phase.

Clues that you are not in 3rd (and hence 2nd) Form: repeating key attribute prefixes (2nd form, so in example above we had BookId and then Late BookTitle), multiple attributes with the same prefix (3rd form – so we have PublisherThis and PublisherThat) or repeating groups of data or summary rows.

 

Boyce-Codd Normal Form – Relationships between Attributes

A better constructed replacement for 2nd and 3rd form – but understanding those will help understand this one!

Basically it extends 2+3 to take into account multiple keys, and says all attributes must be fully dependant one a key. So same as above, but you could have to natural unique keys. Note of course that surrogate keys are outside of consideration for all this. 

Also Every Determinant must be a key – so this relates to some mathematical definition of a “determinant”: an attribute or combination of attributes, of which any other attribute or combination thereof is functionally dependant. So Boyce-Codd also tells us what must be a key.

The crux of both Boyce-Codd and 2+3 is that every key must uniquely identify the entity and every non-key must describe then entity (BC extends this to a key, but as any key will always identify the entity this makes sense).

 

Fourth Normal Form, Multivalued Dependencies

If you have compound keys with >2 parts you should consider the 4th form, and what you have to look out for is the relationship between those parts. Note you will often hear the term “ternary relationship” when dealing with this:  “ternary: consisting of or involving three; threefold; triple”

It’s difficult to summarise this in a simple sentence, but “break ternary relations into two entities relating to a third” is the main thing to take away.

Examples are needed I think, and indeed LD’s section on this takes the shape of two examples. In one we have a ClassAssignment table with a compound key of three FKs: Class, Teacher and Student. The 4th normal says that “there must not be more than one multivalued dependency between an attribute and the key of the entity” – and in our example we have exactly that: depending on how you design this a Class can have >1 Teachers (possible) and >1 Students, a Teacher can have >1 Students and >1 Classes and of course a Student can and will have >1 Teachers and >1 Classes.

So updating and changing any one of the three parts in this key will alwyas affect the other two parts. I cannot change the teacher for a class in this table, without having to consider all the Students for that Teacher/Class combination. Might not sound like a big deal, but if you think about it: we want Mr Robinson to teach Advanced Math 3 in stead of Mr Picklehead – but in our table we would have to update 32 rows – one for each (miserable) Student in that class. And also we will have duplicated values for the Class and Teacher columns in each of those rows.

So what is the solution? Well as it said at the beginning, if you can then split this into two entities relating to a third, but which third? I think it’s hard to give a surefire coverall step-by-step guide on how to solve these problems. I think personally I would say look for the main function or aim – what are we after or trying to achieve? We are trying to organise classes! Possibly you could also look for something that would render all the other parts pointless, so for instance you could take away Students but you could still make sense of the Teachers teaching Class relation, likewise remove Teacher and you still have Students attending Classes, but without Classes, Students and Teachers would not have any purpose. Either way – you identify Class as the third or main seperate entity – so start from here and we end up with a Class table, and then two many-to-many tables, a TeacherClass and a StudentClass. This ensures that we never have >1 multivalued dependency between key and attribute in any of the three entities involved.

Another example LD gives is Sessions with a Room and a Presenter, where the Session might span >1 room. Again I think you can apply the above – but there is some scope for common sense and experience in here.

Finally let’s quickly note that LD points out that many see the 3rd (or Boyce-Codd form) as sufficient normalization. This is probably in no small part thanks to the fickleness of the 4th form and the difficulty of stating the aim in one sweet easy to understand sentence. I think the approach to take is to realise that there are serious problems to avoid (the 32 student rows) and always bear this in mind when you see a 3 or more part compound key.

 

Fifth Normal Form, when you can’t do the above!

So – not all ternary relations can be broken down into our 1-to-2 solution above, fifth form tells us that when this is the case then if it is possible to decompose the relation into seperate entities, we should do so!

How much more vague can we get? I think this is where real world experience becomes essential, so for now let’s just note that people like DL are telling us that 4th and 5th form are important, so next time you are getting ternary make sure to whip out your DL book and scrutinize the hell out of what you are doing.

These are the sort of hard learned lessons that you sometimes feel you have to learn the hard way I guess. But just keep in mind, if you can break it down into something simpler, you probably should!

 

Denormalization

So this is optimizing for performance, LD tells us to never do this during the logical phase (we must remain pure), but to not start crying when we have to do this during implementation.

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: