70-442: Normalization (LD, 4)

From Chapter 4 of “Pro SQL Server 2005 Database Design and Optimization”, by Louis Davidson (LD).

The goal is to eliminate data redundancy (duplication) and promote data integrity by breaking entities into simpler forms (LD likes his tables thin – with no more than two primary attributes, see 4th form :0).

Normalization optimizes storage and updates but can make querying more complicated (more tables – but then again simpler, more intuitive tables). Probably the ideal solution for larger systems is a highly normalised relational database paired with a flattened reporting database (or a load of views, but then you have to maintain those).

 

1st Normal Form

All attributes must be atomic: so no arrays or delimited data – unless the attribute always has the same number of values, and these are never used separately (like a coordinate – but then who is to say you might not need to sum all the Y’s for a given X? Then again, knowing the format is always x,y you could of course work this out). General rule of thumb: break down values only as far as you need to deal with them separately.
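A quick T-SQL sketch of the atomicity rule (the table and column names here are my own illustration, not from LD):

```sql
-- Violates 1st form: several phone numbers delimited in a single column,
-- so you cannot index, constrain or join on an individual number.
CREATE TABLE Customer
(
    CustomerId   int NOT NULL PRIMARY KEY,
    PhoneNumbers varchar(200) NULL -- e.g. '555-0100;555-0101'
);

-- 1st form: one atomic value per row in a child table.
CREATE TABLE CustomerPhoneNumber
(
    CustomerId  int         NOT NULL REFERENCES Customer (CustomerId),
    PhoneNumber varchar(20) NOT NULL,
    PRIMARY KEY (CustomerId, PhoneNumber)
);
```

With the child table each number is its own value, and "how many numbers does a customer have" is just a row count rather than string parsing.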

All instances of an entity must contain the same number of values: of course all will have the same number of columns, but if you see columns like Payment1, Payment2 – or a real world example NoteCustomer, NoteAdmin – then what you really have is an entity (order in this example) that can have notes associated with it, and initially you have identified two types, but you might find more, like NoteFulfilmentCompany. So Notes is an attribute (identified during the conceptual phase and recorded in plural to indicate that there can be >1) and the order entity should not contain several values for this (so of course split this into a separate entity).
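A rough sketch of the fix (the NoteType column and the names are my guesses, not LD’s):

```sql
-- Instead of NoteCustomer, NoteAdmin, ... columns on the Order table,
-- notes become a child entity, so a new note type needs no schema change.
CREATE TABLE OrderNote
(
    OrderId  int           NOT NULL REFERENCES [Order] (OrderId),
    NoteType varchar(30)   NOT NULL, -- 'Customer', 'Admin', 'FulfilmentCompany', ...
    NoteText varchar(1000) NOT NULL,
    PRIMARY KEY (OrderId, NoteType)  -- assumes one note of each type per order
);
```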

All Occurrences of an Entity Must Be Different: and not just via an added artificial key. Each instance must be unique in its own right. I think the actual enforcement of this is one of the main new things to take away from LD so far. So eventually LD will come round to recommending an artificial (surrogate) PK on each table, but with a unique natural (most often compound) key enforced with a unique constraint.
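In implementation terms the pattern ends up something like this (a sketch with assumed names; LD’s actual examples differ):

```sql
CREATE TABLE Book
(
    BookId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Book PRIMARY KEY,   -- artificial (surrogate) key
    Isbn   char(13)     NOT NULL,
    Title  varchar(100) NOT NULL,
    CONSTRAINT AK_Book_Isbn UNIQUE (Isbn) -- natural key enforced with a unique constraint
);
```

The surrogate keeps joins and FKs simple, while the unique constraint stops two rows from being the same book in disguise.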

Clues You are Not In 1st Normal Form: delimited text data, attributes with numbers at their end or very similar names.

 

2nd, 3rd Normal Form – Relationships between Attributes

Aim is to minimize functional dependency between attributes, and the main principle is “Non-key attributes must provide a detail about the key, the whole key and nothing but the key”.

2nd form says that each non-key attribute must describe the Entire Key: so say you have an OrderId + BookId composite key in your OrderBook table. If this table then had a BookTitle attribute we would be in quite obvious violation of the 2nd form. Each attribute must describe the whole key. SoldPrice would be acceptable as it describes the whole key (also it is normally a good idea to record things like this, as the actual price a book was sold for (regardless of the recorded RRP) could vary from day to day).
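Sketched out (column types are my assumption):

```sql
-- Violates 2nd form: BookTitle depends on BookId alone,
-- not on the whole (OrderId, BookId) key.
CREATE TABLE OrderBook
(
    OrderId   int          NOT NULL,
    BookId    int          NOT NULL,
    BookTitle varchar(100) NOT NULL, -- belongs on the Book table
    SoldPrice money        NOT NULL, -- fine: describes this order/book combination
    PRIMARY KEY (OrderId, BookId)
);
```

Dropping BookTitle (it lives on Book and is reachable via BookId) puts the table in 2nd form; SoldPrice stays.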

3rd normal form says a non-key attribute cannot describe another non-key attribute: so this would either be because you missed an entity and are trying to roll it into another (LD’s example is extended details about the publisher in a book table) or because you are summarising data. The latter can be justified later on for performance optimization reasons, but has no place during the logical phase.
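A sketch of the missed-entity case (names assumed, loosely following LD’s publisher example):

```sql
-- Violates 3rd form: PublisherCity describes PublisherName,
-- a non-key attribute, not the book itself.
CREATE TABLE Book
(
    BookId        int          NOT NULL PRIMARY KEY,
    Title         varchar(100) NOT NULL,
    PublisherName varchar(100) NOT NULL,
    PublisherCity varchar(100) NOT NULL
);

-- 3rd form: the missed entity gets its own table,
-- and Book would carry a PublisherId foreign key instead.
CREATE TABLE Publisher
(
    PublisherId int          NOT NULL PRIMARY KEY,
    Name        varchar(100) NOT NULL,
    City        varchar(100) NOT NULL
);
```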

Clues that you are not in 3rd (and hence 2nd) Form: repeating key attribute prefixes (2nd form – so in the example above we had BookId and then later BookTitle), multiple attributes with the same prefix (3rd form – so we have PublisherThis and PublisherThat), or repeating groups of data or summary rows.

 

Boyce-Codd Normal Form – Relationships between Attributes

A better-constructed replacement for the 2nd and 3rd forms – but understanding those will help you understand this one!

Basically it extends 2+3 to take into account multiple keys, and says all attributes must be fully dependent on a key. So same as above, but you could have two natural unique keys. Note of course that surrogate keys are outside of consideration for all this.

Also, Every Determinant must be a key – this relates to the mathematical definition of a “determinant”: an attribute, or combination of attributes, on which any other attribute or combination thereof is functionally dependent. So Boyce-Codd also tells us what must be a key.

The crux of both Boyce-Codd and 2+3 is that every key must uniquely identify the entity and every non-key must describe the entity (BC extends this to any key, but as any key will always identify the entity this makes sense).

 

Fourth Normal Form, Multivalued Dependencies

If you have compound keys with >2 parts you should consider the 4th form, and what you have to look out for is the relationship between those parts. Note you will often hear the term “ternary relationship” when dealing with this:  “ternary: consisting of or involving three; threefold; triple”

It’s difficult to summarise this in a simple sentence, but “break ternary relations into two entities relating to a third” is the main thing to take away.

Examples are needed I think, and indeed LD’s section on this takes the shape of two examples. In one we have a ClassAssignment table with a compound key of three FKs: Class, Teacher and Student. The 4th normal form says that “there must not be more than one multivalued dependency between an attribute and the key of the entity” – and in our example we have exactly that: depending on how you design this, a Class can have >1 Teachers (possible) and >1 Students, a Teacher can have >1 Students and >1 Classes, and of course a Student can and will have >1 Teachers and >1 Classes.

So updating and changing any one of the three parts in this key will always affect the other two parts. I cannot change the teacher for a class in this table without having to consider all the Students for that Teacher/Class combination. Might not sound like a big deal, but if you think about it: we want Mr Robinson to teach Advanced Math 3 instead of Mr Picklehead – but in our table we would have to update 32 rows – one for each (miserable) Student in that class. We will also have duplicated values for the Class and Teacher columns in each of those rows.

So what is the solution? Well, as said at the beginning, if you can then split this into two entities relating to a third – but which third? I think it’s hard to give a surefire, catch-all step-by-step guide on how to solve these problems. Personally I would say look for the main function or aim – what are we after or trying to achieve? We are trying to organise classes! Possibly you could also look for the part that would render all the others pointless: you could take away Students and still make sense of the Teachers-teaching-Classes relation; likewise remove Teacher and you still have Students attending Classes; but without Classes, Students and Teachers would not have any purpose. Either way, you identify Class as the third or main separate entity – so start from here and we end up with a Class table, and then two many-to-many tables, a TeacherClass and a StudentClass. This ensures that we never have >1 multivalued dependency between key and attribute in any of the three entities involved.
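As a sketch (column names and types are my assumption; Teacher and Student tables are presumed to exist):

```sql
-- The three-part ClassAssignment key splits into Class plus two resolution tables.
CREATE TABLE Class
(
    ClassId int          NOT NULL PRIMARY KEY,
    Name    varchar(100) NOT NULL
);

CREATE TABLE TeacherClass
(
    TeacherId int NOT NULL REFERENCES Teacher (TeacherId),
    ClassId   int NOT NULL REFERENCES Class (ClassId),
    PRIMARY KEY (TeacherId, ClassId)
);

CREATE TABLE StudentClass
(
    StudentId int NOT NULL REFERENCES Student (StudentId),
    ClassId   int NOT NULL REFERENCES Class (ClassId),
    PRIMARY KEY (StudentId, ClassId)
);
```

Swapping Mr Picklehead for Mr Robinson is now a change to one TeacherClass row, and the 32 StudentClass rows are untouched.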

Another example LD gives is Sessions with a Room and a Presenter, where the Session might span >1 room. Again I think you can apply the above – but there is some scope for common sense and experience in here.

Finally let’s quickly note that LD points out that many see the 3rd (or Boyce-Codd) form as sufficient normalization. This is probably in no small part thanks to the fickleness of the 4th form and the difficulty of stating its aim in one sweet, easy-to-understand sentence. I think the approach to take is to realise that there are serious problems to avoid (the 32 student rows) and always bear this in mind when you see a compound key with three or more parts.

 

Fifth Normal Form, when you can’t do the above!

So – not all ternary relations can be broken down into our 1-to-2 solution above. Fifth form tells us that when this is the case, but it is still possible to decompose the relation into separate entities, we should do so!

How much more vague can we get? I think this is where real world experience becomes essential, so for now let’s just note that people like LD are telling us that 4th and 5th form are important, so next time you are getting ternary make sure to whip out your LD book and scrutinize the hell out of what you are doing.

These are the sort of hard learned lessons that you sometimes feel you have to learn the hard way I guess. But just keep in mind, if you can break it down into something simpler, you probably should!

 

Denormalization

So this is optimizing for performance, LD tells us to never do this during the logical phase (we must remain pure), but to not start crying when we have to do this during implementation.

70-442: Conceptual Data Modeling (LD, 3)

From Chapter 3 of “Pro SQL Server 2005 Database Design and Optimization”, by Louis Davidson (LD).

So this is about understanding requirements (and the old “what the user does not know they want” conundrum and other related problems). I have been involved with this kind of work plenty of times, and it is very interesting to hear what an old dog like LD has to say about it.

LD mentions tools like UML, Visio and the MS Solutions Framework. However, he also states that spreadsheets do quite a good job for just gathering and tracking requirements. Not sure I agree – if you are just going to list stuff, do it online in a portal of some sort. As for his database models he sticks to IDEF1X.

 

Documenting the Process

LD points out the importance of recording and maintaining information about the process and the decisions made. There are many good reasons why this is a good idea. Looking back at projects I have been involved with, I think this was often very much lacking, and I think tools like Zoho Projects would be very useful as they are public facing and so allow the users/clients to be directly involved and kept up to date.

 

Requirements Gathering

So this is where we have the wonderful world of client interview/meetings, existing systems (beware) and whatever else might be brought to the table.

 

Identifying Objects and Processes

Entities generally represent people, places, objects, ideas, or things referred to grammatically as nouns. Other less obvious candidates are things like Events, Documents (Records, Journals). You will also have some system specific entities like Audit Trails etc, but these do not belong in a conceptual model.

Relationships between Entities – here we have of course the one-to-N and many-to-many. LD goes through how you can extract these from reading a specification, don’t think we need the details of this here.

Identifying Attributes and Domains – again LD goes through how to pinpoint these. Quickly, the main things to look out for are: Identifiers (NI Numbers, Licence Registrations, Serial Nos – note that these will not necessarily make keys), Descriptive Information (not surprisingly), Locators, Related Information and Values. Really I think this is beyond what can be taught – most will be obvious and the rest will come from either trial and error or experience.

Identifying Business Rules and Processes – there is of course no straightforward or easy way of making sure you have all the business rules; generally look out for words like “must”, “will”, “occur”, “have to”. Again – you will probably know. Really I think the key thing here is actively engaging the client beyond just the bare minimum “we want this and that” discussion, or “it used to do such and such”. Most times the client’s final goal is vaguely defined as “making x easier”, “improving y” or “cutting z”. Ironing out the specifics of what needs to be built is a collaborative process, and the first step for the systems designer (which oftentimes ends up meaning us, the developers) is to get to know the client’s business and field, if not completely, then very, very well!

Finally LD correctly points out that the above is very often, if not always, an iterative process. It takes more than just one client meeting.

70-442: Data Modeling (LD, 2)

From Chapter 2 of “Pro SQL Server 2005 Database Design and Optimization”, by Louis Davidson.

Louis Davidson (LD) describes the process of converting a list of requirements into a blue print for the actual database design work.

LD recommends the IDEF1X modelling language, but the key thing is finding one you like and learning it. Interestingly, he says that the Management Studio Database Diagrams feature is not a modeling tool – which of course makes sense, as it works directly on the underlying tables. I have however previously used it against a mock database, and I think this saves time, as you can create the core schema, discuss it with colleagues, amend it very quickly and then generate your initial CREATE script from it. The problem with this approach is that it tangles the implementation up with the conceptual and logical phases (but in most busy production environments you will find database diagrams a godsend – as a design tool as well!).

At this stage the process is concerned with entities and attributes, not tables and columns – however we are naming stuff and LD gives some good advice on Naming Conventions:

  • Entity names should never be plural (we agree – once you start using ORMs you will notice this yourself anyway).
  • Don’t use prefixes for either entities or attributes. Prefixes like “tbl” or “int” (Hungarian notation) are not only ugly, but in T-SQL it is mostly quite obvious what you are dealing with.
  • Attribute names should also not be plural – but LD recommends using plural during the modeling phase as a reminder that an attribute covers several details and needs to be split into several later (e.g. Addresses).
  • For all names it goes that you should use CamelCase, avoid special_characters and keep things short, sweet and precise.

Modeling Attributes

PK Attributes: At this stage add a simple surrogate key (i.e. a made up key, auto id) and then mark all natural candidates as natural keys (will most likely become indexes).

Alternate Keys: An alternate key is a primary key candidate – so this needs to be unique for the entity.

Foreign Keys: AKA “migrated attributes”. Normally named Id. Louis Davidson’s IDEF1X also allows saving these as reusable specifications, i.e. “SocialSecurityNumber: A character value with a format of ###-##-####”. These are then used when creating columns to save time and enforce consistency across projects.
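At implementation time such a specification might end up as a check constraint – a sketch, where the Person table and column name are my assumption:

```sql
-- Enforce the ###-##-#### format using T-SQL LIKE character ranges.
ALTER TABLE Person ADD CONSTRAINT CK_Person_SocialSecurityNumber
    CHECK (SocialSecurityNumber LIKE
           '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]');
```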

Domains: A set of valid values for an attribute.

Attribute Naming: Same as entities, but during modelling leaving a name plural can be a good reminder that you expect multiple values (e.g. Addresses – which will later be split into several columns).

 

Modeling Relationships

Identifying Relationships: Where the child object makes no sense without the parent, e.g. Invoice and InvoiceLine.

Nonidentifying Relationships: E.g. Invoice and Vendor.
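A sketch of both relationship types (column details are my assumption; a Vendor table is presumed to exist):

```sql
-- Nonidentifying: Invoice merely references Vendor;
-- VendorId is not part of Invoice's key.
CREATE TABLE Invoice
(
    InvoiceId int NOT NULL PRIMARY KEY,
    VendorId  int NOT NULL REFERENCES Vendor (VendorId)
);

-- Identifying: an InvoiceLine makes no sense without its Invoice,
-- so the parent's key migrates into the child's primary key.
CREATE TABLE InvoiceLine
(
    InvoiceId  int   NOT NULL REFERENCES Invoice (InvoiceId),
    LineNumber int   NOT NULL,
    Amount     money NOT NULL,
    PRIMARY KEY (InvoiceId, LineNumber)
);
```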

Role Names: This is basically just renaming FKs if the parent entity name is too generic.

Relationship Cardinality: denotes the number of child instances that can be inserted for each parent of that relationship.

Recursive Relationships: Nothing new on this really.

Subtypes: a special type of one-to-zero-or-one relationship, used to indicate the type of a generic entity. This can be a complete subtype (e.g. a person’s gender) or an incomplete one (e.g. employee type, i.e. they may be type 1, 2 or other). I guess these would often map to enumerations.
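One common way subtypes end up in tables (a sketch; the Employee example and discriminator codes are my own, not LD’s):

```sql
-- Generic parent with a discriminator column.
CREATE TABLE Employee
(
    EmployeeId   int     NOT NULL PRIMARY KEY,
    EmployeeType char(1) NOT NULL -- 'S' = salaried, 'H' = hourly (assumed codes)
);

-- Each subtype is a one-to-zero-or-one child sharing the parent's key.
CREATE TABLE SalariedEmployee
(
    EmployeeId   int   NOT NULL PRIMARY KEY REFERENCES Employee (EmployeeId),
    AnnualSalary money NOT NULL
);

CREATE TABLE HourlyEmployee
(
    EmployeeId int   NOT NULL PRIMARY KEY REFERENCES Employee (EmployeeId),
    HourlyRate money NOT NULL
);
```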

Many-to-Many Relationship: Nothing new on this really.

Relationship Names: known as verb phrases during modelling – taking the time to include these in the model makes this a good source of documentation.

70-442: Database Design Basics

Database Design Phases

Conceptual: a “sketch” of the database from initial requirements gathering and customer information. Business processes, scope and, most importantly, business rules. Capture this information in a conceptual data model consisting of a set of “high-level” entities and the interactions between them.

Logical: a refinement of the work done in the conceptual phase, transforming this into a full-fledged relational database design that will be the foundation for the implementation design. During this stage, you fully define the required set of entities, the relationships between them, the attributes of each entity, and the domains of these attributes (i.e., the sort of data the attribute holds and the range of valid values).

Implementation: In this phase, you adapt the logical model for implementation in the host
relational database management system (RDBMS; in our case, SQL Server).

Physical: This phase is also more or less the performance tuning/optimization phase of the project – indexes, disk layouts, and so on, come into play here (and not before this).

Relational Data Structures

Database and Schema: You use schemas to group objects in the database with common themes or even common owners.

Domains: The domain of a column is the set of valid values that the column is intended to store.

Keys: Every table should have at least one candidate key – an attribute (or combination of attributes) that can uniquely and unambiguously identify each instance of an entity. After the Primary Key (PK) is chosen, each remaining candidate key is referred to as an Alternate Key (AK).
Natural Keys are things like NI numbers – the danger here is that you do not control the format of these. Smart Keys are keys made up of various bits of information about the entity; nice because they can provide users with additional info about the entity, but they do not guarantee uniqueness. Surrogate Keys or Artificial Keys are made up, e.g. AutoIds or GUIDs.

OnLoad vs Page_Load vs .Load delegate

Excellent article on this here: http://weblogs.asp.net/infinitiesloop/archive/2008/03/24/onload-vs-page-load-vs-load-event.aspx

Looked this up because I always wondered when and if you should call the base.OnLoad (or whatever page event it is) when you override these.

Which of course leads to the question should you override, attach to the event or just use the AutoEventWireup feature.

First of all, the OnLoad method is responsible for firing the Load event (and ditto for any other page event) – so just to be clear: if you do not call base.On[Event](e); then the corresponding event will not fire! However, this actually gives you a lot of control, as you are able to do things before/after the event – which is useful for your standard PageBase.cs (i.e. check we have all required environment data and then fire the relevant page event).

Secondly I like the author’s general take on this – obviously all three options are viable and will do the job – so really it comes down to what you prefer.

I’ve never personally liked the AutoEvent black box magic, where markup in the frontend determines what happens in the code behind – however this does come straight out of the box and saves you having to maintain a load of += delegate statements in your OnInit event. Also – even though it might not be pretty – it’s a very well known convention that you will find the vast majority of .NET web devs are using, so you will probably just inconvenience yourself and others (spare a thought for the poor bugger who is going to inherit your code some day) if you start getting all purist on this.

My recommended solution – let VS generate its lovely little AutoEvent tags and markup, and then override OnLoad in your base page/controls to gain some fine-grained control over what happens when.