Warning: Creating default object from empty value in /home/johnnz/public_html/wp-content/themes/simplicity/functions/admin-hooks.php on line 160

Unique Keys are the Primary Cause of Duplication in Databases

Data Quality Paradox

It is perhaps the greatest paradox of data quality that the primary cause of duplication of entries in any database is the mistaken use of unique keys as the unique identifiers of records.

The problem starts in the data modelling stage of analysis when data analysts use codes as the unique identifiers (UIDs) of data entities.

Codes Are Not Identifiers

Data Modeling 101: The unique identifier of a data entity is NEVER A CODE!!

This may cause many data analysts to have apoplexy, as it is a practice that they have followed most of their lives.  But simple logic can show it to be incorrect.

For something to be a unique identifier it must first of all qualify as an identifier. Codes do not identify!  Don’t believe me?

A Simple Test

If I take you to the middle of a warehouse with 500 unlabelled parts laid out on the floor and tell you to go and pick up part 12562 you would be unable to proceed.  Why?  Because 12562 does not in any way identify any of the parts.  If I say “pick up a motor” then you will be able to make a start.  You might need to ask “an electric motor or a hydraulic motor?” but you will be able to identify a motor among the parts.

We see from this that codes are not identifiers. If they are not identifiers then they cannot be unique identifiers.

What is a UID?

Unique identifiers enable us to answer the question; “What is it, with respect to this business, that makes one occurrence of a data entity uniquely different from every other occurrence of that data entity?”  If we know this then we will have the unique identifier for that data entity.

So lets ask the question with regard to a finance company that loans money for the purchase of motor vehicles.  “With regard to this business, what is it that makes one customer uniquely different from another customer?”  For the purposes of this example we will stick to customers who are individuals, as opposed to corporate customers.

Straight away, some data analysts will shout out “it’s the customer number!”  Wrong!

Example UID Problem

To demonstrate this I could take you to many finance companies and show you the customer data files with records that would look very similar to the following.

Cust No First Name Surname Street Town D.O.B
1001 John Smith 123 High St Bristol 22 Dec 1950
1002 John Smith 123 High St Bristol 22 Dec 1950
1003 John Smith 123 High St Bristol 20 Aug 1771
1004 Mary Jones 27 West St Walsall 12 May 1975
1005 Mary Green 27 West St Walsall 12 May 1975
1006 Mary Jones 22 Grove St London 12 May 1975

Every customer has a different Customer Number, but do these six records represent six distinct customers?

It is highly likely that customer No 1001 and 1002 are the same person.  But what about 1003?  The same person again with an error in the date of birth or a younger person of the same name at that address; the son of the older John Smith perhaps?

Looking at Customer Numbers 1004 and 1005, we ask is Mary Green the same person as Mary Jones – now married to a Mr Green perhaps?  Or is she an entirely different person who just happens to have the same date of birth?

What about Customer 1006? Is this the same Mary Jones who has moved house?  If she had a bad debt record at 27 West St would you want to do business with her at 22 Grove St?

Data Uniqueness is a Business Decision

This set of records forces us to repeat the question; “What is it, in the context of this business, that makes one customer uniquely different from every other customer?”

One thing that is for certain is that this will NOT be Customer Number!

Uniqueness must be defined by the business. It is the job of data analysts to help the business arrive at this definition.

It is the job of database designers to implement this uniqueness in a manner that prevents duplicate records being created in the database.

If the business has not defined uniqueness from a business point of view and the data analysts have not modelled this, then it is impossible for database designers to implement uniqueness.

Unique Keys are Not UIDs

The use of the word “unique” in the term “unique key” and in “unique identifier” at some point in the past got completely misinterpreted by data analysts and database designers, in that they thought that unique keys were a mechanism for implementing unique identity in a database.

Because of this, what they did was to introduce to every database a mechanism that would, above all else, allow duplication to occur.

Amazingly, this error is still practiced worldwide on a daily basis, and not just by those analysts and designers who made the mistake many years ago but also, sadly, by those coming new to the data analysis and database design.

Stop it Now!

Removing Duplicate Records

How do you remove all of these duplicate records from your database?  Buy a clever piece of software that will do it for you?  You could, but it would simply be changing the problem not solving it.  You would either end up with a database with no duplicates removed or with a whole lot of records merged that should never have been merged.  Try to sort that one out!

The fact is that, if the business has not defined what it sees as the elements that make a data entity unique, then no piece of software can.

A full description of how to model UIDs for data entities is contained in the eBook:

IMM Data Structure Modelling

available from the Online Store.

For a quick introduction to the fundamentals pick up the Rapid Guide to Data Structure Modelling also available from the Online Store

Notes

The data shown is only a short, simplified and denormalised example, but is typical of data held by various finance companies for whom I have done consultancy.

I am not making an argument against unique primary keys in tables, as these are essential, but against confusing unique keys with unique identifiers.

Tags:

8 Responses to “Unique Keys are the Primary Cause of Duplication in Databases”

  1. Dave November 27, 2010 1:40 am #

    It seems that we are talking about two different things. The UID in the example could be used inside of a database; however, to refer to a customer by this UID outside of this DB context would not be helpful.

    • John Owens November 27, 2010 3:23 pm #

      Hi Dave

      Thanks for your comment.

      The main points of the article are to show:
      o That the Unique Identifier (UID) of a Party will NEVER be a code!
      o That it is the use of unique keys in a database that enable duplication.
      o That the Unique Identifier of a Party (whether they are playing the role of Customer or Supplier) must be known and defined at the BUSINESS level – not at the database level.

      It is at the database level that the business defined UIDs must be implemented in a manner that ensures that no duplication can occur and this will never be achieved by using a code!

      Regards
      John

  2. Sanja June 8, 2010 9:59 am #

    Fantastic article! So true!

    • John Owens June 8, 2010 3:24 pm #

      Hi Sanja,

      Thank you for your kind remarks.

      John

  3. scott September 1, 2009 3:06 pm #

    Good article, John

    I would take this one step further by stating that using nonsensical numbers as identifiers is a usability issue. That is, if the actual users of such identifiers were consulted in the database design process, the users could indicate the best identifiers for ease-of-use. The best identifiers would also likely be “quality” identifiers in that their ease-of-understanding would be in line with actual users of the data.

    Just brainstorming.

    • John Owens September 5, 2009 10:31 pm #

      Thanks, Scott,
      The identifiers you refer to are actually what I call QUACKs (Quick Unique Alternative Code or Keys) in the Integrated Modelling Method.

      These can be a very useful business mechanism for easily referring to data and business entities but are NOT unique identifiers (UIDs) and are only useful when implemented in parallel with UIDs.

  4. David Walker September 1, 2009 10:57 am #

    John,

    For large scale data warehouses we have worked around this problem by maintaining a ‘checksum key’ In our case we have a standard implementation of a combination of MD5 and MIME64 (Link above). This allows us to generate an anonymous code that is unique but still relates to the data. It also significantly speeds up loading and validation on load routines

    rgds
    davidw

    • John Owens September 1, 2009 11:18 am #

      Thanks David
      It would definitely make sure that the data in each customer record was unique but not that each related to a unique customer, unless the business had previously enforced this by defining (and implementing) the unique identifier for customer.
      John

Leave a Reply