
Data Quality – One Version of the Truth?

The concept of “one version of the truth” is possibly one of the most widely discussed (and disputed) topics in Data Quality.

Some say that one version of the truth can never exist; others say that it must always exist or there is no data quality.

There is one version of the truth but, perhaps, not the truth as you know it!

One Version of the Truth? I Wouldn’t Start From Here

There is an old story told about a visitor travelling around Ireland who loses his bearings and stops to ask a local for directions.

“Excuse me.” He says, “How do I get to Dublin from here?”

The local looks very concerned, shakes his head, and declares in a serious tone, “Well, to be honest with you, if I was going to Dublin I wouldn’t start from here!”

This story can be seen merely as a demonstration of “daft” Irish logic or, as I have often found in analysis and modelling, a very powerful warning against starting off in entirely the wrong place, which, sadly, is the starting point for most analysis projects.

One very wrong place to start when moving to one version of the truth is by building standards based on loose, rather than unambiguous, terms and on analogy rather than on analysis.

One Version of the Truth? Say What You Mean

When people in the Data Quality world use the term “truth”, what do they mean?  Do they really mean truth or something else?

If something else, then why not say what that is, using plain, unambiguous terms?

If they do mean truth, then they must clearly define what “truth” means to them.  Not to them personally but to the enterprise to whom the data belongs.

Is “lying” the opposite of “truth”?  If their data is not “truth”, is it a lie?  Are they saying that their data lies?  When they enter “red” as the value for Colour, does the database show “green”?

To those who say, “we have many versions of the truth”, does “red” sometimes appear as “pink” sometimes as “purple” and other times as “indigo”?

Is “many versions of the truth” merely a euphemism (or is that a lie?) for “we have no idea of what our data will tell us, but we are hardly going to admit that!”?

Two essential requirements when moving to one version of the truth are 1) a sound understanding of the fundamentals of data quality and 2) the use of terminology consistent with those fundamentals.  What do I mean by this?

Firstly, if you are going to use terms like “Master Data” and “Transactional Data”, then make sure that you fully understand what these are.  Secondly, having grasped these fundamentals, NEVER talk about Customer as Master Data!

If you are working in an enterprise where Customer is truly a master data entity, then you are working in an enterprise so small and with so few applications that you will have few data problems.

It’s All in the Relationships

Customer, Supplier (and, in some enterprises, Employee) are terms that describe a trading or contractual relationship that an enterprise has with other legal entities.  The nature of the relationship is derivable from the transactions that the enterprise has performed that are linked to that legal entity or “Party”.

For example, the relationship of “Customer” exists when the enterprise has made a sale to a Party.  “Supplier” indicates the enterprise has purchased a service or product from a Party.
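The idea of a derivable relationship can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Party` and `Transaction` classes, their attributes, and the `roles_of` function are hypothetical names invented for this example, not part of any particular system. The point it shows is that “Customer” and “Supplier” are never stored; they are worked out from the transactions linked to a Party.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Party:
    # Hypothetical attributes, chosen only for illustration.
    legal_name: str
    country: str


@dataclass(frozen=True)
class Transaction:
    party: Party
    kind: str  # assumed to be either "sale" or "purchase"


def roles_of(party, transactions):
    """Derive the roles a Party plays from the transactions linked to it."""
    roles = set()
    for transaction in transactions:
        if transaction.party != party:
            continue
        if transaction.kind == "sale":
            roles.add("Customer")   # the enterprise sold to this Party
        elif transaction.kind == "purchase":
            roles.add("Supplier")   # the enterprise bought from this Party
    return roles


acme = Party("Acme Ltd", "IE")
history = [Transaction(acme, "sale"), Transaction(acme, "purchase")]
print(sorted(roles_of(acme, history)))  # ['Customer', 'Supplier']
```

In this sketch the same Party is both Customer and Supplier, yet there is only one Party record. A Party with no transactions simply has no roles.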

UID Defines “One Version of the Truth”

Data Quality practitioners, instead of talking in analogies and euphemisms, must get back to (or learn for the first time) a few key fundamentals. Those in search of the “one truth” – and that should be everyone – need to learn about the Unique Identifier or UID.

It is a lack of understanding of precisely what the Unique Identifier of an entity is (and the consequential failure to properly implement this) that is the primary cause of duplication in databases worldwide.

So what is a UID?  This is easily established for any entity by asking the following question, “what is it, with respect to this enterprise, that makes one occurrence of <data entity> uniquely different from every other occurrence of <data entity>?”

And, just to warn you in advance, the UID of an entity is NEVER a code!  See Unique Keys are the Primary Cause of Duplication in Databases to see why.
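To make the difference between a code and a UID concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the answer to the question above for a Party is the combination of legal name, country of registration and date of incorporation; your enterprise's answer may well be different. The record layout, the `UID_ATTRIBUTES` tuple and the `find_duplicates` function are hypothetical names. The sketch groups records by those attributes and reports any occurrence that has been given more than one code.

```python
from collections import defaultdict

# Assumed UID for this example only: the attributes that, taken together,
# make one Party different from every other Party.
UID_ATTRIBUTES = ("legal_name", "country", "incorporated_on")


def find_duplicates(records, uid_attributes):
    """Group records by their UID attributes and return groups with several codes."""
    groups = defaultdict(list)
    for record in records:
        # Normalise whitespace and case so trivial differences do not hide a match.
        key = tuple(str(record[name]).strip().casefold() for name in uid_attributes)
        groups[key].append(record["code"])
    return {key: codes for key, codes in groups.items() if len(codes) > 1}


records = [
    {"code": "P001", "legal_name": "Acme Ltd", "country": "IE", "incorporated_on": "1998-04-01"},
    {"code": "P002", "legal_name": " acme ltd ", "country": "IE", "incorporated_on": "1998-04-01"},
    {"code": "P003", "legal_name": "Bolt Ltd", "country": "NZ", "incorporated_on": "2005-09-12"},
]

for key, codes in find_duplicates(records, UID_ATTRIBUTES).items():
    print(key, codes)
```

Every code in this data is unique, so a uniqueness check on `code` alone would pass, yet P001 and P002 describe the same Party. Only a comparison on the identifying attributes exposes the duplicate.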

Comments?

Please feel free to leave a comment below or Tweet this post and share it with a colleague.

Let me know what you think about One Version of the Truth.

5 Responses to “Data Quality – One Version of the Truth?”

  1. Helen March 22, 2011 6:35 pm #

    I think I understand what you are saying but am struggling to understand how to apply it in my environment. Government is very complex and the phrase “One view of the customer” is thrown about all over the place. I understand the difficulties of achieving this and I also understand the difficulties of producing a UID to integrate all the data that exists.

    For instance – in the past we have created our own UID for each resident as everything else, aside from their date of birth, is subject to change. However, you say that we should not use ‘code’. What is the reasoning behind this and what can be done better? We want to start integrating our systems with other organisations and need to agree on a UID between them. Is the answer to have a UID per system integration and then one generated UID in the Master Data Repository? We have previously wanted to use their national identity number/social security number but apparently this contravenes data protection laws when we start integrating systems that carry data outside of government.

    • john March 30, 2011 10:06 am #

      Hi Helen

      The main reason you should not use codes as your UID is that, contrary to expectations, they are the major cause of duplication in databases worldwide. Read Unique Keys are the Primary Cause of Duplication in Databases to see why.

      As I point out in that post, codes do not identify and, if they do not identify, they cannot uniquely identify.

      To find the unique identifier for any entity you must be able to ask and answer the question, “What is it, with respect to [entity name], that makes any one occurrence of it uniquely different from every other occurrence of it within our enterprise?”

      This will be a combination of one or more attributes of the entity and, quite often, relationships with other entities.

      I hope that reading the other posts helps to clarify the reasoning.

      Do not hesitate to contact me.

      I am willing to have a Skype call with your team if it would help.

      Regards
      John

    • Rock July 7, 2011 10:37 pm #

      There’s nothing like the relief of finding what you’re looking for.

  2. Richard Ordowich November 27, 2010 1:54 am #

    UID’s are subject to the same ambiguities as customer, supplier etc. What does the UID represent? A unique “Customer” or “Supplier”? If Customer and Supplier are ambiguous then how could the UID be unambiguous? If a UID is assigned to a “Prospect” does that imply the “Prospect” is a “Customer”? To a salesperson that may be the case but to finance the role is the defining attribute that determines if this entity is a Customer.

    All data is subject to this ambiguity and so the “truth” becomes elusive even for UID’s. UID’s are just another data attribute that is defined and the rules governing that data are “agreed” to by the majority of stakeholders who use this data. However, it still doesn’t resolve the dilemma of what defines a customer. The question, “what is it, with respect to this business, that makes one occurrence of a data entity uniquely different from every other occurrence of that data entity?” is not resolved with a UID. Unless of course each instance of the entity has its own unique ID, then perhaps you get closer to the concept of a UID.

    Therefore a UID is a classification of a group of entities sharing similar but not necessarily identical characteristics. As a result this is not really the one version of the truth since you cannot distinguish each entity as unique. UID’s as defined in this article are an “acceptable generalization” to identify entities needed to run the business. To be more precise they should be called “Generally Accepted Identifier” not “Unique Identifier”.

    • John Owens November 27, 2010 3:08 pm #

      Hi Richard

      Thanks for your feedback.

      By their very definition, Unique Identifiers MUST uniquely identify the entity to which they relate.

      Any enterprise that cannot define Unique Identifiers for its entities is in trouble and, as can be evidenced by the amount of duplicate data out there, there are many such enterprises.

      Many enterprises further compound their confusion by trying to define Unique Identifiers for entities that do not exist, such as “Customer” and “Supplier”. As my article shows, these are merely derivable relationships with other legal entities.

      Answering the question, “What is it, with respect to this business, that makes one occurrence of a data entity uniquely different from every other occurrence of that data entity?” is the only way for an enterprise to establish what its Unique Identifiers are.

      Remember, Unique Identifiers will NEVER be codes! They will be a combination of one or more attributes and, possibly, one or more relationships.

      Having defined what the UIDs are, the next step is to define how these will be implemented within a database. Once again, using a code for this purpose will always fail.

      Regards
      John
