The Pitfalls of Data Re-Use

I have been reading some posts lately on data re-use, some suggesting that it was an effective method of incrementally improving data quality.  Data re-use, though seemingly an eminently sensible practice, is fraught with danger. The only data that is truly safe to re-use is genuinely raw data. All other data re-use should come with a health warning.

I first came across the reality of this in the world of oil and gas exploration. In this world, even departments in the same enterprise would not use data from other departments. At first I viewed this as short-sighted and typical of the “silo” mentality that pervades many large organisations.

However, when speaking candidly with one very enlightened exploration specialist I found out why. “We just can’t trust it. We do not know what has been done with it,” he explained. This was true. People would collect raw data and would carry out all sorts of manipulation to arrive at some conclusion critical to their task. They would then, quite generously, offer this data to the rest of the enterprise and would feel quite spurned by the rejection of their offer.

The trouble was that what they were passing on was their manipulated data, not the raw data that they had collected. Now, exploration experts know how hard it is to get valid results through properly interpreting quality raw data. They have also learned, through painful and costly experiences, that this is impossible to achieve with manipulated (in effect, corrupted) data.

This is the case in any enterprise, whether we are talking about seismic data or “customer” address files. Once the raw data has been manipulated then it is impossible to ascertain the quality of that data, because one man’s “deduplication” is another man’s corruption.
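To make the “deduplication” point concrete, here is a minimal sketch in Python. The records, the two departmental matching rules and the `dedupe` helper are all invented for illustration; real matching rules will be more elaborate, but the effect is the same.

```python
# Hypothetical customer records; names and postcodes are invented.
raw = [
    {"id": 1, "name": "J. Smith", "postcode": "AB1 2CD"},
    {"id": 2, "name": "John Smith", "postcode": "AB1 2CD"},
    {"id": 3, "name": "Jane Smith", "postcode": "AB1 2CD"},
]

def dedupe(records, key):
    """Keep only the first record seen for each key value."""
    seen = {}
    for rec in records:
        seen.setdefault(key(rec), rec)
    return list(seen.values())

# Department A: same surname at the same postcode is "one customer".
by_surname = dedupe(raw, lambda r: (r["name"].split()[-1], r["postcode"]))

# Department B: only an identical full name at the same postcode is a duplicate.
by_full_name = dedupe(raw, lambda r: (r["name"], r["postcode"]))

surviving_a = [r["id"] for r in by_surname]    # ids left after rule A
surviving_b = [r["id"] for r in by_full_name]  # ids left after rule B
```

Rule A leaves a single record, while rule B leaves all three. Anyone handed Department A's file has no way of knowing that Jane Smith ever existed, which is exactly why the receiving department cannot trust it.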

The world of digital photography has learned this lesson, where all good photographers keep a “digital negative” of the original and always work on a copy of this. Most will also work in RAW format as opposed to JPG, as the former captures the raw data ‘as is’, while JPG does some manipulation even in the capture process, thus already distorting the data.

Also, each time you edit and re-save a photo in JPG format, it gets even further degraded as unwanted artefacts get built in. Even when you do manipulation that is intended to improve the image, such as changing contrast or sharpening, the underlying data structure is degraded. So, instead of converging to a ‘perfect’ picture, the digital structure moves further and further away until, at some point, any further manipulation will totally distort the image.
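The effect can be sketched with a toy model. This is not how JPG compression actually works (that is based on a block transform followed by quantisation); here a simple rounding step, `lossy_save`, stands in for the lossy save, and the pixel values, step size and edit amounts are all assumptions made for illustration.

```python
STEP = 8  # hypothetical quantisation step standing in for lossy compression

def lossy_save(pixels):
    """Round each value to the nearest multiple of STEP, clamped to 255."""
    return [min(255, (p + STEP // 2) // STEP * STEP) for p in pixels]

def brighten(pixels, amount):
    """A simple 'improving' edit: add a constant, clamped to 255."""
    return [min(255, p + amount) for p in pixels]

raw = [10, 100, 200]   # invented pixel values
EDITS, AMOUNT = 5, 3   # five small brightening edits
ideal = brighten(raw, EDITS * AMOUNT)  # what the edits were meant to achieve

# Workflow A: keep the raw "negative", apply the combined edit once, save once.
from_raw = lossy_save(brighten(raw, EDITS * AMOUNT))

# Workflow B: open the last saved copy, edit it, and save over it each time.
resaved = raw
for _ in range(EDITS):
    resaved = lossy_save(brighten(resaved, AMOUNT))

error_from_raw = [abs(a - b) for a, b in zip(from_raw, ideal)]
error_resaved = [abs(a - b) for a, b in zip(resaved, ideal)]
```

In this toy model, working from the raw values leaves every pixel within 3 of the intended result, while the edit-and-resave workflow ends up between 9 and 15 away, because each small edit is partly or wholly rounded away by the next save.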

So it is with business data. With each manipulation, unwanted artefacts get built in to the data until, at some point, any further manipulation will cause real corruption.

With a digital image the distortion of the data will be all too evident. With business data the distortion will be hidden, apart from the fact that the data, which has undergone constant ‘improvement’, is now throwing up unpredictable results. Most times these results are seen as programming errors. However, no amount of debugging seems to remove them!

Breaches of Fifth Normal Form are a prime example of how errors get built into data, not through incorrect data values, but through unknowingly creating flawed data structures. No amount of programming can remove a breach of 5NF!
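As a rough illustration of what such a breach looks like, the sketch below uses the familiar agent/company/product style of example. The data and the business rule are assumptions: suppose the rule is that if an agent represents a company, the company makes a product, and the agent sells that product, then the agent sells that product for that company. A single three-column table cannot enforce this rule, so it can drift into a state that contradicts it.

```python
# Hypothetical rows of (agent, company, product) held in one table.
table = {
    ("Smith", "Ford", "car"),
    ("Smith", "GM", "truck"),
    ("Jones", "Ford", "truck"),
}

def rejoin(rows):
    """Split rows into three pairwise projections and join them back."""
    agent_company = {(a, c) for a, c, p in rows}
    company_product = {(c, p) for a, c, p in rows}
    agent_product = {(a, p) for a, c, p in rows}
    return {
        (a, c, p)
        for (a, c) in agent_company
        for (c2, p) in company_product
        if c == c2 and (a, p) in agent_product
    }

# Rows that the assumed business rule says must exist, but the table lacks.
missing = rejoin(table) - table
```

Here `missing` contains `("Smith", "Ford", "truck")`. Every individual value in the table is valid, yet the table as a whole contradicts the rule. Inspecting or correcting values will never reveal this; only the structure can.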

Does all of the above mean that you can never re-use data? No. However, it does mean that you have to be aware of the dangers of using manipulated (whether termed ‘deduplicated’ or ‘cleansed’) data. Every manipulation can introduce unwanted (and invisible) artefacts that degrade rather than improve the quality of the data. When such artefacts are present, any further manipulation will degrade the data quality still further.

2 Responses to “The Pitfalls of Data Re-Use”

  1. Richard Ordowich April 21, 2011 12:21 pm #

    Although the reuse of data does not necessarily result in improved quality, there are instances where it does, such as e-mail addresses. E-mail addresses get stale. If an e-mail address is reused and someone discovers it is wrong, they can then update the address, which will then be used by others. The quality improves with reuse.

    I think the premise that genuinely raw data is safe is ill-advised. Data is a representation of some entity. Who’s to say the representation is correct? Even raw data can be bad! Unless there is a process to verify, validate and certify the data, all data should be suspect.

    • john April 22, 2011 12:08 am #

      Thanks Richard

      Updating e-mail addresses, telephone numbers, etc. is a task that can be done safely, provided that either 1) the data structures of the enterprise enable an audit trail to be kept to see what was changed, when and by whom, or 2) the enterprise data quality standards state that for such items no audit trail is required and that the last value entered is deemed to be the correct value. Case 2) is, sadly, the case in most enterprises, and some might say that it does not represent quality at all.

      The real problem is not altering the values of attributes, but rather removing records and altering structures. These are the activities that have the highest probability of introducing unwanted structures or artifacts.

      Data can be progressively improved only if a) all changes conform to a properly constructed Logical Data Model, b) there is a full audit trail for all changes and c) the data is “baselined” at regular intervals. Each new baselined version would equate to a quality-proven “raw” data version.

      Regards
      John
