|
Published in DM Review, Vol 8, No.2, February 1998 Copyright © 1998, Larissa T. Moss, Method Focus Inc. "Dirty data" is such a pervasive problem in every company, in every industry! But we have lived with "dirty data" for decades, so why is it such a problem now? Is it because we promise to deliver data warehouses with "clean, integrated, historical data in a short time frame for low cost", yet we are unable to deal with the preponderance of "dirty data" within the framework of this promise? Some data warehouses are failing because the promised "clean", "integrated", "historical" data could not be delivered. Others are failing because the promised "short" time frame and "low" cost were exceeded in the attempt to clean up the data. In other words, they are failing because of the dichotomy of our promise. How could we, the consciencious IT professionals, have allowed "dirty data" to happen in the first place? How could the users, the owners of the data, have allowed it to happen? Or, are there legitimate, sometimes even justifiable, reasons for the existence of "dirty data" in our legacy systems? And what should we do with it now? The first blame for this "disgraceful" condition is usually put on the inadequate and sometimes non-existent edit checks of our operational systems. And the second blame immediately follows, pointing to sloppy data entry habits. Admittedly, a large percentage of "dirty data" is generated inadvertantly and not caught by the edit checks. But an equally large percentage ends up in our legacy files quite purposefully entered.
Dummy values
In some instances we find dummy values that actually have meaning, for example a social security number of 888-88-8888 to indicate a non-resident alien status of a customer, or a monthly income of $99,999.99 to indicate that the customer is an employee. If you had to calculate the average monthly income of all your customers, the results would be flawed.
Absence of data
Multi-purpose fields
Upon closer examination, the size and content of such records can be anything from a string of dates, redefined as a string of amounts, redefined as a mixture of text and numbers, redefined as ... How does one end up initializing such "kitchen sink" records? With Low Values of course, or how about with Spaces. "No matter", you may say, as long as the redefined fields are always populated properly according to the rules of the record type. "Yes it does matter", I say, because they never are! But let's not stop here. The reality is, that once we are up to 25 redefines, many of the first few redefined data values are no longer applicable, and in many cases no one is left to even remember what they had meant at one time. So, you launch a "clean up campaign" advising all users of the system that you will delete these bad values at some specified point in time. Much to your surprise, you get 2 panic phone calls and 1 puzzled e-mail response from people you did not even know were using this system. You just discovered that somebody had been using this field for yet another purpose to perform their business function without informing anyone about it, and removing the "bad values" will all but shut them down, and that somebody else had been misinformed about the 25 different meanings of the field and had been misreporting critical financial information based on these "bad values".
Cryptic data
For example, let's say that originally our hypothetical loan system tracked whether or not taxes were impounded on a mortgage loan. The code "I" may have been used to mean No, there is No [tax] Impound account. Let's say that years later we started to impound for insurance. Since the code "I" was already being used for No Tax Impounds, "T" seemed like a "logical alternative" for No Insurance Impounds - all we'd have to remember is that they are both negative and switched. After a few more years we wanted to track certain FHA loans, which required no insurance. The code "T" was already being used to mean No, there is no insurance impound, but it had a different meaning from the new requirement, so the new code "F" was added. And, since we seem to be tracking "exceptions to payment processing" in this field, let's add "E" as a catch all indicator for any and all exceptions. Oops, there are loans for which we need to know that neither Tax nor Insurance is impounded. Although we have not run out of the alphabet yet, let's keep the field blank and write logic to interrogate the Impound Balance field, and if there is a value greater than ZERO, let's check to see if an Impound trailer record exists, otherwise the Impound Balance field has a completely different meaning. Another few years go by, and ... we all get the picture. WHY you may ask? Most certainly there was a logical reason at the time, but only those who worked on it 20-30 years ago know what it was.
Contradicting data
Inappropriate use of address lines
Address line 2: TORNEYS Address line 3: 10 MARKET, SAN FRANC Address line 4: ISCO, CA 95111
Violation of business rules
Reused primary keys
Non-unique identifiers
Data integration problems
The far more serious problem is when data that should be related cannot be related. This is sometimes linked to the previous problem of non-unique primary keys, but more often it is due to the absence of any keys. A common example of this can be found in banks. All banks assign unique account numbers, but few banks assign unique customer number. For decades the customers have been associated with their accounts through a customer name field on the account record. When we examine all the account records that belong to one customer, we find different spellings or abbreviations of the same customer name, sometimes the customer is recorded under an alias or a maiden name, occasionally 2 or 3 customers have a joint account and all of their names are squeezed into one name field with rather cryptic results. Unless all this information is tracked manually, which it seldom is, especially if the customer has accounts in multiple branches, analyzing customer profitability is all but impossible. Now that we have exposed these "dirty data" nightmares, let's hold back on our urge to lynch the guilty, but let's instead be fair to the users and the IT staff who creatively created this mess. In most cases this type of creativity was, and under the prevailing circumstances maybe should have been, rewarded. Seriously, what are the obvious alternatives? Change the files and the programs, of course! However, if the system is a purchased package or it's a home-grown system over 30 year old, maybe even written in Assembler, this "simple" solution may neither be simple nor cost effective at all.
To Cleanse Or Not To Cleanse ...
The next question is more difficult: SHOULD it be cleansed? Here too, the answer is often NO. For most of us in data management this answer smacks of heresy. What purpose would it serve to suck "dirty data" out of the operational systems and plunk it into a data warehouse as-is? Obviously, none. Clearly some data cleansing must occur. However, we all have to face the business reality of today, and the business expectations to deliver value added information in a relatively short time frame for a low cost. Although "short time frame" and "low cost" are subjective in each company and can represent a wide range of interpretation, it should be obvious that the time and effort it would take to create and test the long and complex logic to fix some of our worst "dirty data" cases would fall outside that range. Once we decide which data should be cleansed, the question is: WHERE do we cleanse it? Do we clean up our operational data on the operational systems? Or, do we perform the cleansing transformations in our extract and load processes for the data warehouse? Usually our first reaction is to clean up our operational systems, and in some cases that can and should be done. However, all too often those who are using the operational systems for operational purposes do not need the data any cleaner than it is, and they resist any attempts to change it. And in many cases it may be legitimately too labor intensive, not cost effective, or simply impossible to do, and so we end up putting the burden of cleansing on our extract and load processes. The final question of course is: HOW do we cleanse what can reasonably be cleansed? Can the data cleansing products on the market today handle a lot of the common data quality problems shared by most organizations? Quite obviously, the answer is YES. But are the cleansing products on the market capable of resolving all of these very complicated and very customized "dirty data" situations? Quite obviously the answer is NO. If you are truly serious about creating value added information and knowledge above and beyond the condition of your operational data, will you have to bite the bullet and write procedural code? The answer is definitely YES.
What are the steps?
The answers to these fundamental questions must come from the users, not from IT. Certainly, the expertise and involvement of the IT staff will help the users identify, quantify, document, and analyze their business needs, but it is the users who decide why a data warehouse should be built, not IT. Once the goals and objectives for the data warehouse have been established, and it is clearly understood what business questions cannot be answered today and why, it is IT's responsibility to analyze the existing operational files and to locate, document, and report the discovered "dirty data" to the users. Now the painful task begins in determining where to get the "biggest bang for the buck". Users and IT together, must evaluate the tangible and intangible benefits for each business question that could be answered by a data warehouse that cannot be answered today. Together, they must understand how each "dirty data" case would prevent these business questions from being answered, and they must understand the effort involved to cleanse it. Now the compromising begins. If the benefits outweigh the costs of the effort, the data should definitely be cleansed. Now the decision must be made whether or not to make the necessary changes to the operational systems to (a) clean up the existing data and (b) to prevent future "dirty data" from being entered. All efforts should be made to improve the operational systems, unless that effort is so unreasonably high, for instance in the case of multi-volume historical tapes, or it simply cannot be done because the original data sources no longer exist. The reality is that, more often than not, the majority of cleansing ends up being done in the extract and load processes. If the costs outweigh the benefits, another painful decision must be made: Should the "dirty data" go into the data warehouse as-is, or should it be left out? Again, users and IT together, must weigh any possible benefits that could be derived from including this data, dirty as it is, with any possible harm it could do, such as skewing results of an important trend analysis thereby rendering it useless, or worse, providing wrong information which leads to bad business decisions. This brings us back to the very essence and also to the dichotomy of data warehousing: the promise to deliver "clean", "integrated", "historical" data in a "short" time frame for "low" cost. After we fully comprehend the "realistic" time frame and "realistic" cost required to achieve the first part of the promise, we realize that we have been mislead by the second part of the promise.
|