Published in The Navigator, Spring 1999
This article is excerpted from a book tentatively titled, Data Warehouse Project Management (Addison-Wesley, Spring 2000). All material is copyright Addison Wesley Longman, Sid Adelman and Larissa Moss.
Data cleansing is a "sticky" problem in data warehousing; we call it a Dichotomy in Data Warehousing. On the one hand a data warehouse is supposed to provide "clean, integrated, consistent, and reconciled data from multiple sources" and on the other hand we are faced with a development schedule of 6-12 months. Obviously, if we do one we fail the other, and if we stick to the other we fail on the first. - The answer lies somewhere in between. The difficult part is to determine just exactly where in between.
We use the following guidelines for both wholesale data warehouse (DW) and retail data marts (DM):
- Never try to clean ALL the data
Everyone would like to have all the data perfectly clean, but nobody is willing to pay for the cleansing or to wait for it to get done. To clean it all would simply take too long. The time and cost involved often exceeds the benefit. More on that in section (5).
- Never clean NOTHING
Sorry for the double negative, but basically it means: always plan to clean something. Afterall, the whole point for building the data warehouse or data mart is to provide cleaner and more reliable data than you have in your existing OLTP or DSS systems. If we simply use the old "Garbage In/Garbage Out" method, we are just adding one more system to our data problems. We'll call this the GIGO method.
- Always reconcile EVERYTHING
We know that a DW or a DM are not like a General Ledger, and they do not have to balance dollar for dollar. That is true. But for the DW or DM to have credibility, each dollar (whether you include it in the DW/DM or not) should be explainable. Therefore, creating reconciliation reports between the Source Feeds and the DW/DM Load Files is a MUST.
Here are two more reasons:
- It will reduce testing time for the development team if there are automated reports which account for every input record and every output record and every dollar and count processed by every Extract/Transform/Load program,
- The users will have proof that every record is accounted for, regardless whether you write all the records to the DW/DM or whether you accept all the fields into the DW/DM.
- Clearly spell out the goals and objectives of the DW or DM
How do you determine what data to cleanse, what data to load GIGO, and what data to leave out entirely? You need to look at your DW/DM goals and objectives.
Examine the reasons for building the DW or DM in the first place.
Here are some questions to ask yourself:
- Why are we building this? - or put another way:
- What is deficient in the current DSS environment?
- Do we have inconsistent reports?
- What is the cause for these inconsistencies?
- Is it dirty data or is it program errors?
- What dollars are lost due to dirty data?
- Which dirty data?
- Determine the cost for cleaning the dirty data
Before you start cleansing all the dirty data you've identified, you must determine the cost for each dirty data element. Ask yourself how long it would take to:
- analyze the data
- figure out the correct data values and correction algorithms
- write the programs
- correct the old files (if appropriate)
Now calculate the cost by multiplying the time by the fully burdened rate for the individuals who would be performing these tasks. You may also consider the cost that continues to occur for not solving other business needs that these individuals would be working on, if they weren't cleansing the data.
- Compare cost for cleansing to dollars lost by leaving it dirty
Everything done in business must be cost-justified. Unfortunately that applies to cleansing data as well. We say unfortunately, because we purists would like to see things fixed just for the principal of it. But businesses don't profit from principal, they profit from making wise spending decisions. For data acquisition/ data cleansing that translates into:
- if $ lost exceeds cost of cleansing, it goes on the "to be cleansed" list,
- if cost for cleansing exceeds $ lost, it goes on the "don't cleanse" list.
- Prioritize the "to be cleansed" list
Even though you have eliminated some of the dirty data from cleansing because it is not cost justified, you still may have too many on your plate. Rather than starting with the top of the list and quitting when you run out of time, you should now prioritize your "to be cleansed" list.
Put those dirty data elements that will return the highest business value on the top of the list, followed by those that will return a modest value, and by those that will return minimal value. Should you really run out of time on your project, you and your users would be assured that you maximized your cleansing effort.
- For items on the prioritized list ask: Can it be cleansed ?
You may have to do some data analysis to find out whether the "good data" still exist anywhere. Sometimes the data values are so convoluted, that you may have to find some "old-timers" who still remember what all the data values mean, and to help you write the transformation logic.
Then there will be times when, after several days of data analysis, you find out that you couldn't cleanse a data element even if you wanted to. You now have two choices:
- you take it off the "to be cleansed" list and load it GIGO,
- you do not include this data element in the DW or DM.
- What are the remaining goals and objectives?
Regardless what you decided to do with the data you cannot cleanse, you must do one more thing: Review the goals and objectives of the DW/DM.
Do the remaining goals and objectives still warrant proceeding with the DW/DM?
If the primary reasons behind the request for a DW or DM were traced back to dirty data problems, and you find that most of your dirty data
- is not cost justifiable to cleanse, or
- cannot be cleansed - period,
you need to seriously reconsider building the DW or DM. To build a costly DW or DM just for the principal of having one is as unwise as cleansing all the data just for the principal of it.
- Document all your data cleansing decisions as meta data
All the decisions you make along the way for all the dirty data you find (whether you cleanse it or not) must be documented as meta data in your repository:
- the degree of "dirtiness" (either by percentage or number of records)
- the dollars lost due to wrong or inconsistent reports caused by it
- the cost for cleansing it
- your decision to
- cleanse it
- load GIGO
- not to include in DW/DM
- Be sure the project plan reflects this data cleansing effort
We recommend the above steps as one of the first tasks on a project so that all the data cleansing activities are accurately estimated and reflected in the project plan.
We further recommend that these decisions are not made by the IT project manager alone. The users who fund the project must make these decisions, and the IT project manager must point out the pros and cons of each decision.
Although these 11 guidelines sound easy and straight forward, the process of weighing the cost versus the benefit, and deciding whether to choose easy GIGO loads and difficult data retrieval as opposed to difficult cleansing loads and easy data retrieval, is neither easy nor straight forward. You are basically choosing when and how you want to pay, but pay you will.
|