
Information cleansing or scrubbing

A procedure that removes and/or corrects inaccurate information.  Also known as "data cleansing" or "data scrubbing," the procedure is mostly used in databases to track down inconsistent data, also known as "dirty data."

What is its significance?


Information scrubbing came about when businesses wanted to improve the accuracy of their data.  Correspondingly, the higher the data quality, the lower the cost of fixing misinformation and the smaller the drain on profits.   Data quality is data that meets certain criteria before it is accepted.  Here are a few examples of what data quality requires:

  1. Accuracy - a consensus that the data is correct
  2. Completeness - the data values present meet the specified requirements
  3. Consistency - the data is free from variations
  4. Uniqueness - each record is represented in a distinct way, such as by a primary key
  5. Timeliness - data values and records are up to date
  6. Validity - entered data meets the requirements for accurate identification
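Several of these criteria can be checked mechanically.  The sketch below, in Python, tests a few records against completeness, uniqueness, and validity; the record fields and the phone-number format are illustrative assumptions, not part of any standard:

```python
# Minimal data-quality checks (completeness, uniqueness, validity).
# The fields and the phone format are made-up examples.
import re

records = [
    {"id": 1, "name": "Alice", "phone": "555-0100"},
    {"id": 2, "name": "", "phone": "555-0101"},       # fails completeness
    {"id": 2, "name": "Bob", "phone": "bad-number"},  # fails uniqueness and validity
]

def check_quality(records):
    problems = []
    seen_ids = set()
    for r in records:
        if not all(r.values()):                           # completeness: no empty fields
            problems.append((r["id"], "incomplete"))
        if r["id"] in seen_ids:                           # uniqueness: id acts as primary key
            problems.append((r["id"], "duplicate id"))
        seen_ids.add(r["id"])
        if not re.fullmatch(r"\d{3}-\d{4}", r["phone"]):  # validity: phone format rule
            problems.append((r["id"], "invalid phone"))
    return problems

print(check_quality(records))
```

Each problem is reported with the offending record's id, so a later cleansing step knows where to look.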
How does it work?
Before computers developed the capability to do information cleansing, most of it was done by hand.  As a result, manual information scrubbing was time consuming, expensive, and error-prone.  Once computers had the capability, however, companies could either create their own software or buy from companies that specialized in information or data cleansing.  Though there may be different versions of data scrubbing, they follow the same procedure:
  1. Data Auditing - uses statistical methods to identify anomalies and their locations when checking the data's quality
  2. Workflow Specification - the second step deals with how to find and eliminate errors.  The workflow specifies a series of operations for detecting and correcting the anomalies to achieve high data quality
  3. Workflow Execution - the third step carries out the operations specified in the previous step.  Workflow execution is also the step where many companies conduct trade-off analysis, due to the high expense of this phase of information scrubbing
  4. Post-Processing and Controlling - the last step verifies the results of the workflow execution.  Any remaining errors that were not eliminated by the previous step can be corrected manually.  From this state, the data can be rescrubbed for further elimination of "dirty data"
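The four steps above can be sketched in code.  This is a rough illustration with made-up helper names and a single toy rule (an `age` field must be in range); real cleansing tools differ widely:

```python
# A toy walk-through of the four-step cleansing workflow.
def audit(rows):
    """Step 1: flag rows whose 'age' is missing or out of range."""
    return [i for i, r in enumerate(rows) if not (0 <= r.get("age", -1) <= 120)]

def build_workflow(anomalies):
    """Step 2: decide an operation for each anomaly (here: one fixed rule)."""
    return [(i, "set_default_age") for i in anomalies]

def execute(rows, workflow):
    """Step 3: apply the chosen operations."""
    for i, op in workflow:
        if op == "set_default_age":
            rows[i]["age"] = None   # mark unknown rather than keep a bad value
    return rows

def post_process(rows):
    """Step 4: verify; anything still unresolved goes to manual review."""
    return [r for r in rows if r["age"] is None]

rows = [{"age": 34}, {"age": 430}, {}]
flagged = audit(rows)                            # rows 1 and 2 are anomalous
rows = execute(rows, build_workflow(flagged))
manual_queue = post_process(rows)                # rows left for manual correction
```

Note how step 4 feeds back into manual work, matching the text: whatever the automated workflow cannot resolve is handed off for rescrubbing.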
Here are some of the different methods used for information cleansing:
  1. Parsing - used to find a collective set of inconsistent data that does not meet the appropriate specification
  2. Data Transformation - converts data into the format required for a specific function
  3. Duplicate Elimination - with the use of an algorithm, searches for and deletes data that is entered more than once in the same database.  This increases data quality by reducing redundancy
  4. Statistical Methods - with the use of statistical analysis, values that are unexpected or missing can be found and replaced with plausible data values
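Two of these methods are simple enough to sketch.  Below, `dedupe` does duplicate elimination using a crude normalization step as a stand-in for parsing, and `outliers` flags unexpected values using a mean/standard-deviation rule; both functions and their thresholds are illustrative assumptions, and real matching and imputation algorithms are far more sophisticated:

```python
# Hedged sketches of duplicate elimination and a statistical method.
from statistics import mean, stdev

def normalize(name):
    """Crude parsing stand-in: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())

def dedupe(names):
    """Keep the first occurrence of each normalized name."""
    seen, kept = set(), []
    for n in names:
        key = normalize(n)
        if key not in seen:
            seen.add(key)
            kept.append(n)
    return kept

def outliers(values, k=1.5):
    """Flag values more than k standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

print(dedupe(["Ann Lee", "ann  lee", "Bob Ray"]))  # ['Ann Lee', 'Bob Ray']
print(outliers([10, 11, 9, 12, 500]))
```

Flagged outliers would then be replaced with plausible values (for example, the mean of the remaining data), as the text describes.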
 If a firm chooses to buy commercial data cleansing software, it can be expensive.  Prices for workflow execution software range from $20,000 to $300,000.  From this standpoint, many companies weigh the cost-benefit of purchasing information scrubbing software against creating their own.  Some companies choose to build their own or ignore the data cleansing process altogether due to the high costs it would impose.  Despite the attempt to achieve short-term savings, most companies incur long-term losses when the errors are not corrected.

How do errors occur?


Data that contains errors, also known as "dirty data", can arise in several ways.  One is human error, such as customers entering the wrong information into a database.  Another appears when transferring information from one database to another; a likely cause is the two databases using different formats.  Branches of a company may also lack common input standards, meaning at least two branches of the same company have different methods of entering data into their databases.  Finally, errors can occur when old systems containing inconsistent and outdated data are not updated.


The combined result of all these sources of error can be significantly costly to a business.  For example, take a firm that has entered some information into its database and a couple of errors occurred.  Even though small errors in one database might seem unimportant to the firm, if that database is merged with others the errors can multiply.  Now suppose there are a multitude of databases, each with small errors inside.  When someone converges all of them into one large database, such as a data warehouse, all the small errors or "dirty data" accumulate in the one massive database.  As a result, instead of small errors scattered across many databases, one database has many errors, which brings down the quality of the data it represents.  If people or other businesses see inaccuracies in a database they rely on, they see the incorrect information as potential losses in sales and market share.


What are the Challenges and Problems using Information Cleansing?

  1. Error Correction and Loss of Information - the potential loss of information could lead to costly reentry or searching for the lost data.  This usually occurs when duplicates are removed
  2. Maintenance of Cleansed Data - costly and time consuming.  Since the data has already been corrected and changed, redoing the procedure could lead to unwanted changes and loss of information
  3. Data Cleansing in Virtually Integrated Environments - because these environments are virtual, data has to be scrubbed every time it is accessed.  The result is decreased response time and efficiency
  4. Data Cleansing Framework - because an advanced structure to guide the information cleansing cannot always be created, the process involves manual interaction and repeated steps.  With a framework for the software to follow, a user can guide the data scrubbing program to correct errors and delete duplicates as data is entered.  A data cleansing framework can work not only with information scrubbing software but also with other data processing stages, such as integration and maintenance
Website Links for more Information: