Entry Date: September 15, 2002

Total Data Quality Management (TDQM)

Principal Investigator: Stuart Madnick

Project Website: http://web.mit.edu/tdqm/


In recent years, most corporations, large and small, have initiated Total Quality Management (TQM) programs with goals that include 100% customer satisfaction and zero product defects. Quality management programs have been a key factor in the success of companies across many industries.

Often, TQM programs and other strategic corporate initiatives fall short or fail outright because the data used to monitor and support organizational processes are incorrect, incomplete, or otherwise faulty or inappropriate for a given application. Anecdotal evidence and a growing literature point to data being defective at rates of 10% or more across a variety of applications and industries, including sales-force automation, direct-mail programs, and productivity-improvement programs.

MIT's Total Data Quality Management (TDQM) research effort has grown out of industry's need for high-quality data. The overall objective of this program is to establish a solid theoretical foundation for this embryonic field and, from this work, to devise practical methods for business and industry to improve data quality. We will develop the tools and other capabilities necessary for data quality management in the technical, economic, and organizational phases of business operations.

The TDQM project has both long-term and short-term goals. The long-term goal of this research program is to create a theory of data quality grounded in reference disciplines such as computer science, organizational behavior, statistics, accounting, and total quality management. This theory of data quality, in turn, may serve as a foundation for other research contexts where the quality of information is an important component. In the short term, the goal is to create a center of excellence among practitioners of data quality techniques and to act as a clearinghouse for effective methods and project experiences.

There are three major components of the TDQM research program: data quality definition, analysis, and improvement. The definition component focuses on defining and measuring data quality. The analysis component identifies and quantifies the impact of poor-quality data, and the benefits of high-quality data, on an organization's effectiveness. Finally, the improvement component involves redesigning business practices and implementing new technologies to significantly improve the quality of corporate information. Each of these is briefly described below, along with an example and an outline of key research directions.

Definition of Data Quality. Although the notion of "data quality" may seem intuitively obvious, it is not well defined in current practice. Our studies have revealed that data quality has a number of dimensions for data users, including accuracy, believability, relevancy, and timeliness. A clear and uniform articulation of data quality metrics is needed. In fact, even a relatively obvious dimension, such as accuracy, lacks a definition robust enough to make clear how the accuracy of data should be measured. This component of the research addresses issues of data quality definition, measurement, and derivation.

The research issues being addressed are: (a) identification of the key dimensions of data quality, (b) precise and meaningful definitions of each dimension, (c) methods of measuring each dimension for base data, and (d) a data quality algebra (DQA) for computing the quality of derived data.
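To make issues (c) and (d) concrete, here is a minimal sketch, in Python, of how two of the dimensions might be scored for base data and how a simple data quality algebra could propagate scores to derived data. The metric definitions, the validator input, and the min-based combination rule are illustrative assumptions, not the project's actual DQA.

    from dataclasses import dataclass

    @dataclass
    class QualityScore:
        # Scores on [0, 1] for two of the dimensions named above; the
        # remaining dimensions (believability, relevancy, timeliness)
        # would be handled analogously.
        accuracy: float
        completeness: float

    def measure_base_quality(records, is_accurate):
        # Completeness: fraction of non-missing values. Accuracy:
        # fraction of present values that pass a domain-specific
        # validator (the validator itself is an assumed input).
        present = [r for r in records if r is not None]
        completeness = len(present) / len(records) if records else 0.0
        accuracy = (sum(1 for r in present if is_accurate(r)) / len(present)
                    if present else 0.0)
        return QualityScore(accuracy, completeness)

    def derive_quality(*inputs):
        # One candidate DQA rule: derived data is only as good as its
        # worst input, so combine per-dimension scores with min. This
        # conservative rule is an assumption for illustration.
        return QualityScore(
            accuracy=min(q.accuracy for q in inputs),
            completeness=min(q.completeness for q in inputs),
        )

    # Example: quality of shipment weights, and of a value derived from
    # the weights plus a rate table whose quality was measured elsewhere.
    weights = [12.0, 15.5, None, 9.1]
    q_weights = measure_base_quality(weights, lambda w: 0 < w < 1000)
    q_rates = QualityScore(accuracy=0.95, completeness=1.0)
    print(derive_quality(q_weights, q_rates))
    # QualityScore(accuracy=0.95, completeness=0.75)

The min rule is only one possibility; a real algebra would need dimension-specific combination rules (for example, timeliness of derived data depends on the stalest input, while accuracy of an average may improve with more inputs).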

Analysis of Data Quality Impact on a Business. This component addresses the value-chain relationship between high-quality data and the successful operation of a business (the flip side being how low-quality data negatively impacts a business). Our analysis techniques relate data quality to key business parameters, such as sales, customer satisfaction, and profitability. To illustrate the importance of this kind of analysis, consider the case of a transportation company. In this company, poor data quality and usage caused 77% of missed deliveries, resulting in significant operating costs due to repeated work and rerouted shipments. Even more significant was the finding that the use of poor-quality information was the major reason for an estimated loss of market share valued at about $1 billion in sales.

The research issues here are: (a) quantification of the business impact of data quality through a collection of case studies, (b) development of Data Quality Value Chain Analysis (DQVCA) techniques to relate data quality to key business parameters such as sales, customer satisfaction, and profitability, and (c) development of an economic model of the value of quality data.
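As a rough illustration of item (c), a first-cut economic model might estimate the expected annual cost of poor-quality data as a direct rework component plus an indirect revenue component. The linear functional form and every figure below are invented for illustration; only the 10% defect rate echoes the text.

    def expected_cost_of_poor_data(error_rate, transactions_per_year,
                                   rework_cost_per_error,
                                   revenue_at_risk,
                                   share_loss_per_error_rate):
        # Toy economic model. Direct cost: errors that must be reworked
        # (e.g., rerouted shipments). Indirect cost: market share lost
        # as error-driven service failures accumulate. The parameter
        # names and linear form are assumptions, not project results.
        direct = error_rate * transactions_per_year * rework_cost_per_error
        indirect = revenue_at_risk * share_loss_per_error_rate * error_rate
        return direct + indirect

    # Hypothetical numbers, loosely patterned on the transportation case:
    cost = expected_cost_of_poor_data(
        error_rate=0.10,                 # ~10% defective data, per the text
        transactions_per_year=500_000,   # invented volume
        rework_cost_per_error=40.0,      # invented rework cost ($)
        revenue_at_risk=10e9,            # invented revenue base ($)
        share_loss_per_error_rate=0.01)  # invented sensitivity
    print(f"${cost:,.0f} per year")      # $12,000,000 per year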

Improvement of Data Quality. This component addresses various methods for improving data quality. These methods can be grouped into four interrelated categories: (i) business redesign, (ii) data quality motivation, (iii) use of new technologies, and (iv) data interpretation technology. Business redesign attempts to simplify and streamline operations to minimize the opportunity for data errors to occur. Data quality motivation deals with employee rewards, benefits, and perceptions, encouraging the appropriate members of the organization to pay more careful attention to the quality of the data they handle. New data capture technologies can significantly improve quality through techniques such as automated entry and direct inter-computer communication. Data interpretation technologies help the user understand the meaning of the data so that it is not used incorrectly. For example, in the transportation company described above, radio frequency-based data entry devices for capturing equipment and cargo inventory data were mounted on mobile vehicles that scanned up and down container yards for real-time inventory. This introduced both a new technology and a business redesign, resulting in more accurate and timely data.

The research issues being worked on are: (a) analyzing direct-entry technologies, such as mobile computing technologies, neural network techniques for handwriting analysis, and portable communicating terminals, (b) studying connectivity among information systems, (c) representing and automatically using knowledge about the semantics of the data, and (d) creating new paradigms for system design that incorporate data quality tags, such as for time and source.
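A minimal sketch of item (d): attaching quality tags (here, timestamp and source) to individual values and carrying them through a derivation, so a user can judge the timeliness and provenance of what they see. The tag set and propagation rules are assumptions for illustration; the project's actual tagging scheme may differ.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass(frozen=True)
    class TaggedValue:
        # A data value carrying quality tags for time and source,
        # as suggested in item (d). The tag set is illustrative.
        value: float
        source: str
        as_of: datetime

    def derive(fn, *inputs, source="derived"):
        # Apply fn to tagged inputs. The result inherits the oldest
        # timestamp (it can be no fresher than its stalest input)
        # and records the contributing sources as simple lineage.
        result = fn(*(t.value for t in inputs))
        oldest = min(t.as_of for t in inputs)
        lineage = source + " <- " + ", ".join(t.source for t in inputs)
        return TaggedValue(value=result, source=lineage, as_of=oldest)

    # Example: combine a cargo weight and a rate from different systems.
    weight = TaggedValue(1200.0, "yard-scanner", datetime(2002, 9, 14, 8, 30))
    rate = TaggedValue(0.05, "rate-table", datetime(2002, 9, 1, 0, 0))
    charge = derive(lambda w, r: w * r, weight, rate, source="billing")
    print(charge.value, charge.as_of, charge.source)
    # 60.0 2002-09-01 00:00:00 billing <- yard-scanner, rate-table

With tags like these in place, a downstream user of the billing figure can see at a glance that it rests on a rate table two weeks staler than the scanner reading, and judge its timeliness accordingly.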