Are you worried about the quality of your data?
Poor data quality can be a major contributor to a fractured and sluggish digital journey, where ROI is hard to achieve and results come in short supply. Data quality directly affects the enterprise's ability to become aware of relevant events, as well as its reaction time, decision time, and action time. A clear, concerted effort is required to measure and improve data quality in order to drive better decisions and actions.
Common Quality Issues
The following are the most common data quality issues:
Comprehensiveness quality issues refer to key attributes or data points missing from the data collected by the enterprise. This can occur when data-producing systems or data delivery networks glitch, malfunction, or are misconfigured in a way that drops entire rows or attributes of the data.
Integrity quality issues refer to the corruption of key attribute values into unidentifiable or unreadable data. They arise when key attributes are empty or null even though, by design, they are not allowed to be, or when an attribute contains a value that does not meet the attribute's type specification: for example, a string column that contains an integer, or a timestamp column that contains a string not parseable into a timestamp. Data integrity must be established before data can be included in the data set that drives analysis, decisions, and actions.
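A minimal sketch of such an integrity check, assuming a hypothetical schema that maps each attribute to its expected type and nullability (the attribute names and schema shape here are illustrative, not a standard):

```python
from datetime import datetime

# Hypothetical schema: attribute name -> (expected type, nullable?)
SCHEMA = {
    "user_id": (int, False),
    "event_time": (datetime, False),
    "region": (str, True),
}

def integrity_errors(record: dict) -> list[str]:
    """Return a list of integrity violations for a single record."""
    errors = []
    for attr, (expected_type, nullable) in SCHEMA.items():
        value = record.get(attr)
        if value is None:
            if not nullable:
                errors.append(f"{attr}: null in non-nullable attribute")
        elif not isinstance(value, expected_type):
            errors.append(f"{attr}: expected {expected_type.__name__}, "
                          f"got {type(value).__name__}")
    return errors

# A record carrying a string where a timestamp is expected fails the check,
# while the null region passes because that attribute is nullable by design.
bad = {"user_id": 42, "event_time": "not-a-timestamp", "region": None}
```

Running records through a check like this before they enter the data set is what keeps the downstream analysis from silently ingesting corrupted values.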
Sampling quality issues refer to the inclusion or exclusion of a certain percentage of the records in a data set, on the assumption that the remaining records are a good, representative sample of the original data set. Bad or inaccurate sampling can produce a distorted view of reality, which in turn can lead to bad decisions. In addition, sampling itself can make the data set inappropriate for certain types of analysis that require the entire data set to be utilized for training.
An upstream filtering scheme can remove too many or too few records from a data set, again rendering it incomplete and inappropriate for the intended analysis. Filtering is a good technique for removing records with integrity issues; however, incorrect filter criteria can have effects that are unintended or unforeseen.
Aggregation of data can reduce fidelity and mask or hide details that might be relevant to the analysis at hand. Aggregation can occur over time, e.g. a metric by second, minute, hour, day, week, month, quarter, or year. It can also occur over the values of a descriptive attribute, such as a metric by location. Aggregation is typically performed and often required; however, when done too early, such as at data generation time, it can lead to the loss of attributes that can no longer be included in the aggregated data set.
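The fidelity loss is easy to see in a small sketch. The event shape and attribute names below are hypothetical; the point is that once raw events are rolled up into an hourly average, the per-device detail is unrecoverable from the aggregate:

```python
from collections import defaultdict

# Hypothetical raw events: each carries a device attribute
events = [
    {"hour": 9,  "latency_ms": 120, "device": "mobile"},
    {"hour": 9,  "latency_ms": 80,  "device": "desktop"},
    {"hour": 10, "latency_ms": 200, "device": "mobile"},
]

# Aggregating to an hourly average discards the device attribute:
# a later per-device analysis is impossible from the aggregate alone.
totals, counts = defaultdict(float), defaultdict(int)
for e in events:
    totals[e["hour"]] += e["latency_ms"]
    counts[e["hour"]] += 1
hourly_avg = {h: totals[h] / counts[h] for h in totals}
```

If the aggregation had been performed at data generation time, no downstream consumer could ever ask how mobile and desktop latencies differ.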
Data transformations can add or remove attributes or records from the data set, with new attributes created by functions and operations applied to one or more existing attributes. Transformations can also leave certain attribute values corrupted or uninterpretable.
Spotting Data Quality Issues
The enterprise should ensure that all data is checked for conformity to the expected type definitions. For example, the type of each attribute should be verified, and records that contain unrecognized or unexpected values should be ignored or flagged as flawed.
The enterprise should ensure that variations in attributes and attribute values, including the frequency, volume, and velocity of the data, are tracked and monitored. Unexpected variations should be identified and inspected; the anomalous records can then be removed, ignored, or determined to be acceptable and included in the data set.
Addressing Data Quality Issues
Tracking the lineage of data is important to ensure that data consumers are aware of how, when, and what data has arrived in the system, and of the processing the data might have been subject to, including transformations that might have changed its structure and meaning.
Tracking the provenance of data ensures that data consumers can understand the source of the data and the conditions, assumptions, and biases of where and how it was produced and delivered.
Enterprises should ensure that each level or stage in the data delivery path contains validators that test the data, attributes, and attribute values for conformity to the expected definitions and highlight any records that do not adhere to them. These records can then be handled appropriately and either included in or excluded from the analysis.
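A stage validator of this kind can be sketched as a list of predicate functions that partition incoming records into conforming and flagged sets. The validator names and record fields below are hypothetical, chosen only to illustrate the pattern:

```python
# Hypothetical per-stage validators: each returns True if a record conforms
def has_required_attrs(record: dict) -> bool:
    return {"id", "amount"} <= record.keys()

def amount_in_range(record: dict) -> bool:
    return isinstance(record.get("amount"), (int, float)) and record["amount"] >= 0

VALIDATORS = [has_required_attrs, amount_in_range]

def partition(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into those passing every validator and those flagged."""
    passed, flagged = [], []
    for r in records:
        (passed if all(v(r) for v in VALIDATORS) else flagged).append(r)
    return passed, flagged

records = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": -3}, {"id": 3}]
good, flagged = partition(records)
```

Because the flagged records are kept rather than dropped, each stage leaves the decision of inclusion or exclusion to the consumers of the data.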
Ultimately, enterprises need a data platform that enables continuous, open inspection of all data by multiple users, systems, and processes, along with the ability to ensure that the data conforms to the expected standards. The definition of quality is often temporal and changes from analysis to analysis; enterprises need a system that can support this temporal definition of quality. Enterprises often get bogged down building data platforms and are unable to focus on the quality of their data and the signals it holds.
The AI Company offers an innovative solution to deal with this problem. Talk to us to learn more!