Knowledge Base | Wizata

Data quality: gathering the best raw material to feed your analytics engine

Written by Raphaël Cayrol | 19 November 2018

The quality of your data influences every process that depends on it. If the data is inconsequential or fragmentary, insights derived from it will be false, misleading or biased. Ziad Benslimane, our Lead Data Engineer, points out key aspects of data that determine the outcomes of business decisions.

In all but the rarest cases, data quality management shouldn’t happen in parallel with the pure data science experiments; it should ensure beforehand that the input data has all the right attributes. What counts as quality is contingent upon the context of your goals. “It depends on the business issues we’re trying to solve: the quality of a metal sheet isn’t comparable to the quality of a web page”, says Ziad Benslimane. “For example, the timeliness of the information could be of utmost importance in a manufacturing plant, where every millisecond counts, whereas crawling through a website isn’t something that must be done constantly.”

For an analytics pipeline to work, “you need someone from the business side who understands the data well enough to verify it, and to identify and correct errors if necessary. The business expert can, for example, quickly spot that a negative weight is problematic, or, conversely, that a null value is correct if the material hasn’t been weighed at its current step yet. You can ease this process by offering the business user easy-to-use but powerful self-service tools such as PowerBI, which give them the information they need”.
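The weight example above can be sketched as a simple rule. This is a minimal, hypothetical check — the record layout and the `weight_kg` field are assumptions for illustration, not an actual Wizata schema:

```python
# Hypothetical quality check mirroring the business expert's rule:
# a negative weight is an error, while a null weight may simply mean
# the material has not been weighed at its current step yet.

def check_weight(record):
    """Classify a measurement record as 'ok', 'pending' or 'error'."""
    weight = record.get("weight_kg")
    if weight is None:
        return "pending"   # not weighed yet: the null is legitimate
    if weight < 0:
        return "error"     # physically impossible: flag for review
    return "ok"

records = [
    {"id": 1, "weight_kg": 812.5},
    {"id": 2, "weight_kg": None},
    {"id": 3, "weight_kg": -4.0},
]
print([check_weight(r) for r in records])  # ['ok', 'pending', 'error']
```

The point is that the rule itself comes from the business expert; the code only automates it once the expert has decided which nulls are errors and which are expected.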

Add an IT expert who knows where the data comes from, how to manage it and how to fix back-end issues, and you have all the building blocks for an exploratory workshop, held before a feasibility study composed of the business understanding and data understanding phases.

In the context of manufacturing, what specific attributes should you aim for?

Accuracy
Are data values stored for an object the correct values?

Validity
Does the data follow the business rules, such as the range of possible values (e.g. the length of a coil in the steel industry), the list of acceptable values and the data types? Is there null data? Are there empty fields?
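Validity rules like these can be encoded directly. The sketch below uses the coil-length example from the text; the field names, bounds and steel grades are assumptions chosen for illustration:

```python
# Illustrative validity rules for a hypothetical coil dataset.
# The bounds and the set of grades are made-up examples, not real limits.

VALID_GRADES = {"S235", "S355", "DX51D"}
MIN_LENGTH_M, MAX_LENGTH_M = 50.0, 5000.0

def validity_errors(row):
    """Return a list of business-rule violations for one record."""
    errors = []
    length = row.get("coil_length_m")
    if length is None:
        errors.append("coil_length_m is null")
    elif not (MIN_LENGTH_M <= length <= MAX_LENGTH_M):
        errors.append(f"coil_length_m {length} outside [{MIN_LENGTH_M}, {MAX_LENGTH_M}]")
    grade = row.get("grade")
    if not grade:
        errors.append("grade is empty")
    elif grade not in VALID_GRADES:
        errors.append(f"unknown grade {grade!r}")
    return errors

print(validity_errors({"coil_length_m": 1200.0, "grade": "S355"}))  # []
print(validity_errors({"coil_length_m": 9000.0, "grade": ""}))      # two violations
```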

Completeness
Is all the information needed to understand the business issues present? What percentage of the required dataset is available?
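That percentage can be measured with a simple metric: the share of required fields that are actually filled across the dataset. The required field names below are hypothetical:

```python
# A basic completeness metric: filled required fields / total required fields.
# REQUIRED_FIELDS is an illustrative list, not an actual schema.

REQUIRED_FIELDS = ["coil_id", "timestamp", "coil_length_m", "grade"]

def completeness(rows, required=REQUIRED_FIELDS):
    """Fraction of required fields that are non-null and non-empty."""
    total = len(rows) * len(required)
    if total == 0:
        return 1.0
    filled = sum(
        1 for row in rows for field in required
        if row.get(field) not in (None, "")
    )
    return filled / total

rows = [
    {"coil_id": "C1", "timestamp": "2018-11-19T08:00", "coil_length_m": 1200.0, "grade": "S355"},
    {"coil_id": "C2", "timestamp": None, "coil_length_m": 950.0, "grade": ""},
]
print(f"{completeness(rows):.0%}")  # 6 of 8 required fields filled -> 75%
```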

Reliability
Is the data complete and sufficiently error-free to be useful?

Consistency
Is the data’s usability adequate and stable over time?

Timeliness
Is the data needed in real time? Is the granularity of the recorded data suitable? Is lag a problem? Do you need to analyze the sensor data in near real time to provide useful insights?

Accessibility
Can you store, retrieve and act on the data where it is stored?