Selecting Data Quality Indicators
Last updated
Note: for simplicity of explanation, this is described using a data mesh architecture and team structure. However, all of it is easily portable to a centralized architecture; the key here is business context and ownership.
In a centralized setup, the product owner would be the business owner or the consuming application's analyst.
Data product owners must define success criteria and business-aligned Key Performance Indicators (KPIs) for their data products.
These business needs will vary from company to company and from data domain to data domain.
Depending on your business KPIs, your DataOps team, including data owners, data engineers, and data infrastructure engineers, will first agree on:
Key data product quality metrics, such as completeness, consistency, and freshness
Thresholds (tolerances) for each metric, for example a 50% tolerance for incompleteness in product data
Monitoring and alerting policies, i.e., who needs to be proactively notified once there is an unacceptable drift beyond a threshold.
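As a minimal sketch, these three agreements (metrics, thresholds, and alert routing) could be captured together in a single rule definition. The class name, metric names, thresholds, and recipient groups below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class DQRule:
    """One agreed data-quality rule for a data product (illustrative)."""
    metric: str                  # e.g. "completeness", "freshness"
    threshold: float             # maximum tolerated violation ratio, 0.0-1.0
    alert_recipients: list = field(default_factory=list)

# Hypothetical rules a DataOps team might agree on for product data
rules = [
    DQRule("completeness", threshold=0.50, alert_recipients=["product-data-owner"]),
    DQRule("freshness", threshold=0.05, alert_recipients=["data-eng-oncall"]),
]

def breached(rule: DQRule, observed_violation_ratio: float) -> bool:
    """True when the observed drift exceeds the agreed tolerance."""
    return observed_violation_ratio > rule.threshold
```

Keeping the rule, its tolerance, and its alert recipients in one place makes the agreement auditable and lets a monitoring job decide, per rule, who to notify when `breached` returns true.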
These should be developed based on the six data quality dimensions, the organization's requirements for its data, and the impact on the organization if the data does not comply with these rules.
This is the most crucial step and needs alignment within the DataOps team, and often input from business teams as well.
Examples:
Incorrect or missing email addresses would have a significant impact on any marketing campaigns.
Inaccurate personal details may lead to missed sales opportunities or a rise in customer complaints; goods can get shipped to the wrong locations.
Incorrect product measurements can lead to significant transportation issues: the product may not fit into a truck, or too many trucks may have been ordered for the size of the actual load.
So, depending on the business impact, you can define thresholds (tolerances) for the DQ indicators.
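For instance, the email example above could translate into a completeness check like the following sketch, where the 2% tolerance and the sample records are assumptions chosen purely for illustration:

```python
def completeness(values):
    """Fraction of non-missing values in a column."""
    if not values:
        return 0.0
    non_missing = sum(1 for v in values if v not in (None, ""))
    return non_missing / len(values)

# Hypothetical customer emails; because campaigns depend on them,
# the business tolerates at most 2% missing addresses.
emails = ["a@example.com", "b@example.com", None, "c@example.com", ""]
TOLERANCE = 0.02  # assumed business-agreed threshold

score = completeness(emails)
if (1 - score) > TOLERANCE:
    print(f"ALERT: email completeness {score:.0%} is below tolerance")
```

A lower-impact field (say, a free-text comment column) would get a far looser tolerance under the same check, which is exactly the point of tying thresholds to business impact.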
Data generally only has value when it supports a business process or organizational decision making. The agreed data quality rules should take account of the value that data can provide to an organization. If it is identified that data has a very high value in a certain context, then this may indicate that more rigorous data quality rules are required in this context.