Completeness
Completeness, is one the most basic and but crucial indicator of data quality
Completeness indicates the total populated data versus the potential 100% complete

Measuring completeness

Completeness is commonly defined as not-null values, but often can go beyond not-null values. So when creating a completeness score we recommend reviewing the following ways to measure it.
Data is populated, at whatever specified levels(refer below). This could be calculated using simple null checks at your datastore level. Data is populated with values that exclude the template/null proxies. Examples of null proxies : 000-000-000, N/A, none, not-defined, Not-applicable , NA
Common unit of Measure: percentage(%)

Examples

Example(s): Customer table of 3M customers has 2.94M populated emails. Then completeness of Emails is 2.94/3 x 100 = 98%
Industry sample(Uber Engineering) : Completeness
Prerequisite
Formula
Upstream and downstream datasets are 1:1 mapped
downstream_row_count / upstream_row_count > completenessSLA
Offline job to ingest sampled upstream data to Hive
sampled_row_count_in_hive / total_sampled_row_count > completenessSLA

Different levels of measurement

Dataset/Database: Total data present across all the tables in the entire dataset or database. Expected values are usually calculated against historical data or against other data sources in the pipeline.
Table/schema : Total data populated in specific tables like Customer, Order, or transactions. Expected values are usually calculated against historical data or against other data sources in the pipeline.
Incomplete data compared between data storage
Metadata: The degree to which both technical and business metadata information is populated. This is usually measured as not-null values for metadata like descriptions, security tags, creation date etc in your data catalog.
Record : Total number records in the dataset. Expected values can be measured against historic trends or lineage across various data sources in the pipeline.
For example, based on historical trends users expect 1M records but receive only 500k causing incomplete records. Or BigQuery has 1M records whereas Cassandra has only 800k records, indicating the records are incomplete.
Attributes: Total number of attributes in a schema definition.Expected value can be specified by business requirements. This is a useful indicator to ensure schema is complete across multiple data sources in your data pipeline and can be tracked through lineage of schema.
For Example, Customer table in BigQuery has 18 attributes, whereas Cassandra has only 16 attributes, often classified as schema mismatch too.
Value: Total numbers of populated attribute values across all rows. For example, if a table has 10 rows with 5 attributes, this will indicates how many values are populated across these 50 values.
An attribute value: Total numbers of populated attribute values for a specific attribute.
For example, if a table has 10 rows with 5 attributes one such attribute is email, this will indicate how many out of the 10 records have email’s populated.
Last modified 1mo ago