Completeness

Completeness, is one the most basic and but crucial indicator of data quality

Completeness is defined as not-null values but often can go beyond not-null values. When creating a completeness score, we recommend reviewing the following ways to measure it.

Data is populated at a specified level (refer to below); this can be calculated using simple null checks at your datastore level. Data is populated with values that exclude the template/null proxies. Examples of null proxies : 000-000-000, N/A, none, not-defined, Not-applicable , NA

The standard unit of Measure: percentage(%) Examples Example(s): Customer table of 3M customers has 2.94M populated emails.

Then completeness of Emails is 2.94/3 x 100 = 98%

Industry sample(Uber Engineering) : Completeness

Prerequisite

Formula

Upstream and downstream datasets are 1:1 mapped

downstream_row_count / upstream_row_count > completenessSLA

Offline job to ingest sampled upstream data to Hive

sampled_row_count_in_hive / total_sampled_row_count > completenessSLA

Different levels of measurement

Dataset/Database: Total data present across all the tables in the entire dataset or database. Expected values are usually calculated against historical data or against other data sources in the pipeline.

Table/schema : Total data populated in specific tables like Customer, Order, or transactions. Expected values are usually calculated against historical data or against other data sources in the pipeline.

Metadata: The degree to which both technical and business metadata information is populated. This is usually measured as not-null values for metadata like descriptions, security tags, creation date etc in your data catalog.

Record : Total number records in the dataset. Expected values can be measured against historic trends or lineage across various data sources in the pipeline.

For example, based on historical trends users expect 1M records but receive only 500k causing incomplete records. Or BigQuery has 1M records whereas Cassandra has only 800k records, indicating the records are incomplete.

Attributes: Total number of attributes in a schema definition.Expected value can be specified by business requirements. This is a useful indicator to ensure schema is complete across multiple data sources in your data pipeline and can be tracked through lineage of schema.

For Example, Customer table in BigQuery has 18 attributes, whereas Cassandra has only 16 attributes, often classified as schema mismatch too.

Value: Total numbers of populated attribute values across all rows. For example, if a table has 10 rows with 5 attributes, this will indicates how many values are populated across these 50 values.

An attribute value: Total numbers of populated attribute values for a specific attribute.

For example, if a table has 10 rows with 5 attributes one such attribute is email, this will indicate how many out of the 10 records have email’s populated.

PreviousSelecting Data Quality Indicators NextUniqueness

Last updated 2 years ago

Was this helpful?