Data Quality Metrics

Telmai enables users to monitor a wide range of data quality metrics and receive alerts when anomalies or issues are detected. Alerts serve as warnings that something may not be as expected. Some alerts can be configured to trigger notifications, while others are displayed in the Telmai UI for informational purposes. Additionally, some alerts may initiate remediation actions, such as segregating good data from bad, triggering a circuit breaker for the pipeline, and more.

What is monitored?

  1. Out-of-the-Box Metrics: Telmai automatically monitors a variety of predefined metrics related to table metadata and health KPIs (see Data Health KPIs).

  2. Custom Metrics: Users can define their own metrics to monitor specific data quality concerns.

Each monitored metric is validated against a set of policies, which define thresholds, scope, notification settings, and more. If a metric violates a policy, an alert is generated. The flow diagram below illustrates how alerts are created.
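
As a rough illustration of the check described above, the sketch below compares one metric value against one policy and produces an alert on violation. The `Policy` and `Alert` classes, their fields, and the `evaluate` function are hypothetical names chosen for this example; they are not Telmai's API, and real policies also cover scope and other settings not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    """Hypothetical policy: an allowed range for one metric."""
    metric: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    notify: bool = False  # whether a violation should also send a notification


@dataclass
class Alert:
    """Hypothetical alert record produced when a policy is violated."""
    metric: str
    value: float
    message: str
    notify: bool


def evaluate(policy: Policy, value: float) -> Optional[Alert]:
    """Return an Alert if the value violates the policy, otherwise None."""
    if policy.min_value is not None and value < policy.min_value:
        message = f"{policy.metric}={value} is below {policy.min_value}"
        return Alert(policy.metric, value, message, policy.notify)
    if policy.max_value is not None and value > policy.max_value:
        message = f"{policy.metric}={value} is above {policy.max_value}"
        return Alert(policy.metric, value, message, policy.notify)
    return None


# Assumed example: completeness must stay at or above 95 percent.
completeness_policy = Policy(metric="completeness", min_value=95.0, notify=True)
ok = evaluate(completeness_policy, 99.2)     # within range, no alert
alert = evaluate(completeness_policy, 91.5)  # below threshold, alert generated
```

A metric with no policy attached never reaches a check like `evaluate`, which is why a metric alone does not produce alerts.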

Out-of-the-Box Metrics

  • Table Level Freshness: Time elapsed between the last change to the monitored table and the time of the scan

  • Record Count: Number of records being scanned

  • Total Table Records Count: Total number of records in the monitored table. This may be larger than the record count if the scan involves only a subset of data, such as when delta detection is configured

  • Correctness: Percentage of records that meet defined data quality rules

  • Completeness: Percentage of records with non-null/non-empty values

  • Record ID Uniqueness: Percentage of unique records based on the configured ID attribute.

  • Uniqueness: Percentage of unique values within an attribute
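
To make the percentage-based metrics more concrete, here is a minimal sketch of how completeness and uniqueness could be computed for a single attribute. The exact formulas Telmai uses are not stated on this page, so the definitions below are assumptions: completeness counts values that are neither null nor blank, and uniqueness is the share of distinct values among the non-null ones.

```python
from typing import Hashable, Optional, Sequence


def completeness(values: Sequence[Optional[str]]) -> float:
    """Percentage of values that are neither None nor empty/whitespace."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v is not None and v.strip() != "")
    return 100.0 * filled / len(values)


def uniqueness(values: Sequence[Optional[Hashable]]) -> float:
    """Percentage of distinct values among non-null values (assumed definition)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return 100.0 * len(set(present)) / len(present)


cities = ["Paris", "Oslo", "", None, "Paris"]
record_ids = ["a1", "a2", "a2", "a3"]

completeness_pct = completeness(cities)  # 3 of 5 values are filled
uniqueness_pct = uniqueness(record_ids)  # 3 distinct values out of 4
```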

User Defined Metrics

Users can create custom metrics to track specific anomalies. Telmai supports two types of custom metrics: Expression and SQL.

Metric as an Expression

This method lets users specify a simple aggregation with optional grouping. Example: SUM(`sales`) group by `region`, `country`
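
The sketch below shows, in plain Python, what a grouped aggregation such as the SUM example computes: one metric value per combination of grouping attributes. The sample rows and the `sum_by` helper are made up for illustration and say nothing about how Telmai evaluates expressions internally.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Made-up sample data standing in for a scanned table.
rows = [
    {"region": "EMEA", "country": "FR", "sales": 120.0},
    {"region": "EMEA", "country": "FR", "sales": 80.0},
    {"region": "EMEA", "country": "DE", "sales": 50.0},
    {"region": "APAC", "country": "JP", "sales": 200.0},
]


def sum_by(data: List[dict], value_key: str, group_keys: List[str]) -> Dict[Tuple, float]:
    """Sum value_key for every distinct combination of group_keys."""
    totals: Dict[Tuple, float] = defaultdict(float)
    for row in data:
        group = tuple(row[key] for key in group_keys)
        totals[group] += row[value_key]
    return dict(totals)


# One tracked value per (region, country) pair.
sales_by_group = sum_by(rows, "sales", ["region", "country"])
```

Each key in `sales_by_group` corresponds to one dimension combination, so a policy could be checked separately for every region and country pair.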

Metrics from Raw SQL

Alternatively, users can write a raw SQL query that returns a single metric and its dimensions. The first returned column must be numeric; this value is the tracked metric. The remaining columns are used as dimensions.

The SQL query runs against the specified data connector and is not limited to the corresponding data asset (i.e., you can join multiple tables in the query).

  • Applicable only to the following data connectors: BigQuery, Athena, Databricks, Trino, Snowflake, Redshift

  • Table names must be wrapped in backticks (`)

  • Use valid SQL syntax

  • Ensure the first selected column returns a numeric value

  • Example:

    • SELECT Emp_Salary, Emp_Region FROM `employee_table` WHERE Emp_Age > 60

    • Emp_Salary is the tracked metric

    • Emp_Region is the dimension
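
The column convention above can be tried out locally. The sketch below runs the example query against an in-memory SQLite database (used here only as a stand-in; SQLite is not one of the supported connectors) and splits each result row into a numeric metric value and a dictionary of dimensions. The sample rows and the `split_metric_rows` helper are assumptions made for this illustration.

```python
import sqlite3
from numbers import Number

# In-memory stand-in for the real data connector, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_table (Emp_Salary REAL, Emp_Region TEXT, Emp_Age INTEGER)"
)
conn.executemany(
    "INSERT INTO employee_table VALUES (?, ?, ?)",
    [(5200.0, "West", 61), (4800.0, "East", 64), (3900.0, "West", 45)],
)

query = "SELECT Emp_Salary, Emp_Region FROM `employee_table` WHERE Emp_Age > 60"


def split_metric_rows(connection, sql):
    """Return (metric_value, dimensions) pairs; the first column must be numeric."""
    cursor = connection.execute(sql)
    columns = [description[0] for description in cursor.description]
    results = []
    for row in cursor:
        value, dims = row[0], row[1:]
        if isinstance(value, bool) or not isinstance(value, Number):
            raise ValueError(f"first column {columns[0]!r} is not numeric: {value!r}")
        results.append((value, dict(zip(columns[1:], dims))))
    return results


metric_rows = split_metric_rows(conn, query)
```

Swapping the column order in the query (dimension first) makes `split_metric_rows` raise a `ValueError`, which mirrors the requirement that the first selected column be numeric.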

To add a new custom metric:

  1. Select the Dataset: Choose the dataset you want to monitor

  2. Navigate to “Alerting Policies”: Go to the “Metrics” tab

  3. Add a Custom Metric:

    • Click the “+ Custom Metric” button

    • A new window will open, allowing you to define the metric

  4. Select Type

    1. SQL: for raw SQL syntax

    2. Expression: for an aggregation expression

  5. Define the Metric:

    • Name: Enter a name for the metric

    • Description: Provide a brief explanation of the metric

    • Click Validate and Save to save the metric

Telmai will start monitoring this metric in future scans.

Important: Defining a metric here makes it available for tracking and alerting. However, no alerts will be generated until a policy is created using this metric.

Supported aggregations for Expressions

The following aggregation functions are supported:

min
max
count
avg
sum
distinct
variance
median
stddev

Aggregation Examples

// Count distinct age for different cities
count(distinct(`Age`)) group by `City`

// Sum of salaries over total count
sum(`salaries`)/count(*)

// Average age for each school and district
avg(`age`) group by `school`, `district`

SQL Examples

-- Fetches salary and names of employees earning over 2000.
SELECT Emp_Salary, FirstName, LastName FROM `EmployeeTable` WHERE Emp_Salary > 2000;

-- Counts records for each specialty and city combination from the json_3r_a table.
SELECT COUNT(*), Primary_Specialty, Address.City FROM `json_3r_a` GROUP BY Primary_Specialty, Address.City

-- Categorizes students by age group and lists their details, sorted by group.
SELECT CASE WHEN Age < 25 THEN 1 WHEN Age BETWEEN 25 AND 30 THEN 2 ELSE 3 END AS age_group, Name, Age, City FROM `students` ORDER BY age_group
