Delta Lake

This article outlines the integration steps for Telmai with Delta Tables.

Introduction

Delta Lake is an open source storage layer that brings reliability to data lakes.

Telmai monitors the data in Delta tables to identify anomalies such as outliers and drift. Telmai is designed to read your data once and process it outside your data warehouse architecture, minimizing the monitoring load on your Databricks instance.

The sections below walk you through setting up Telmai to monitor the data in Delta tables.

Capture required information for Telmai

The first step is to capture the required information about the Delta table source that needs to be monitored. All of this information can be found in the Databricks workspace that hosts the Delta table, as described below.

a. Connection Details

JDBC connectivity details can be found at the following location in the Databricks workspace console.

Navigate to Compute -> Cluster -> Cluster Name -> Configuration -> Advanced Options -> JDBC/ODBC

Capture the following details:

  1. Server Hostname

  2. Port

  3. HTTP Path

Reference: https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html
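
For illustration, the captured values plug into a Databricks JDBC URL roughly as in the sketch below. The placeholder hostname, port, and HTTP path, as well as the legacy Simba-driver URL template, are assumptions; confirm the exact format for your driver version against the Databricks JDBC documentation linked above.

```python
# Minimal sketch: assembling a Databricks JDBC URL from the captured values.
# Placeholder values and the URL template are assumptions -- verify against
# the Databricks JDBC/ODBC documentation for your driver version.

server_hostname = "dbc-xxxxxxxx-xxxx.cloud.databricks.com"  # Server Hostname
port = 443                                                  # Port
http_path = "sql/protocolv1/o/0/0000-000000-example000"     # HTTP Path

jdbc_url = (
    f"jdbc:spark://{server_hostname}:{port}/default;"
    f"transportMode=http;ssl=1;httpPath={http_path};"
    "AuthMech=3;UID=token;PWD=<personal-access-token>"
)
print(jdbc_url)
```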

b. Security Token

A token is required to connect to the cluster remotely. The token can be created from the Databricks workspace console.

Navigate to your user name (top-right corner) -> User Settings -> Access Tokens -> Generate New Token.

Capture the generated token.

Reference: https://docs.databricks.com/dev-tools/api/latest/authentication.html#token-management
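
A quick way to sanity-check the captured token is to call the Databricks Token API with it. The sketch below assumes the `requests` package and the `/api/2.0/token/list` endpoint described in the reference above; hostname and token values are placeholders.

```python
# Minimal sketch: verifying the personal access token by listing tokens via
# the Databricks Token API. Hostname and token values are placeholders.
import requests

server_hostname = "dbc-xxxxxxxx-xxxx.cloud.databricks.com"
access_token = "<personal-access-token>"

resp = requests.get(
    f"https://{server_hostname}/api/2.0/token/list",
    headers={"Authorization": f"Bearer {access_token}"},
)
resp.raise_for_status()  # a successful response confirms the token is valid
print(resp.json())
```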

c. Table Details

From the Data section in the Databricks workspace, identify the database and table to be monitored.
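
Optionally, you can confirm that the captured host, HTTP path, token, database, and table all line up with a short query through the `databricks-sql-connector` package. This is a sketch with placeholder values, not a required step for Telmai.

```python
# Minimal sketch: confirming the captured connection details and table name
# using the databricks-sql-connector package
# (pip install databricks-sql-connector).
# All values below are placeholders for the details captured in steps a-c.
from databricks import sql

connection = sql.connect(
    server_hostname="dbc-xxxxxxxx-xxxx.cloud.databricks.com",
    http_path="sql/protocolv1/o/0/0000-000000-example000",
    access_token="<personal-access-token>",
)
cursor = connection.cursor()
cursor.execute("SELECT COUNT(*) FROM my_database.my_delta_table")
print(cursor.fetchone())  # row count confirms the table is reachable
cursor.close()
connection.close()
```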

Source in Telmai

Once the information is collected, the next step is to create a source in Telmai. Navigate to the Telmai portal and choose Configuration -> Connect to Data -> Delta Lake. The following screen appears.

Fill in the details captured in the previous step and choose “Connect”. Note that “Schema” maps to the database field of the Delta Lake connection details. This initiates the source creation and also performs a schema introspection of the table to determine its top-level schema.

Delta Only

With Delta Lake tables, Telmai can natively monitor the source for only the changed records and analyze just those. This differs from other source types, where a timestamp-based column is needed to identify changed records.

To enable delta-only monitoring, follow the instructions below. In the Telmai configuration, choose Source -> Edit connection -> Advanced.

Enable the “Delta Only” checkbox. Going forward, only the changes since the previous run will be loaded.

Implementation

Delta tables are versioned natively: every write to a Delta table creates a new version. The version history can be accessed, and the data in the table can be retrieved as of a specific version or timestamp. When Delta Only is enabled for a Telmai source, each scheduled run checks whether a newer version of the table exists beyond the last successful analysis job in Telmai. If newer versions are available, Telmai identifies the latest version of the table and diffs its data against the version from the last run to obtain the changed records, which are then analyzed as part of the analysis job.

Reference: https://docs.databricks.com/delta/delta-utility.html (see Table History)
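
For illustration, the version-diff approach described above looks roughly like the following when expressed in PySpark on a Databricks cluster. This is a sketch of the general technique, not Telmai's actual implementation; the table name and version numbers are placeholders, and a SparkSession named `spark` is assumed to be available (as it is in Databricks notebooks and jobs).

```python
# Minimal sketch of the version-diff technique described above, not Telmai's
# actual implementation. Table name and version numbers are placeholders.
# Assumes a Databricks environment where `spark` is already available.

table = "my_database.my_delta_table"

# Inspect the table's version history and find the latest version.
history = spark.sql(f"DESCRIBE HISTORY {table}")
latest_version = history.selectExpr("max(version)").first()[0]

last_analyzed_version = 5  # version recorded at the previous successful run

if latest_version > last_analyzed_version:
    # Time travel to both versions of the table.
    current_df = spark.sql(
        f"SELECT * FROM {table} VERSION AS OF {latest_version}"
    )
    previous_df = spark.sql(
        f"SELECT * FROM {table} VERSION AS OF {last_analyzed_version}"
    )

    # Rows present in the latest version but not in the previously analyzed one.
    changed_records = current_df.exceptAll(previous_df)
    changed_records.show()
```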
