Interactive Profiling
Data profiling has been an essential step of data management for decades, and many tools are available to help with it.
Most of these tools are static and focus on providing a concise statistical representation of the data, so that DataOps teams can learn about its shape and compare it side by side. A variety of command-line and desktop tools are available. Data specialists have used them for decades, and they are indeed a great starting point. However, they are not very helpful for drilling down or tracking down problems in the data, because they provide only a small amount of high-level information. Adding more summaries doesn’t work either: the report becomes overwhelming and too complex to interpret, especially when the analyzed data has hundreds of attributes of various types.
An interactive profiler tries to mitigate these limitations by letting users explore the data through a rich user interface that supports investigative journeys into the data.
One of the most useful methods of data exploration is pattern analysis.
Pattern analysis works best at full fidelity, that is, not limited to the top 3 or 10 patterns, provided the result stays easy to consume and the pattern distribution makes outstanding anomalies easy to detect.
Users can easily glance through all patterns (or masks) present in the data set, drill down into those of interest, and see the related values.
Some tools, such as Telmai, distinguish between several types of patterns, for example compressed and expanded. Expanded patterns replace each character with its corresponding semantic type (digit, letter, space, special character), while compressed patterns use a single symbol for a sequence of characters of the same type. For highly structured data such as phone numbers, expanded patterns are more useful, while for people or company names compressed ones make more sense.
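The difference between the two pattern types can be illustrated with a minimal sketch. The symbol alphabet used here (D for digit, L for letter, _ for space, S for special character) and the function names are assumptions made for illustration only; they are not the notation or API of Telmai or any other tool.

```python
import re
from collections import Counter


def expanded_mask(value: str) -> str:
    """Replace every character with a symbol for its semantic type."""
    symbols = []
    for ch in value:
        if ch.isdigit():
            symbols.append("D")   # digit
        elif ch.isalpha():
            symbols.append("L")   # letter
        elif ch.isspace():
            symbols.append("_")   # space
        else:
            symbols.append("S")   # special character
    return "".join(symbols)


def compressed_mask(value: str) -> str:
    """Collapse each run of same-type characters into a single symbol."""
    return re.sub(r"(.)\1+", r"\1", expanded_mask(value))


def pattern_distribution(values, mask=expanded_mask) -> Counter:
    """Count how many values share each pattern, with no top-N cutoff."""
    return Counter(mask(v) for v in values)


print(expanded_mask("555-1234"))     # DDDSDDDD
print(compressed_mask("555-1234"))   # DSD
print(compressed_mask("Acme Corp"))  # L_L
```

With expanded masks, a phone number missing one digit yields a different pattern from its neighbors, whereas compressed masks let names of different lengths share a single pattern.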
Value counts are another tool for data exploration. As with patterns, most static profilers provide only the N most or least frequent values in the dataset. Value counts become really powerful, however, when the values and their counts can be analyzed under filtering conditions, such as seeing the list of values that have a certain pattern.
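As a rough sketch of value counts under a filtering condition, the snippet below counts only the values that match a chosen pattern. The mask symbols (D, L, _, S), the helper names, and the sample data are hypothetical and chosen for illustration.

```python
import re
from collections import Counter


def compressed_mask(value: str) -> str:
    """Map characters to type symbols, then collapse runs of the same symbol."""
    def symbol(ch: str) -> str:
        if ch.isdigit():
            return "D"
        if ch.isalpha():
            return "L"
        if ch.isspace():
            return "_"
        return "S"

    expanded = "".join(symbol(ch) for ch in value)
    return re.sub(r"(.)\1+", r"\1", expanded)


def value_counts_for_pattern(values, pattern: str) -> Counter:
    """Value counts restricted to values whose compressed mask equals `pattern`."""
    return Counter(v for v in values if compressed_mask(v) == pattern)


# Hypothetical column of phone-like values.
phones = ["555-1234", "555-1234", "5551234", "n/a", "555-9876"]

print(value_counts_for_pattern(phones, "DSD").most_common())
print(value_counts_for_pattern(phones, "LSL").most_common())
```

Filtering on an unexpected pattern such as LSL surfaces placeholder values like "n/a" that a global top-N list could easily hide.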
Another useful tool is statistical distribution analysis, together with outlier analysis based on these distributions. On top of the standard dimensions on which distributions are calculated (the frequency of a value and its length), it can be quite useful to consider less common ones, such as the number of spaces, punctuation marks, special symbols, etc.
Applying a score function (such as a z-score) to the values along these dimensions makes this a powerful way to find values that are outliers for various reasons.
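A minimal sketch of this idea follows, assuming four dimensions (frequency, length, number of spaces, number of ASCII punctuation characters) and a population z-score computed over the distinct values. The function names, the threshold of 2.0, and the sample data are illustrative assumptions, not a reference implementation.

```python
import statistics
import string
from collections import Counter

DIMENSIONS = ("frequency", "length", "spaces", "punctuation")


def measure(value: str, freq: Counter) -> dict:
    """Compute each dimension for a single distinct value."""
    return {
        "frequency": freq[value],
        "length": len(value),
        "spaces": value.count(" "),
        "punctuation": sum(ch in string.punctuation for ch in value),
    }


def z_scores(values) -> dict:
    """Return {value: {dimension: z-score}} over the distinct values."""
    freq = Counter(values)
    measures = {v: measure(v, freq) for v in freq}
    scores = {v: {} for v in freq}
    for dim in DIMENSIONS:
        column = [m[dim] for m in measures.values()]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        for v, m in measures.items():
            # A constant dimension carries no outlier signal, so score it 0.
            scores[v][dim] = 0.0 if std == 0 else (m[dim] - mean) / std
    return scores


def outliers(values, threshold: float = 2.0) -> dict:
    """Keep only the dimensions on which a value deviates by >= threshold."""
    flagged = {}
    for v, dims in z_scores(values).items():
        extreme = {d: z for d, z in dims.items() if abs(z) >= threshold}
        if extreme:
            flagged[v] = extreme
    return flagged


names = ["Anna", "Omar", "Lena", "Ivan", "Nora", "Hugo", "Maya", "Eric",
         "Zoey", "Acme International Holdings Group"]
print(outliers(names))
```

Reporting the dimension alongside the score tells the user why a value was flagged: here the company name stands out on both length and number of spaces.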
When users can sort data on various dimensions and apply filters, such as keeping only values within a given length or frequency range, it becomes a very powerful tool for finding blind spots in the data.
Profiling is often a preliminary step for setting data quality rules, which usually take the form of restrictions or expectations on values: for example, a value should be between 5 and 10 characters long, or value frequency should not exceed 1. Distribution analysis helps here. Users can interactively select portions of the distribution histogram for any dimension and filter down to the values that fall within them. This makes it much easier to tune rules to be more precise.
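The range selection described above can be approximated in code. The sketch below builds a length histogram, selects the values inside a chosen range, and reports what share of the data a candidate length rule would accept. All function names and the sample values are hypothetical.

```python
from collections import Counter


def length_histogram(values) -> Counter:
    """Number of values observed at each length."""
    return Counter(len(v) for v in values)


def select_range(values, low: int, high: int, dimension=len) -> list:
    """Values whose dimension falls within [low, high], like brushing a histogram."""
    return [v for v in values if low <= dimension(v) <= high]


def rule_coverage(values, low: int, high: int) -> float:
    """Share of values that satisfy the rule `low <= length <= high`."""
    if not values:
        return 0.0
    return len(select_range(values, low, high)) / len(values)


# Hypothetical postal-code column.
codes = ["12345", "12345-6789", "1234", "98765", "987654321012"]

print(sorted(length_histogram(codes).items()))
print(rule_coverage(codes, 5, 10))
print([v for v in codes if v not in select_range(codes, 5, 10)])
```

Trying a few candidate ranges and inspecting the values that fall outside each one is a quick way to see whether a rule is too strict or too loose before committing to it.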
Interactive profiling gives users much deeper insight into the data and its quality by organizing and simplifying the presentation of enormous amounts of statistical information that would otherwise be overwhelming as static profiler output.