Telmai Academy

Basics of profiling

Last updated 3 years ago

Data profiling is the process of examining, analyzing, and creating useful summaries of a dataset. The process yields a high-level overview that helps in discovering data quality issues, risks, and overall trends.

Data profiling produces critical insights into data that data teams can then leverage to build data products.

With the tremendous amount of data available today, DataOps teams are increasingly overwhelmed by the information they have collected. As a result, they fail to take full advantage of their data, and its value and usefulness diminish.

Data profiling speeds up the design and development of an analytical cloud platform by identifying the transformations and data cleansing activities required to migrate the data safely.

Data profiling helps organize and manage big data to unlock its full potential and deliver powerful insights.

Common outcomes of profiling data

  • Detailed structural schema analysis

  • Data quality analysis to identify data content problems and risk areas

  • Distribution of data values and patterns to identify the different standards and rules inherent to the data

  • Redundant attributes that are either empty or incomplete, or that have not been maintained
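
The outcomes listed above can be illustrated with a small sketch. It is not how any particular profiling tool works. It assumes records arrive as a list of dictionaries with string values, and every name in it (`profile`, `infer_type`, `value_pattern`, the sample `rows`) is hypothetical. For each column it reports an inferred type (structure), a completeness ratio (quality), the most common values and value shapes (distribution and patterns), and a flag for columns that are entirely empty (redundant attributes).

```python
from collections import Counter
import re


def value_pattern(value):
    """Generalize a value into a shape: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    if not values:
        return "empty"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def profile(rows):
    """Return a per-column summary for a list of dict records."""
    columns = list(dict.fromkeys(key for row in rows for key in row))
    report = {}
    for col in columns:
        raw = [row.get(col) for row in rows]
        # Treat None and blank strings as missing values.
        present = [str(v).strip() for v in raw
                   if v is not None and str(v).strip() != ""]
        report[col] = {
            "inferred_type": infer_type(present),
            "completeness": len(present) / len(rows) if rows else 0.0,
            "distinct": len(set(present)),
            "top_values": Counter(present).most_common(3),
            "patterns": Counter(value_pattern(v) for v in present).most_common(3),
            "redundant": not present,  # column never populated
        }
    return report


# Hypothetical sample records, invented for illustration only.
rows = [
    {"id": "1", "zip": "94105", "phone": "415-555-0100", "fax": ""},
    {"id": "2", "zip": "9410", "phone": "415-555-0101", "fax": None},
    {"id": "3", "zip": "", "phone": "(415) 555-0102", "fax": ""},
    {"id": "4", "zip": "10001", "phone": "212-555-0103", "fax": ""},
]

report = profile(rows)
for column, summary in report.items():
    print(column, summary)
```

In this sample, the `zip` column is 75% complete and shows two shapes (`99999` and `9999`), which hints at a truncated value, `phone` mixes two formatting standards, and `fax` is flagged as redundant because it is never populated. A dedicated profiling library computes far more than this, but the categories of output are similar.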

One of the biggest challenges that data profiling addresses is helping to scope and assess the risks associated with a data migration or integration project.

For more interactive data analysis, refer to the next chapter.

There are many open source libraries available today for generating static profiling information. One such library, pandas-profiling, can be downloaded from here:

https://github.com/pandas-profiling/pandas-profiling
High-level profiling information