Basics of profiling

Data profiling is the process of examining, analyzing, and summarizing a dataset. The process yields a high-level overview that helps uncover data quality issues, risks, and overall trends.

Data profiling produces critical insights into data that data teams can then leverage to build data products.

With the tremendous amount of data available today, DataOps teams are increasingly overwhelmed by the information they have collected. As a result, they fail to take full advantage of their data, and its value and usefulness diminish.

Data profiling speeds up the design and development of an analytical cloud platform by identifying the transformations and data cleansing activities required to migrate the data safely.

Data profiling helps organize and manage big data to unlock its full potential and deliver powerful insights.

Common outcomes of profiling data

  • Detailed structural schema analysis

  • Data quality analysis to identify data content problems and risk areas

  • Distribution of data values and patterns to identify the different standards and rules inherent to the data

  • Redundant attributes that are either empty/incomplete, or have not been maintained
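
The outcomes above can be illustrated with a minimal sketch in plain Python. It is not how any particular profiling tool works; the function names (profile_rows, redundant_columns), the sample records, and the 50% missing-value threshold are all hypothetical choices made for illustration. For each column it collects the observed types (structure), the share of missing values (quality), the most frequent values (distribution), and then flags columns that are mostly empty or constant (possibly redundant).

```python
from collections import Counter


def profile_rows(rows):
    """Build a small per-column profile from a list of dict records.

    Assumes every cell value is hashable (strings, numbers, None).
    """
    # Collect column names in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    total = len(rows)
    profile = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        # Assumption: None and blank strings both count as missing.
        present = [v for v in values if v is not None and str(v).strip() != ""]
        counts = Counter(present)
        profile[name] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing_ratio": (total - len(present)) / total if total else 0.0,
            "distinct": len(counts),
            "top_values": counts.most_common(3),
        }
    return profile


def redundant_columns(profile, max_missing=0.5):
    """Flag columns that are mostly empty or hold at most one distinct value."""
    return [
        name
        for name, stats in profile.items()
        if stats["missing_ratio"] > max_missing or stats["distinct"] <= 1
    ]


# Hypothetical sample records, invented for this sketch.
SAMPLE_ROWS = [
    {"id": 1, "country": "US", "fax": None},
    {"id": 2, "country": "us", "fax": ""},
    {"id": 3, "country": "DE", "fax": None},
    {"id": 4, "country": "", "fax": "555-0100"},
]

report = profile_rows(SAMPLE_ROWS)
for column, stats in report.items():
    print(column, stats)
print("possibly redundant:", redundant_columns(report))
```

In this sample, the top values of the country column list "US" and "us" separately, which hints at an inconsistent coding standard, and the fax column is flagged because three of its four values are missing. A dedicated profiling library produces far richer reports, but the underlying idea is the same.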

One of the biggest challenges that data profiling addresses is helping to scope and assess the risks associated with a data migration or integration project.

Many open source libraries are available today for generating static profiling information. One such library, pandas-profiling, can be downloaded from here: https://github.com/pandas-profiling/pandas-profiling

For more interactive data analysis, refer to the next chapter.
