In the era of data-driven decision-making, the ability to understand, analyze, and clean data efficiently is crucial. YData Profiling (formerly pandas-profiling) is a powerful open-source Python library that automates exploratory data analysis (EDA) by generating detailed reports on dataset characteristics.
Instead of manually writing numerous lines of code to explore missing values, correlations, distributions, and data quality, YData Profiling simplifies the process to a single command that delivers an interactive, visually rich HTML report.
The primary goal of YData Profiling is to streamline the process of understanding datasets before performing complex data analysis or machine learning tasks. It helps users quickly explore datasets without writing extensive code, identify data quality issues such as missing values, duplicates, and outliers, and detect correlations and relationships between variables.
Additionally, it enables users to compare datasets to analyze differences over time or between dataset versions. It also provides capabilities for analyzing time-series and big data efficiently while ensuring data privacy and security by detecting sensitive information such as emails, phone numbers, and personal identifiers.
One of the biggest advantages of YData Profiling is its ability to automate insights and generate detailed analysis reports with minimal effort. This drastically reduces the need for manual exploratory data analysis, saving hours of work.
The reports generated by YData Profiling are interactive and visually appealing, making it easier to identify patterns and insights. Moreover, it is optimized for handling large-scale datasets, ensuring smooth performance even with big data.
For businesses and data professionals dealing with sensitive information, YData Profiling includes privacy-preserving features that help detect and manage sensitive data before sharing or processing it further. It also integrates seamlessly with tools like pandas and Jupyter Notebook, making it an essential addition to any data scientist’s toolkit.
In this blog, we will explore everything YData Profiling offers—from installation to advanced features like dataset comparison, time-series analysis, metadata extraction, and big data handling.
Let’s dive in and see how YData Profiling can revolutionize your data analysis workflow! By the end of this guide, you’ll know how to install, use, and interpret YData Profiling reports.
Before using YData Profiling, install it along with necessary libraries. Run the following command in your Jupyter Notebook cell:
!pip install ydata-profiling
To start, import the necessary Python libraries.
import pandas as pd
import numpy as np
import seaborn as sns
from ydata_profiling import ProfileReport
For demonstration, we use the Titanic dataset from Seaborn.
This dataset contains details of Titanic passengers, such as age, gender, ticket class, and survival status.
Run the following code to import the dataset.
df = sns.load_dataset("titanic")
df.head() # Display first 5 rows
|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
Now, we generate a detailed profiling report using YData Profiling.
profile = ProfileReport(df, explorative=True)
profile.to_notebook_iframe() # Display interactive report inside Jupyter Notebook
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 41/41 [00:07<00:00, 5.58it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.79s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.85s/it]
🔹 To save the report as an HTML file, run the following code:
profile.to_file("ydata_profiling_report.html")
Export report to file: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.75it/s]
Instead of just viewing the report, we can extract useful dataset metadata programmatically.
description = profile.get_description()
# Print available attributes
print(dir(description))
['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'alerts', 'analysis', 'correlations', 'duplicates', 'missing', 'package', 'sample', 'scatter', 'table', 'time_index_analysis', 'variables']
We can extract basic dataset statistics like:
dataset_summary = description.table
dataset_summary
{'n': 891,
'n_var': 15,
'memory_size': np.int64(285564),
'record_size': 320.4983164983165,
'n_cells_missing': np.int64(869),
'n_vars_with_missing': 4,
'n_vars_all_missing': 0,
'p_cells_missing': np.float64(0.06502057613168724),
'types': {'Categorical': 8, 'Numeric': 4, 'Boolean': 3},
'n_duplicates': 53,
'p_duplicates': 0.05948372615039282}
🔹 The output gives an overview of the dataset, including the row count (n), the number of variables (n_var), memory usage, the count and percentage of missing cells, the detected variable types, and the number of duplicate rows.
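The same summary numbers can be reproduced with plain pandas, which makes a useful sanity check against the report. A minimal sketch on a small hypothetical frame (the data here is made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical 4-row frame with one missing cell
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": ["x", "y", "y", "x"],
})

n = len(df)                                      # rows ("n" in the report)
n_var = df.shape[1]                              # variables ("n_var")
n_cells_missing = int(df.isnull().sum().sum())   # missing cells
p_cells_missing = n_cells_missing / df.size      # fraction of missing cells
n_duplicates = int(df.duplicated().sum())        # duplicate rows

print(n, n_var, n_cells_missing, p_cells_missing, n_duplicates)
```

On the Titanic data these formulas line up with the report: 869 missing cells out of 891 × 15 = 13,365 gives the 6.5% shown above.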
Missing values can affect analysis. Let’s identify columns with missing values.
missing_values = df.isnull().sum()
missing_values[missing_values > 0] # Show only columns with missing values
age 177
embarked 2
deck 688
embark_town 2
dtype: int64
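A common follow-up once missing values are identified is imputation. As a sketch (the frame below is hypothetical, mirroring the columns flagged above, and the choices are illustrative): impute numeric columns with the median, categorical columns with the mode, and drop columns that are mostly empty.

```python
import pandas as pd
import numpy as np

# Hypothetical frame mirroring the columns flagged above
df = pd.DataFrame({
    "age":      [22.0, np.nan, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
    "deck":     [None, "C", None, None],   # mostly missing, like "deck" above
})

df["age"] = df["age"].fillna(df["age"].median())                   # numeric: median
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])   # categorical: mode
df = df.drop(columns=["deck"])                                     # too sparse to keep

print(df.isnull().sum().sum())  # → 0
```

Which strategy is appropriate depends on the column: for Titanic, the deck column is roughly 77% missing, so dropping it is often more defensible than imputing it.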
Duplicate records can distort analysis. Let’s check and remove duplicates.
print("Total Duplicates:", df.duplicated().sum())
df_cleaned = df.drop_duplicates()
print("New dataset shape:", df_cleaned.shape)
print(df_cleaned)
Understanding correlations helps in feature selection and detecting relationships between variables.
# Get available correlation matrices
correlations = description.correlations
print(correlations.keys()) # Check which correlation types are available
dict_keys(['auto'])
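The 'auto' matrix mixes association measures appropriate to each pair of variable types. For a quick numeric-only Pearson matrix, plain pandas is enough; a small sketch with made-up data:

```python
import pandas as pd

# Hypothetical numeric frame
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # exactly 2 * x, so perfectly correlated
    "z": [5, 3, 4, 1, 2],    # tends to fall as x rises
})

corr = df.corr(numeric_only=True)  # Pearson by default
print(corr.loc["x", "y"])          # → 1.0
print(corr.loc["x", "z"] < 0)      # → True
```

Strongly correlated feature pairs, whichever way you compute them, are prime candidates for pruning during feature selection.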
A key feature of YData Profiling is the ability to compare multiple datasets. This is useful when tracking how a dataset evolves over time, validating a new version of a dataset against a previous one, or checking that training and test splits have similar distributions.
With dataset comparison, you can generate a report highlighting the differences in statistics, correlations, and distributions between two datasets.
# Load two random datasets from seaborn
df1 = sns.load_dataset("titanic") # Titanic dataset
df2 = sns.load_dataset("tips") # Tips dataset
# Generate individual reports
titanic_report = ProfileReport(df1, title="titanic")
tips_report = ProfileReport(df2, title="tips")
comparison_report = titanic_report.compare(tips_report)
comparison_report.to_file("comparison.html")
comparison_report.to_notebook_iframe()
C:\Users\hp\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\ydata_profiling\compare_reports.py:191: UserWarning: The datasets being profiled have a different set of columns. Only the left side profile will be calculated.
warnings.warn(
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 41/41 [00:07<00:00, 5.38it/s, Completed]
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 21.74it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.91s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.96s/it]
Export report to file: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 33.80it/s]
We compared the Titanic dataset (passenger survival information) with the Tips dataset (restaurant bill and tip details).
🔹 Key Differences Identified: the two datasets share no columns, so the library warns that only the left-side (Titanic) profile is fully calculated. The Titanic data is also far larger (891 rows, 15 variables) than the Tips data (244 rows, 7 variables) and contains many more categorical fields.
This comparison highlights how YData Profiling surfaces structural and statistical differences across datasets. In practice, comparison is most informative between datasets that share a schema, such as two versions of the same table.
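Outside the HTML report, a quick structural comparison can also be sketched directly in pandas. The two frames below are hypothetical stand-ins for two versions of a dataset:

```python
import pandas as pd

# Two hypothetical dataset versions with overlapping columns
df1 = pd.DataFrame({"id": [1, 2], "fare": [7.25, 71.28], "deck": ["A", "C"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "fare": [7.25, 71.28, 8.05]})

only_in_df1 = set(df1.columns) - set(df2.columns)   # columns dropped in v2
only_in_df2 = set(df2.columns) - set(df1.columns)   # columns added in v2
shared = sorted(set(df1.columns) & set(df2.columns))
row_delta = len(df2) - len(df1)                      # change in row count

print(only_in_df1, only_in_df2, shared, row_delta)
```

This kind of pre-check also tells you in advance whether `compare()` will emit the "different set of columns" warning shown above.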
What is Time-Series Profiling?
Time-series profiling helps analyze trends, missing timestamps, periodicity, and outliers in sequential data (e.g., stock prices, weather records).
Dataset Overview
We created a random time-series dataset with daily values for 2022.
Insights from Profiling Report
# Generate a sample time-series dataset
date_rng = pd.date_range(start="2022-01-01", end="2022-12-31", freq='D')
df_time_series = pd.DataFrame({
"date": date_rng,
"value": np.random.randint(10, 100, size=len(date_rng))
})
# Set date column as index
df_time_series.set_index("date", inplace=True)
# Generate Profile Report (tsmode=True enables time-series-specific analysis)
profile_time = ProfileReport(df_time_series, tsmode=True, title="Time-Series Dataset Report", explorative=True)
profile_time.to_file("time_series_report.html")
# Display in Jupyter Notebook
profile_time.to_notebook_iframe()
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 18.27it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.95it/s]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7.63it/s]
Export report to file: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 227.27it/s]
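Beyond the report, a few pandas one-liners cover the most common time-series checks, such as finding missing timestamps. A sketch on a synthetic series with one day deliberately removed:

```python
import pandas as pd
import numpy as np

# Synthetic daily series with one date deliberately removed
idx = pd.date_range("2022-01-01", periods=10, freq="D").delete(4)  # drop Jan 5
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Re-index onto the full daily range: the dropped day shows up as NaN
full = s.asfreq("D")
missing_days = full[full.isna()].index

print(list(missing_days))  # → [Timestamp('2022-01-05 00:00:00')]
```

The same asfreq/reindex trick generalizes to hourly or monthly data; it is a quick way to confirm the gaps that a time-series profiling report flags.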
What is Big Data Profiling?
Big data profiling allows users to analyze datasets that are too large to explore comfortably by hand. YData Profiling handles them with techniques such as sampling and a minimal configuration that skips expensive computations when summarizing large datasets.
Dataset Overview
We generated a 1 million-row dataset with random values and categories.
How YData Profiling Handles Big Data
# Generate a large dataset with 1 million rows
np.random.seed(42)
big_data = pd.DataFrame({
"id": range(1, 1000001),
"value": np.random.randn(1000000),
"category": np.random.choice(["A", "B", "C", "D"], size=1000000)
})
# samples={"head": 1000} limits how many rows appear in the report's sample section
profile_big = ProfileReport(big_data, title="Big Data Profiling Report", explorative=True, samples={"head": 1000})
# Save the report
profile_big.to_file("big_data_report.html")
# Display in Jupyter Notebook
profile_big.to_notebook_iframe()
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 16/16 [00:13<00:00, 1.21it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.81s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.33it/s]
Export report to file: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 42.03it/s]
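When even report generation is too slow, a common pattern is to profile a reproducible sample of the data instead of the full frame (ydata-profiling also offers a minimal=True mode that skips expensive computations). A sketch with synthetic data:

```python
import pandas as pd
import numpy as np

# Synthetic "large" frame (100k rows stands in for the 1M-row example above)
np.random.seed(0)
big = pd.DataFrame({
    "value": np.random.randn(100_000),
    "category": np.random.choice(list("ABCD"), size=100_000),
})

# Profile a reproducible 5% sample instead of the full frame
sample = big.sample(frac=0.05, random_state=0)
print(len(sample))  # → 5000
```

A fixed random_state keeps the sample reproducible, so successive profiling runs stay comparable.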
Why is this important?
Handling sensitive information (e.g., emails, phone numbers, credit card details) requires caution to comply with data protection laws like GDPR and CCPA.
Dataset Overview
We created a small dataset with names, email addresses, phone numbers, and credit card numbers.
How YData Profiling Detects Sensitive Data
# Create a sample dataset with sensitive information
sensitive_data = pd.DataFrame({
"name": ["Alice Johnson", "Bob Smith", "Charlie Lee"],
"email": ["alice@example.com", "bob@gmail.com", "charlie@yahoo.com"],
"phone": ["+1-202-555-0173", "+44-20-7946-0958", "+91-9876543210"],
"credit_card": ["4111-1111-1111-1111", "5500-0000-0000-0004", "3400-000000-00009"]
})
# Generate a profile report
profile_sensitive = ProfileReport(sensitive_data, title="Sensitive Data Profiling Report", explorative=True)
# Save and display the report
profile_sensitive.to_file("sensitive_data_report.html")
profile_sensitive.to_notebook_iframe()
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 21.36it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.78s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.14it/s]
Export report to file: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 111.52it/s]
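YData Profiling flags likely PII in its report; a lightweight complement is a regex scan over string columns before any data leaves your environment. The pattern below is a simplified, illustrative e-mail matcher, not a production-grade validator:

```python
import pandas as pd

# Hypothetical frame mixing PII and harmless text
df = pd.DataFrame({
    "name":  ["Alice Johnson", "Bob Smith"],
    "email": ["alice@example.com", "bob@gmail.com"],
    "note":  ["call me", "no address given"],
})

# Simplified e-mail pattern (illustrative only)
EMAIL_RE = r"[\w.+-]+@[\w-]+\.[\w.]+"

# Flag columns where any value matches the pattern
pii_columns = [
    col for col in df.columns
    if df[col].astype(str).str.contains(EMAIL_RE, regex=True).any()
]
print(pii_columns)  # → ['email']
```

The same loop extends to phone-number or credit-card patterns; flagged columns can then be masked or dropped before sharing the data.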
Banks and financial institutions use YData Profiling to analyze transaction data, detect anomalies, and identify potential fraudulent activities.
Hospitals and research centers leverage it to explore patient records, detect missing values, and ensure data consistency for predictive modeling.
E-commerce platforms use YData Profiling to analyze customer behavior, purchase patterns, and product trends, helping optimize marketing strategies.
Machine learning engineers use the tool to assess dataset quality, remove redundant features, and ensure the reliability of training data.
Organizations dealing with sensitive data, such as government agencies and corporations, use it to detect and manage personally identifiable information (PII), ensuring compliance with GDPR and HIPAA.
Retailers use YData Profiling to track inventory data, detect stock discrepancies, and optimize supply chain operations.
Companies use dataset comparison features to monitor changes in sales data, customer engagement metrics, and operational KPIs over time.
Industries like energy and manufacturing analyze time-series datasets using YData Profiling to detect trends and anomalies, improving demand forecasting.
YData Profiling is widely applicable across various domains, helping businesses and researchers make data-driven decisions by providing deep insights into dataset quality and structure.
In this blog, we explored YData Profiling, a powerful tool for automated exploratory data analysis (EDA).
🔹 Final Words:
YData Profiling is a must-have tool for anyone working with structured datasets. By automating EDA, it makes data analysis faster, easier, and more insightful!
For more details, check out the official YData Profiling documentation and the project's GitHub repository. They cover the library's more advanced features and show how to integrate it into real-world projects.