
Automated Data Profiling with YData Profiling

Introduction

What is YData Profiling?

In the era of data-driven decision-making, the ability to understand, analyze, and clean data efficiently is crucial. YData Profiling (formerly pandas-profiling) is a powerful open-source Python library that automates exploratory data analysis (EDA) by generating detailed reports on dataset characteristics.

Instead of manually writing numerous lines of code to explore missing values, correlations, distributions, and data quality, YData Profiling simplifies the process with a single command, delivering an interactive, visually rich HTML report.

Purpose of YData Profiling

The primary goal of YData Profiling is to streamline the process of understanding datasets before performing complex data analysis or machine learning tasks. It helps users quickly explore datasets without writing extensive code, identify data quality issues such as missing values, duplicates, and outliers, and detect correlations and relationships between variables.

Additionally, it enables users to compare datasets to analyze differences over time or between dataset versions. It also provides capabilities for analyzing time-series and big data efficiently while ensuring data privacy and security by detecting sensitive information such as emails, phone numbers, and personal identifiers.

Why Use YData Profiling?

One of the biggest advantages of YData Profiling is its ability to automate insights and generate detailed analysis reports with minimal effort. This drastically reduces the need for manual exploratory data analysis, saving hours of work.

The reports generated by YData Profiling are interactive and visually appealing, making it easier to identify patterns and insights. Moreover, it is optimized for handling large-scale datasets, ensuring smooth performance even with big data.

For businesses and data professionals dealing with sensitive information, YData Profiling includes privacy-preserving features that help detect and manage sensitive data before sharing or processing it further. It also integrates seamlessly with tools like pandas and Jupyter Notebook, making it an essential addition to any data scientist’s toolkit.

What’s Next?

In this blog, we will explore everything YData Profiling offers—from installation to advanced features like dataset comparison, time-series analysis, metadata extraction, and big data handling.

Let’s dive in and see how YData Profiling can revolutionize your data analysis workflow! By the end of this guide, you’ll learn how to install, use, and interpret YData Profiling reports.


Installation & Setup

Before using YData Profiling, install it along with necessary libraries. Run the following command in your Jupyter Notebook cell:

!pip install ydata-profiling
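
If you also want the interactive widget view inside Jupyter, the package can be installed with its notebook extra. This is optional and assumes your environment supports ipywidgets; the plain installation above is all that the examples in this post require:

!pip install "ydata-profiling[notebook]"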

Import Required Libraries

To start, import the necessary Python libraries.

import numpy as np   # used later for generating the synthetic time-series and big-data examples
import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport

Key Features & Explanations

Code Examples and Illustrations

Load Dataset

For demonstration, we use the Titanic dataset from Seaborn.
This dataset contains details of Titanic passengers, such as age, gender, ticket class, and survival status.

Run the following code to import the dataset.

df = sns.load_dataset("titanic")
df.head()  # Display first 5 rows
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

Generate Automated Data Profiling Report

Now, we generate a detailed profiling report using YData Profiling.

profile = ProfileReport(df, explorative=True)
profile.to_notebook_iframe()  # Display interactive report inside Jupyter Notebook
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 41/41 [00:07<00:00,  5.58it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.79s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.85s/it]

🔹 To save the report as an HTML file, run the following code:

profile.to_file("ydata_profiling_report.html")
Export report to file: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.75it/s]
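
Besides HTML, the report can be exported in a machine-readable form, which is handy for automation. A small sketch, assuming the installed version exposes to_json() (recent releases do):

# Get the report contents as a JSON string for programmatic use
report_json = profile.to_json()

# to_file() infers the output format from the file extension
profile.to_file("ydata_profiling_report.json")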

Extract Key Insights from Report

Instead of just viewing the report, we can extract useful dataset metadata programmatically.

description = profile.get_description()

# Print available attributes
print(dir(description))
['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'alerts', 'analysis', 'correlations', 'duplicates', 'missing', 'package', 'sample', 'scatter', 'table', 'time_index_analysis', 'variables']
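
One particularly useful attribute in that list is alerts, which collects the data-quality warnings the report raises (high missingness, high correlation, duplicate rows, and so on). A minimal sketch, assuming each alert object renders as a readable string:

# List every data-quality alert raised for this dataset
for alert in description.alerts:
    print(alert)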

Dataset Summary

We can extract basic dataset statistics like:

dataset_summary = description.table
dataset_summary
{'n': 891,
 'n_var': 15,
 'memory_size': np.int64(285564),
 'record_size': 320.4983164983165,
 'n_cells_missing': np.int64(869),
 'n_vars_with_missing': 4,
 'n_vars_all_missing': 0,
 'p_cells_missing': np.float64(0.06502057613168724),
 'types': {'Categorical': 8, 'Numeric': 4, 'Boolean': 3},
 'n_duplicates': 53,
 'p_duplicates': 0.05948372615039282}

🔹 The output gives an overview of the dataset, including the number of rows (891) and variables (15), memory usage, how many cells and variables contain missing values, the breakdown of variable types (categorical, numeric, boolean), and the number of duplicate rows.

Detect Missing Values

Missing values can affect analysis. Let’s identify columns with missing values.

missing_values = df.isnull().sum()
missing_values[missing_values > 0]  # Show only columns with missing values
age            177
embarked         2
deck           688
embark_town      2
dtype: int64
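
The profiling report flags these same columns in its missing-values section; what you do with them is a modelling decision. A minimal sketch of one common approach with pandas (the choices below are illustrative, not prescriptive): impute age with the median, fill the few categorical gaps with the most frequent value, and drop deck, where 688 of 891 values are missing.

df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].median())                    # numeric: median
df_filled["embarked"] = df_filled["embarked"].fillna(df_filled["embarked"].mode()[0])    # categorical: mode
df_filled["embark_town"] = df_filled["embark_town"].fillna(df_filled["embark_town"].mode()[0])
df_filled = df_filled.drop(columns=["deck"])  # too sparse to impute sensibly

df_filled.isnull().sum().sum()  # 0 remaining missing values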

Identifying & Removing Duplicate Rows

Duplicate records can distort analysis. Let’s check and remove duplicates.

print("Total Duplicates:", df.duplicated().sum())

df_cleaned = df.drop_duplicates()
print("New dataset shape:", df_cleaned.shape)
print(df_cleaned)
Total Duplicates: 0
New dataset shape: (3, 4)
      name  age              email  salary
0    Alice   25  alice@example.com   55000
1      Bob   30      bob@gmail.com   62000
2  Charlie   35  charlie@yahoo.com   72000

Correlation Analysis

Understanding correlations helps in feature selection and detecting relationships between variables.

# Get available correlation matrices
correlations = description.correlations
print(correlations.keys())  # Check which correlation types are available
dict_keys(['auto'])
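
With the default settings only the 'auto' matrix is computed. Assuming it comes back as a pandas DataFrame (as it does in recent releases), it can be inspected directly, for example to see which variables are most associated with survived:

# The automatically selected correlation matrix as a DataFrame
auto_corr = correlations["auto"]

# Associations involving 'survived', strongest first
print(auto_corr["survived"].sort_values(ascending=False))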

Comparing Datasets

A key feature of YData Profiling is the ability to compare multiple datasets. This is useful when tracking how a dataset changes over time, validating a new version of a dataset against an older one, or checking whether train and test splits have similar distributions.

With dataset comparison, you can generate a report highlighting the differences in statistics, correlations, and distributions between two datasets.

# Load two random datasets from seaborn
df1 = sns.load_dataset("titanic")  # Titanic dataset
df2 = sns.load_dataset("tips")     # Tips dataset

# Generate individual reports
titanic_report = ProfileReport(df1, title="titanic")
tips_report = ProfileReport(df2, title="tips")

comparison_report = titanic_report.compare(tips_report)
comparison_report.to_file("comparison.html")

comparison_report.to_notebook_iframe()
C:\Users\hp\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\ydata_profiling\compare_reports.py:191: UserWarning: The datasets being profiled have a different set of columns. Only the left side profile will be calculated.
  warnings.warn(
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 41/41 [00:07<00:00,  5.38it/s, Completed]
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 21.74it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.91s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.96s/it]
Export report to file: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 33.80it/s]

Comparing Titanic and Tips Datasets

We compared the Titanic dataset (passenger survival information) with the Tips dataset (restaurant bill & tip details).

🔹 Key Differences Identified: the two datasets share no columns at all, which YData Profiling flags with the warning shown above, and they differ in size and structure (891 rows and 15 mixed-type variables for Titanic versus 244 rows and 7 variables for Tips).

This comparison highlights how YData Profiling detects structural and statistical differences across datasets, although a comparison is most informative when the datasets share the same schema, for example two versions of the same table.

Time-Series Data Profiling with YData Profiling

What is Time-Series Profiling?
Time-series profiling helps analyze trends, missing timestamps, periodicity, and outliers in sequential data (e.g., stock prices, weather records).

Dataset Overview
We created a random time-series dataset with daily values for 2022.

Insights from Profiling Report

# Generate a sample time-series dataset
date_rng = pd.date_range(start="2022-01-01", end="2022-12-31", freq='D')
df_time_series = pd.DataFrame({
    "date": date_rng,
    "value": np.random.randint(10, 100, size=len(date_rng))
})

# Set date column as index
df_time_series.set_index("date", inplace=True)

# Generate Profile Report
profile_time = ProfileReport(df_time_series, title="Time-Series Dataset Report", explorative=True)
profile_time.to_file("time_series_report.html")

# Display in Jupyter Notebook
profile_time.to_notebook_iframe()
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 18.27it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.95it/s]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  7.63it/s]
Export report to file: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 227.27it/s]
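
The report above still treats the data like an ordinary table. Recent versions of YData Profiling also offer a dedicated time-series mode that adds per-variable trend, seasonality, and gap analysis; a minimal sketch, assuming your installed version supports the tsmode and sortby options:

# Keep "date" as a regular column so it can serve as the time axis
df_ts = df_time_series.reset_index()

# tsmode=True switches on time-series-specific statistics; sortby names the time column
profile_ts = ProfileReport(df_ts, title="Time-Series Report (tsmode)", tsmode=True, sortby="date")
profile_ts.to_file("time_series_tsmode_report.html")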

Big Data Profiling with YData Profiling

What is Big Data Profiling?
Big data profiling allows users to analyze datasets that are too large for exhaustive, fully detailed computation. YData Profiling copes with scale mainly by letting you trim what the report computes and displays, for example by limiting the embedded sample, switching to a minimal configuration, or profiling a representative subset of the rows.

Dataset Overview
We generated a 1 million-row dataset with random values and categories.

How Does YData Profiling Handle Big Data?

# Generate a large dataset with 1 million rows
np.random.seed(42)
big_data = pd.DataFrame({
    "id": range(1, 1000001),
    "value": np.random.randn(1000000),
    "category": np.random.choice(["A", "B", "C", "D"], size=1000000)
})

# samples={"head": 1000} limits how many example rows are embedded in the report;
# the summary statistics are still computed over the full dataset
profile_big = ProfileReport(big_data, title="Big Data Profiling Report", explorative=True, samples={"head": 1000})

# Save the report
profile_big.to_file("big_data_report.html")

# Display in Jupyter Notebook
profile_big.to_notebook_iframe()
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 16/16 [00:13<00:00,  1.21it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.81s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.33it/s]
Export report to file: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 42.03it/s]
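
For genuinely large tables, the more effective levers are minimal mode, which switches off the most expensive computations such as correlations and interactions, and profiling a random sample of rows instead of the full frame. A short sketch of both, with the sample size chosen purely for illustration:

# Minimal mode: fast, lightweight profile of the full 1M-row frame
profile_min = ProfileReport(big_data, title="Big Data Report (minimal mode)", minimal=True)
profile_min.to_file("big_data_minimal_report.html")

# Alternative: profile a 100k-row random sample rather than every row
profile_sampled = ProfileReport(big_data.sample(100_000, random_state=42), title="Big Data Report (100k sample)")
profile_sampled.to_file("big_data_sample_report.html")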

Detecting Sensitive Data with YData Profiling

Why is this important?
Handling sensitive information (e.g., emails, phone numbers, credit card details) requires caution to comply with data protection laws like GDPR and CCPA.

Dataset Overview
We created a small dataset with names, email addresses, phone numbers, and credit card numbers for three fictional people.

How Does YData Profiling Detect Sensitive Data?

# Create a sample dataset with sensitive information
sensitive_data = pd.DataFrame({
    "name": ["Alice Johnson", "Bob Smith", "Charlie Lee"],
    "email": ["alice@example.com", "bob@gmail.com", "charlie@yahoo.com"],
    "phone": ["+1-202-555-0173", "+44-20-7946-0958", "+91-9876543210"],
    "credit_card": ["4111-1111-1111-1111", "5500-0000-0000-0004", "3400-000000-00009"]
})

# Generate a profile report
profile_sensitive = ProfileReport(sensitive_data, title="Sensitive Data Profiling Report", explorative=True)

# Save and display the report
profile_sensitive.to_file("sensitive_data_report.html")
profile_sensitive.to_notebook_iframe()
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 21.36it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.78s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.14it/s]
Export report to file: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 111.52it/s]
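
Beyond simply profiling such columns, the library can be told explicitly that a dataset is sensitive so that real values are kept out of the rendered report. A minimal sketch, assuming your installed version supports the sensitive flag described in the documentation:

# sensitive=True asks the report not to display individual records or real values
profile_private = ProfileReport(sensitive_data, title="Sensitive Data Report (redacted)", sensitive=True)
profile_private.to_file("sensitive_data_redacted_report.html")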

Practical Uses of YData Profiling

1. Financial Fraud Detection

Banks and financial institutions use YData Profiling to analyze transaction data, detect anomalies, and identify potential fraudulent activities.

2. Healthcare Data Analysis

Hospitals and research centers leverage it to explore patient records, detect missing values, and ensure data consistency for predictive modeling.

3. E-commerce Customer Insights

E-commerce platforms use YData Profiling to analyze customer behavior, purchase patterns, and product trends, helping optimize marketing strategies.

4. AI Model Training & Feature Engineering

Machine learning engineers use the tool to assess dataset quality, remove redundant features, and ensure the reliability of training data.

5. Data Governance & Compliance

Organizations dealing with sensitive data, such as government agencies and corporations, use it to detect and manage personally identifiable information (PII), ensuring compliance with GDPR and HIPAA.

6. Retail Inventory Management

Retailers use YData Profiling to track inventory data, detect stock discrepancies, and optimize supply chain operations.

7. Comparing Datasets in Business Analytics

Companies use dataset comparison features to monitor changes in sales data, customer engagement metrics, and operational KPIs over time.

8. Time-Series Forecasting for Demand Prediction

Industries like energy and manufacturing analyze time-series datasets using YData Profiling to detect trends and anomalies, improving demand forecasting.

YData Profiling is widely applicable across various domains, helping businesses and researchers make data-driven decisions by providing deep insights into dataset quality and structure.

Conclusion

In this blog, we explored YData Profiling, a powerful tool for automated exploratory data analysis (EDA).

Key Takeaways: with a single ProfileReport call you get a full interactive EDA report; the report surfaces missing values, duplicates, correlations, and data-quality alerts; datasets can be compared against each other; time-series data, large tables, and sensitive fields all have dedicated support; and everything in the report can also be accessed programmatically via get_description().

🔹 Final Words:
YData Profiling is a must-have tool for anyone working with structured datasets. By automating EDA, it makes data analysis faster, easier, and more insightful!

References & Further Reading

For more details, official documentation, and additional learning resources, check out these links:

YData Profiling documentation: https://docs.profiling.ydata.ai/
YData Profiling on GitHub: https://github.com/ydataai/ydata-profiling
YData Profiling on PyPI: https://pypi.org/project/ydata-profiling/

These links will help you explore more advanced features of YData Profiling and understand how to integrate it into real-world projects.