
Automated Data Profiling with YData Profiling

Introduction

What is YData Profiling?

In the era of data-driven decision-making, the ability to understand, analyze, and clean data efficiently is crucial. YData Profiling (formerly pandas-profiling) is a powerful open-source Python library that automates exploratory data analysis (EDA) by generating detailed reports on dataset characteristics.

Instead of manually writing numerous lines of code to explore missing values, correlations, distributions, and data quality, YData Profiling simplifies the process with a single command, delivering an interactive, visually rich HTML report.

Purpose of YData Profiling

The primary goal of YData Profiling is to streamline the process of understanding datasets before performing complex data analysis or machine learning tasks. It helps users quickly explore datasets without writing extensive code, identify data quality issues such as missing values, duplicates, and outliers, and detect correlations and relationships between variables.

Additionally, it enables users to compare datasets to analyze differences over time or between dataset versions. It also provides capabilities for analyzing time-series and big data efficiently while ensuring data privacy and security by detecting sensitive information such as emails, phone numbers, and personal identifiers.

Why Use YData Profiling?

One of the biggest advantages of YData Profiling is its ability to automate insights and generate detailed analysis reports with minimal effort. This drastically reduces the need for manual exploratory data analysis, saving hours of work.

The reports generated by YData Profiling are interactive and visually appealing, making it easier to identify patterns and insights. Moreover, it is optimized for handling large-scale datasets, ensuring smooth performance even with big data.

For businesses and data professionals dealing with sensitive information, YData Profiling includes privacy-preserving features that help detect and manage sensitive data before sharing or processing it further. It also integrates seamlessly with tools like pandas and Jupyter Notebook, making it an essential addition to any data scientist’s toolkit.

What’s Next?

In this blog, we will explore everything YData Profiling offers—from installation to advanced features like dataset comparison, time-series analysis, metadata extraction, and big data handling.

Let’s dive in and see how YData Profiling can revolutionize your data analysis workflow! By the end of this guide, you’ll learn how to install, use, and interpret YData Profiling reports.


Installation & Setup

Before using YData Profiling, install it along with necessary libraries. Run the following command in your Jupyter Notebook cell:

!pip install ydata-profiling
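
If you also want the interactive widget view inside Jupyter, the package can be installed with its notebook extra. This is optional and assumes your environment supports ipywidgets; the plain installation above is all that the examples in this post require:

!pip install "ydata-profiling[notebook]"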

Import Required Libraries

To start, import the necessary Python libraries.

import numpy as np   # used later for generating the synthetic time-series and big-data examples
import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport

Key Features & Explanations

Code Examples and Illustrations

Load Dataset

For demonstration, we use the Titanic dataset from Seaborn.
This dataset contains details of Titanic passengers, such as age, gender, ticket class, and survival status.

Run the following code to import the dataset.

df = sns.load_dataset("titanic")
df.head()  # Display first 5 rows
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

Generate Automated Data Profiling Report

Now, we generate a detailed profiling report using YData Profiling.

profile = ProfileReport(df, explorative=True)
profile.to_notebook_iframe()  # Display interactive report inside Jupyter Notebook
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 41/41 [00:07<00:00,  5.58it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.79s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.85s/it]

🔹 To save the report as an HTML file, run the following code:

profile.to_file("ydata_profiling_report.html")
Export report to file: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.75it/s]
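
Besides HTML, the report can be exported in a machine-readable form, which is handy for automation. A small sketch, assuming the installed version exposes to_json() (recent releases do):

# Get the report contents as a JSON string for programmatic use
report_json = profile.to_json()

# to_file() infers the output format from the file extension
profile.to_file("ydata_profiling_report.json")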

Extract Key Insights from Report

Instead of just viewing the report, we can extract useful dataset metadata programmatically.

description = profile.get_description()

# Print available attributes
print(dir(description))
['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'alerts', 'analysis', 'correlations', 'duplicates', 'missing', 'package', 'sample', 'scatter', 'table', 'time_index_analysis', 'variables']
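
One particularly useful attribute in that list is alerts, which collects the data-quality warnings the report raises (high missingness, high correlation, duplicate rows, and so on). A minimal sketch, assuming each alert object renders as a readable string:

# List every data-quality alert raised for this dataset
for alert in description.alerts:
    print(alert)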

Dataset Summary

We can extract basic dataset statistics like:

dataset_summary = description.table
dataset_summary
{'n': 891,
 'n_var': 15,
 'memory_size': np.int64(285564),
 'record_size': 320.4983164983165,
 'n_cells_missing': np.int64(869),
 'n_vars_with_missing': 4,
 'n_vars_all_missing': 0,
 'p_cells_missing': np.float64(0.06502057613168724),
 'types': {'Categorical': 8, 'Numeric': 4, 'Boolean': 3},
 'n_duplicates': 53,
 'p_duplicates': 0.05948372615039282}

🔹 The output gives an overview of the dataset, including the number of rows (891) and variables (15), memory usage, how many cells and variables contain missing values, the breakdown of variable types (categorical, numeric, boolean), and the number of duplicate rows.

Detect Missing Values

Missing values can affect analysis. Let’s identify columns with missing values.

missing_values = df.isnull().sum()
missing_values[missing_values > 0]  # Show only columns with missing values
age            177
embarked         2
deck           688
embark_town      2
dtype: int64
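
The profiling report flags these same columns in its missing-values section; what you do with them is a modelling decision. A minimal sketch of one common approach with pandas (the choices below are illustrative, not prescriptive): impute age with the median, fill the few categorical gaps with the most frequent value, and drop deck, where 688 of 891 values are missing.

df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].median())                    # numeric: median
df_filled["embarked"] = df_filled["embarked"].fillna(df_filled["embarked"].mode()[0])    # categorical: mode
df_filled["embark_town"] = df_filled["embark_town"].fillna(df_filled["embark_town"].mode()[0])
df_filled = df_filled.drop(columns=["deck"])  # too sparse to impute sensibly

df_filled.isnull().sum().sum()  # 0 remaining missing values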

Identifying & Removing Duplicate Rows

Duplicate records can distort analysis. Let’s check and remove duplicates.

print("Total Duplicates:", df.duplicated().sum())

df_cleaned = df.drop_duplicates()
print("New dataset shape:", df_cleaned.shape)
print(df_cleaned)
Total Duplicates: 0
New dataset shape: (3, 4)
      name  age              email  salary
0    Alice   25  alice@example.com   55000
1      Bob   30      bob@gmail.com   62000
2  Charlie   35  charlie@yahoo.com   72000

Correlation Analysis

Understanding correlations helps in feature selection and detecting relationships between variables.

# Get available correlation matrices
correlations = description.correlations
print(correlations.keys())  # Check which correlation types are available
dict_keys(['auto'])
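
With the default settings only the 'auto' matrix is computed. Assuming it comes back as a pandas DataFrame (as it does in recent releases), it can be inspected directly, for example to see which variables are most associated with survived:

# The automatically selected correlation matrix as a DataFrame
auto_corr = correlations["auto"]

# Associations involving 'survived', strongest first
print(auto_corr["survived"].sort_values(ascending=False))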

Comparing Datasets

A key feature of YData Profiling is the ability to compare multiple datasets. This is useful when tracking how a dataset changes over time, validating a new version of a dataset against an older one, or checking whether train and test splits have similar distributions.

With dataset comparison, you can generate a report highlighting the differences in statistics, correlations, and distributions between two datasets.

# Load two random datasets from seaborn
df1 = sns.load_dataset("titanic")  # Titanic dataset
df2 = sns.load_dataset("tips")     # Tips dataset

# Generate individual reports
titanic_report = ProfileReport(df1, title="titanic")
tips_report = ProfileReport(df2, title="tips")

comparison_report = titanic_report.compare(tips_report)
comparison_report.to_file("comparison.html")

comparison_report.to_notebook_iframe()
C:\Users\hp\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\ydata_profiling\compare_reports.py:191: UserWarning: The datasets being profiled have a different set of columns. Only the left side profile will be calculated.
  warnings.warn(
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 41/41 [00:07<00:00,  5.38it/s, Completed]
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 21.74it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.91s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.96s/it]
Export report to file: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 33.80it/s]

Comparing Titanic and Tips Datasets

We compared the Titanic dataset (passenger survival information) with the Tips dataset (restaurant bill & tip details).

🔹 Key Differences Identified: the two datasets share no columns at all, which YData Profiling flags with the warning shown above, and they differ in size and structure (891 rows and 15 mixed-type variables for Titanic versus 244 rows and 7 variables for Tips).

This comparison highlights how YData Profiling detects structural and statistical differences across datasets, although a comparison is most informative when the datasets share the same schema, for example two versions of the same table.

Time-Series Data Profiling with YData Profiling

What is Time-Series Profiling?
Time-series profiling helps analyze trends, missing timestamps, periodicity, and outliers in sequential data (e.g., stock prices, weather records).

Dataset Overview
We created a random time-series dataset with daily values for 2022.

Insights from Profiling Report

# Generate a sample time-series dataset
date_rng = pd.date_range(start="2022-01-01", end="2022-12-31", freq='D')
df_time_series = pd.DataFrame({
    "date": date_rng,
    "value": np.random.randint(10, 100, size=len(date_rng))
})

# Set date column as index
df_time_series.set_index("date", inplace=True)

# Generate Profile Report
profile_time = ProfileReport(df_time_series, title="Time-Series Dataset Report", explorative=True)
profile_time.to_file("time_series_report.html")

# Display in Jupyter Notebook
profile_time.to_notebook_iframe()
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 18.27it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.95it/s]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  7.63it/s]
Export report to file: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 227.27it/s]
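
The report above still treats the data like an ordinary table. Recent versions of YData Profiling also offer a dedicated time-series mode that adds per-variable trend, seasonality, and gap analysis; a minimal sketch, assuming your installed version supports the tsmode and sortby options:

# Keep "date" as a regular column so it can serve as the time axis
df_ts = df_time_series.reset_index()

# tsmode=True switches on time-series-specific statistics; sortby names the time column
profile_ts = ProfileReport(df_ts, title="Time-Series Report (tsmode)", tsmode=True, sortby="date")
profile_ts.to_file("time_series_tsmode_report.html")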

Big Data Profiling with YData Profiling

What is Big Data Profiling?
Big data profiling allows users to analyze datasets that are too large for exhaustive, fully detailed computation. YData Profiling copes with scale mainly by letting you trim what the report computes and displays, for example by limiting the embedded sample, switching to a minimal configuration, or profiling a representative subset of the rows.

Dataset Overview
We generated a 1 million-row dataset with random values and categories.

How Does YData Profiling Handle Big Data?

# Generate a large dataset with 1 million rows
np.random.seed(42)
big_data = pd.DataFrame({
    "id": range(1, 1000001),
    "value": np.random.randn(1000000),
    "category": np.random.choice(["A", "B", "C", "D"], size=1000000)
})

# samples={"head": 1000} limits how many example rows are embedded in the report;
# the summary statistics are still computed over the full dataset
profile_big = ProfileReport(big_data, title="Big Data Profiling Report", explorative=True, samples={"head": 1000})

# Save the report
profile_big.to_file("big_data_report.html")

# Display in Jupyter Notebook
profile_big.to_notebook_iframe()
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 16/16 [00:13<00:00,  1.21it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.81s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.33it/s]
Export report to file: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 42.03it/s]
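
For genuinely large tables, the more effective levers are minimal mode, which switches off the most expensive computations such as correlations and interactions, and profiling a random sample of rows instead of the full frame. A short sketch of both, with the sample size chosen purely for illustration:

# Minimal mode: fast, lightweight profile of the full 1M-row frame
profile_min = ProfileReport(big_data, title="Big Data Report (minimal mode)", minimal=True)
profile_min.to_file("big_data_minimal_report.html")

# Alternative: profile a 100k-row random sample rather than every row
profile_sampled = ProfileReport(big_data.sample(100_000, random_state=42), title="Big Data Report (100k sample)")
profile_sampled.to_file("big_data_sample_report.html")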

Detecting Sensitive Data with YData Profiling

Why is this important?
Handling sensitive information (e.g., emails, phone numbers, credit card details) requires caution to comply with data protection laws like GDPR and CCPA.

Dataset Overview
We created a small dataset with names, email addresses, phone numbers, and credit card numbers for three fictional people.

How Does YData Profiling Detect Sensitive Data?

# Create a sample dataset with sensitive information
sensitive_data = pd.DataFrame({
    "name": ["Alice Johnson", "Bob Smith", "Charlie Lee"],
    "email": ["alice@example.com", "bob@gmail.com", "charlie@yahoo.com"],
    "phone": ["+1-202-555-0173", "+44-20-7946-0958", "+91-9876543210"],
    "credit_card": ["4111-1111-1111-1111", "5500-0000-0000-0004", "3400-000000-00009"]
})

# Generate a profile report
profile_sensitive = ProfileReport(sensitive_data, title="Sensitive Data Profiling Report", explorative=True)

# Save and display the report
profile_sensitive.to_file("sensitive_data_report.html")
profile_sensitive.to_notebook_iframe()
Summarize dataset: 100%|███████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 21.36it/s, Completed]
Generate report structure: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.78s/it]
Render HTML: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.14it/s]
Export report to file: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 111.52it/s]
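
Beyond simply profiling such columns, the library can be told explicitly that a dataset is sensitive so that real values are kept out of the rendered report. A minimal sketch, assuming your installed version supports the sensitive flag described in the documentation:

# sensitive=True asks the report not to display individual records or real values
profile_private = ProfileReport(sensitive_data, title="Sensitive Data Report (redacted)", sensitive=True)
profile_private.to_file("sensitive_data_redacted_report.html")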

Practical Uses of YData Profiling

1. Financial Fraud Detection

Banks and financial institutions use YData Profiling to analyze transaction data, detect anomalies, and identify potential fraudulent activities.

2. Healthcare Data Analysis

Hospitals and research centers leverage it to explore patient records, detect missing values, and ensure data consistency for predictive modeling.

3. E-commerce Customer Insights

E-commerce platforms use YData Profiling to analyze customer behavior, purchase patterns, and product trends, helping optimize marketing strategies.

4. AI Model Training & Feature Engineering

Machine learning engineers use the tool to assess dataset quality, remove redundant features, and ensure the reliability of training data.

5. Data Governance & Compliance

Organizations dealing with sensitive data, such as government agencies and corporations, use it to detect and manage personally identifiable information (PII), ensuring compliance with GDPR and HIPAA.

6. Retail Inventory Management

Retailers use YData Profiling to track inventory data, detect stock discrepancies, and optimize supply chain operations.

7. Comparing Datasets in Business Analytics

Companies use dataset comparison features to monitor changes in sales data, customer engagement metrics, and operational KPIs over time.

8. Time-Series Forecasting for Demand Prediction

Industries like energy and manufacturing analyze time-series datasets using YData Profiling to detect trends and anomalies, improving demand forecasting.

YData Profiling is widely applicable across various domains, helping businesses and researchers make data-driven decisions by providing deep insights into dataset quality and structure.

Conclusion

In this blog, we explored YData Profiling, a powerful tool for automated exploratory data analysis (EDA).

Key Takeaways: with a single ProfileReport call you get a full interactive EDA report; the report surfaces missing values, duplicates, correlations, and data-quality alerts; datasets can be compared against each other; time-series data, large tables, and sensitive fields all have dedicated support; and everything in the report can also be accessed programmatically via get_description().

🔹 Final Words:
YData Profiling is a must-have tool for anyone working with structured datasets. By automating EDA, it makes data analysis faster, easier, and more insightful!

References & Further Reading

For more details, official documentation, and additional learning resources, check out these links:

YData Profiling documentation: https://docs.profiling.ydata.ai/
YData Profiling on GitHub: https://github.com/ydataai/ydata-profiling
YData Profiling on PyPI: https://pypi.org/project/ydata-profiling/

These links will help you explore more advanced features of YData Profiling and understand how to integrate it into real-world projects.