Unit 3: Data Preprocessing and Exploration



Data Preprocessing and Exploration

Before applying data mining or analytics techniques, raw data must be cleaned and prepared. This step ensures accuracy, consistency, and better decision-making.

Data Preparation Techniques

A. Data Cleaning

Data cleaning means fixing problems in raw data so analysis becomes reliable. It includes:

  • Handling missing values. Example: replacing blank entries with average values, or removing incomplete rows.
  • Removing noise (errors). Example: correcting misspelled names or unrealistic numbers.
  • Identifying outliers. Example: a salary of ₹10 crore in a dataset of freshers.
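
The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not a fixed recipe: the function names, the sample salaries, and the choice of the 1.5 × IQR rule for outliers are all assumptions made for the example.

```python
from statistics import mean, quantiles

def find_outliers_iqr(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (the common 1.5*IQR rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def fill_missing_with_mean(values):
    """Fill None entries with the mean of the observed, non-outlier values."""
    observed = [v for v in values if v is not None]
    outliers = find_outliers_iqr(observed)
    typical = [v for v in observed if v not in outliers]
    fill = mean(typical)
    filled = [fill if v is None else v for v in values]
    return filled, outliers

# Hypothetical fresher salaries in lakh rupees per year; None marks a blank entry.
salaries = [3.0, 3.2, None, 3.4, 3.5, 3.6, 1000.0, 3.8, 4.0]
filled, outliers = fill_missing_with_mean(salaries)
print(outliers)  # the unrealistic salary is reported for review, not silently removed
```

Outliers are excluded before the average is computed because a single extreme value would otherwise distort the fill value.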

B. Data Integration

Combining data from multiple sources into one unified view.
Example:

  • Sales data from CRM + customer data from ERP + website data from Google Analytics.

Challenges:

  • Different formats
  • Conflicting values
  • Duplicate records

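Solution: Use metadata (descriptions of each source's fields, formats, and keys) together with integration rules that decide which source wins when values conflict and how duplicates are merged.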
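
As a rough sketch of such integration rules, the example below merges records from two sources on a shared key. The field names, the sample records, and the rule "the later source wins on conflict" are assumptions chosen for illustration.

```python
def integrate(sources, key="customer_id"):
    """Merge rows from several sources into one record per key.

    Assumed rules: field names are lower-cased to a common format,
    duplicate rows collapse into one record, and on conflicting values
    the later source in `sources` overrides the earlier one.
    """
    merged = {}
    for rows in sources:
        for row in rows:
            normalized = {name.lower(): value for name, value in row.items()}
            merged.setdefault(normalized[key], {}).update(normalized)
    return list(merged.values())

# Hypothetical CRM and ERP extracts with different formats, a duplicate, and a conflict.
crm = [
    {"Customer_ID": 1, "Sales": 500},
    {"Customer_ID": 1, "Sales": 500},                  # duplicate record
    {"Customer_ID": 2, "Sales": 300, "City": "Pune"},  # city conflicts with ERP
]
erp = [
    {"customer_id": 1, "city": "Mumbai"},
    {"customer_id": 2, "city": "Nagpur"},
]
unified = integrate([crm, erp])
```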

C. Data Transformation

Changing data into a suitable form for analysis:

  • Normalization - Scaling values to a similar range (e.g., 0 – 1)
  • Aggregation - Summarizing: daily sales → monthly sales
  • Encoding - Converting text categories into numbers (e.g., Male = 1, Female = 2). The codes are only labels and do not imply any order or magnitude.
  • Smoothing - Removing noise by applying averages.
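
The four transformations above can each be written in a few lines. The following is a minimal sketch; the function names and sample values are assumptions, and the moving average is only one of several possible smoothing methods.

```python
from collections import defaultdict

def min_max_scale(values):
    """Normalization: rescale values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all values equal, nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def monthly_totals(daily_sales):
    """Aggregation: sum (\"YYYY-MM-DD\", amount) pairs into one total per month."""
    totals = defaultdict(float)
    for day, amount in daily_sales:
        totals[day[:7]] += amount     # \"YYYY-MM\" prefix identifies the month
    return dict(totals)

# Encoding: the mapping used in the notes (codes are labels, not magnitudes).
GENDER_CODES = {"Male": 1, "Female": 2}

def encode(labels, codes):
    """Encoding: replace each text category with its numeric code."""
    return [codes[label] for label in labels]

def moving_average(values, window=3):
    """Smoothing: average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

print(min_max_scale([10, 20, 30]))
print(monthly_totals([("2024-01-05", 100), ("2024-01-20", 50), ("2024-02-01", 70)]))
print(encode(["Male", "Female", "Male"], GENDER_CODES))
print(moving_average([10, 20, 60, 20, 10]))
```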

D. Data Reduction

Reducing the size of data while keeping important information:

  • Dimensionality reduction - Removing unnecessary attributes, or combining them into fewer ones (e.g., PCA).
  • Sampling - Using a smaller part of large datasets for quick analysis.
  • Compression - Reducing storage size.
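
Two of these reductions, sampling rows and dropping attributes, are easy to sketch. The function names, the 10% fraction, and the sample columns below are assumptions for illustration only.

```python
import random

def simple_random_sample(rows, fraction, seed=42):
    """Sampling: return a random subset holding about `fraction` of the rows."""
    rng = random.Random(seed)             # fixed seed makes the sample repeatable
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)

def drop_attributes(rows, unwanted):
    """Attribute removal: keep every column except those listed in `unwanted`."""
    return [{name: value for name, value in row.items() if name not in unwanted}
            for row in rows]

# Hypothetical dataset: 1,000 rows with one attribute that is not needed.
rows = [{"id": i, "amount": i * 10, "internal_note": "n/a"} for i in range(1000)]
sample = simple_random_sample(rows, fraction=0.1)
reduced = drop_attributes(sample, unwanted={"internal_note"})
```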

E. Discretization

Converting continuous data into categories.

Example:

  • Age → {Youth, Adult, Senior}
  • Income → {Low, Medium, High}
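
A minimal sketch of discretization follows. The cut-off ages (25 and 60) and income limits are assumptions for the example; the notes do not define where each category begins.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Map a continuous value to a category.

    `cut_points` must be sorted ascending, and `labels` needs exactly one
    more entry than `cut_points`. A value equal to a cut point falls into
    the higher category.
    """
    return labels[bisect_right(cut_points, value)]

# Assumed boundaries, chosen only for illustration.
AGE_CUTS, AGE_LABELS = [25, 60], ["Youth", "Adult", "Senior"]
INCOME_CUTS, INCOME_LABELS = [300_000, 1_000_000], ["Low", "Medium", "High"]

print([discretize(a, AGE_CUTS, AGE_LABELS) for a in (19, 25, 59, 72)])
print(discretize(450_000, INCOME_CUTS, INCOME_LABELS))
```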

F. Concept Hierarchy

Organizing data in levels from specific (low level, high detail) to general (high level, low detail).

Example:

  • City → State → Country
  • Product model → Product category → Product family

Used in OLAP: roll-up moves to a more general level (City → State), and drill-down moves back to a more detailed level (State → City).
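
Roll-up along a concept hierarchy can be sketched with two lookup tables. The sales figures are made up for the example, and the `roll_up` helper is a hypothetical name, not part of any OLAP product.

```python
from collections import defaultdict

# One lookup table per step of the hierarchy City → State → Country.
CITY_TO_STATE = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Bengaluru": "Karnataka"}
STATE_TO_COUNTRY = {"Maharashtra": "India", "Karnataka": "India"}

def roll_up(totals, mapping):
    """Re-aggregate totals one level up the hierarchy using `mapping`."""
    rolled = defaultdict(int)
    for member, value in totals.items():
        rolled[mapping[member]] += value
    return dict(rolled)

# Hypothetical sales figures at the most detailed level.
sales_by_city = {"Mumbai": 120, "Pune": 80, "Bengaluru": 100}
sales_by_state = roll_up(sales_by_city, CITY_TO_STATE)
sales_by_country = roll_up(sales_by_state, STATE_TO_COUNTRY)
print(sales_by_state)
print(sales_by_country)
```

Drill-down is the reverse direction: starting from the country total and returning to the stored state or city figures.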

Data Exploration

Before mining, analysts explore data using:

  • Descriptive statistics (mean, median, mode)
  • Visualization (charts, histograms)
  • Correlation analysis (relationship between variables)
  • Distribution analysis (normal, skewed)

Exploration helps identify:

  • Trends
  • Patterns
  • Anomalies
  • Data errors
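
Two of the exploration checks, correlation and a quick look at skew, can be sketched as below. Comparing mean and median is only a rough heuristic for skew, used here as an assumption to keep the example short; the sample numbers are invented.

```python
from math import sqrt
from statistics import mean, median

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def skew_direction(values):
    """Rough skew check: a mean far above the median suggests a right skew."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "roughly symmetric"

# Hypothetical data: advertising spend vs sales, and a list of incomes.
ad_spend = [1, 2, 3, 4]
sales = [2, 4, 6, 8]
print(pearson(ad_spend, sales))            # close to 1: strong positive relationship
print(skew_direction([1, 2, 3, 4, 100]))   # one large value pulls the mean up
```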

Feature Engineering

Feature Engineering means creating useful input variables (features) that improve model performance.

A. Feature Extraction

Deriving new variables from raw data.

Examples:

  • Extracting “Month” or “Day” from a date
  • Extracting “Frequency of Purchase” from transaction logs
  • Extracting text features using TF-IDF (for sentiment analysis)
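
The three extraction examples above can be sketched as follows. The function names and sample data are assumptions, and the TF-IDF shown uses one common variant (term frequency = count / document length, inverse document frequency = log(N / document frequency)); other variants exist.

```python
from collections import Counter
from datetime import date
from math import log

def date_features(d):
    """Extract month and day from a date."""
    return {"month": d.month, "day": d.day}

def purchase_frequency(transactions):
    """Count purchases per customer from (customer_id, date) log entries."""
    return Counter(customer for customer, _ in transactions)

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = log(N / df).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: (count / len(doc)) * log(n / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]

# Hypothetical transaction log and two tiny review texts.
log_entries = [("C1", date(2024, 3, 15)), ("C2", date(2024, 3, 16)), ("C1", date(2024, 4, 2))]
print(date_features(date(2024, 3, 15)))
print(purchase_frequency(log_entries))
weights = tf_idf([["good", "phone"], ["bad", "phone"]])
```

In this variant a word that appears in every document, such as "phone" here, gets weight 0, while words that distinguish one document from another get a positive weight.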

B. Feature Transformation

Changing existing features to improve performance.

Examples:

  • Scaling numerical values
  • Log transformation to reduce skewness
  • Encoding categories
  • Binning values (Income → Low/Medium/High)
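
A short sketch of two of these transformations, log transformation and scaling, is given below. The use of `log1p` (log of 1 + x, so that zero is allowed) and of z-score standardization as the scaling method are assumptions; the incomes are invented.

```python
from math import log1p
from statistics import mean, pstdev

def log_transform(values):
    """Compress large values with log(1 + x); inputs must be greater than -1."""
    return [log1p(v) for v in values]

def standardize(values):
    """Scale to mean 0 and standard deviation 1 (assumes values are not all equal)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical skewed incomes: one value is 50 times the smallest.
incomes = [20_000, 40_000, 1_000_000]
logged = log_transform(incomes)
print(max(incomes) / min(incomes))   # spread before the transformation
print(max(logged) / min(logged))     # much smaller spread afterwards
print(standardize([1, 2, 3]))
```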

C. Feature Selection

Choosing only the most important features and removing irrelevant ones.

Benefits:

  • Faster model training
  • Often higher accuracy, because irrelevant features add noise and encourage overfitting
  • Less complexity
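
One simple way to choose features is to rank them by how strongly each correlates with the target and keep the top few. This is only one of many selection methods, used here as an assumption; the feature names and values are invented.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = sqrt(sum((x - mx) ** 2 for x in xs)) * sqrt(sum((y - my) ** 2 for y in ys))
    return cov / denom if denom else 0.0

def select_top_k(features, target, k):
    """Keep the k feature names with the largest absolute correlation to the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical features for predicting sales; store_id carries little signal.
features = {
    "ad_spend": [1, 2, 3, 4, 5],
    "store_id": [7, 3, 9, 1, 5],
    "discount": [5, 3, 4, 1, 2],
}
sales = [10, 21, 29, 42, 50]
print(select_top_k(features, sales, k=2))
```

Correlation only captures linear, one-at-a-time relationships, so a feature that matters only in combination with others could be missed by this filter.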

Final Summary

  • Data preprocessing cleans and prepares data.
  • Data exploration helps understand patterns before mining.
  • Feature engineering creates better variables for mining and analytics.

These steps are essential in building accurate business intelligence and data mining solutions.

Visualization and Statistical Summaries

Before performing analytics, businesses need to understand their data. Two major methods are:

1. Data Summarization

Data summarization means describing the dataset in a simple, understandable form.

Types of Summaries:

A. Descriptive Statistics

  • Mean – average value
  • Median – middle value
  • Mode – most frequent value
  • Range – difference between maximum & minimum
  • Standard deviation – how spread out the data is

Example: A company's average monthly sales are ₹10 lakh, which gives managers a quick sense of typical performance.
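
These five statistics are available in Python's built-in `statistics` module. The sketch below uses invented monthly sales figures (in lakh rupees) and the sample standard deviation; the `summarize` helper is a hypothetical name.

```python
from statistics import mean, median, mode, stdev

def summarize(values):
    """Return the descriptive statistics listed above for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "std_dev": stdev(values),      # sample standard deviation (divides by n - 1)
    }

# Hypothetical monthly sales in lakh rupees.
monthly_sales = [8, 10, 10, 12, 10]
summary = summarize(monthly_sales)
print(summary)
```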

B. Frequency Tables

Shows how often each value appears. Example: Number of customers in age groups (18–25, 26–35, etc.)
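
A frequency table can be built with `collections.Counter`. The age bands follow the example above; the band labels outside 18 to 35 and the list of ages are assumptions.

```python
from collections import Counter

def age_band(age):
    """Assign an age to a band; bands other than 18-25 and 26-35 are assumed."""
    if age < 18:
        return "under 18"
    if age <= 25:
        return "18-25"
    if age <= 35:
        return "26-35"
    return "36+"

def frequency_table(ages):
    """Count how many customers fall into each age band."""
    return Counter(age_band(a) for a in ages)

# Hypothetical customer ages.
ages = [19, 22, 25, 26, 30, 41, 35]
table = frequency_table(ages)
for band, count in table.items():
    print(f"{band:<8} {count}")
```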

C. Cross-Tabulation

Used to understand relationships between two variables. Example: Gender vs Product Preference.
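
A cross-tabulation counts how often each combination of two variables occurs. The sketch below uses invented survey rows; the field names and the `cross_tab` helper are hypothetical.

```python
from collections import Counter

def cross_tab(rows, row_var, col_var):
    """Count occurrences of each (row_var value, col_var value) pair."""
    return Counter((r[row_var], r[col_var]) for r in rows)

# Hypothetical survey responses.
responses = [
    {"gender": "Male", "product": "Laptop"},
    {"gender": "Female", "product": "Laptop"},
    {"gender": "Female", "product": "Tablet"},
    {"gender": "Male", "product": "Laptop"},
    {"gender": "Female", "product": "Tablet"},
]
counts = cross_tab(responses, "gender", "product")
for (gender, product), n in sorted(counts.items()):
    print(f"{gender:<7} {product:<7} {n}")
```

A combination that never occurs, such as Male and Tablet here, is simply absent from the counter and reads as 0 when looked up.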

2. Data Visualization for Business

Visualization means showing data using charts and graphs to see trends quickly.

Common Charts Used in Business Analytics:

  • Bar Chart: Compare categories (sales by region)
  • Pie Chart: Show percentage shares (market share)
  • Line Chart: Show trends over time (monthly sales)
  • Histogram: Show distribution of values (customer ages)
  • Scatter Plot: Show relationship between variables (advertising spend vs sales)
  • Heatmaps: Show intensity patterns (website user activity)
  • Dashboards: Real-time KPIs for business managers
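
To show the idea behind a bar chart without any plotting library, the sketch below draws one as text. In practice a charting tool or library would normally be used; the `text_bar_chart` helper and the regional sales figures are assumptions for illustration.

```python
def text_bar_chart(data, width=20):
    """Return one text line per category, with the largest bar `width` characters long."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<6} {bar} {value}")
    return lines

# Hypothetical sales by region.
sales_by_region = {"North": 40, "South": 20, "East": 10}
chart = text_bar_chart(sales_by_region)
print("\n".join(chart))
```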

Why is visualization important?

  • Makes complex data easy to understand
  • Helps quick decision-making
  • Highlights patterns, trends, and anomalies
  • Communicates insights clearly to management

Issues and Challenges in Data Preprocessing

Working with real-world data is difficult because of several challenges:

High Dimensionality

High dimensionality means data has too many variables (columns).

Example: A customer dataset with 10,000 attributes such as age, income, behaviour, and clicks.

Problems:

  • Hard to visualize
  • Slow analytics
  • Models become complex and can lose accuracy, because data grows sparse as dimensions are added (curse of dimensionality)

Solutions:

  • Feature selection
  • Dimensionality reduction (PCA)
  • Remove irrelevant or duplicate features
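
The last solution, removing irrelevant or duplicate features, can be sketched simply: a column whose value never changes carries no information, and an exact copy of another column adds nothing new. The helper name and sample table are assumptions; PCA is not shown because it needs linear algebra beyond this short example.

```python
def drop_redundant_columns(table):
    """Drop constant columns and exact duplicates from a {name: values} table."""
    kept, seen = {}, set()
    for name, column in table.items():
        signature = tuple(column)
        if len(set(column)) <= 1:     # constant column: same value in every row
            continue
        if signature in seen:         # exact copy of a column already kept
            continue
        seen.add(signature)
        kept[name] = column
    return kept

# Hypothetical customer table with one constant and one duplicated attribute.
table = {
    "age": [25, 32, 47],
    "country": ["India", "India", "India"],   # constant
    "age_copy": [25, 32, 47],                 # duplicate of age
    "clicks": [10, 3, 8],
}
print(list(drop_redundant_columns(table)))
```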

Scalability

Scalability refers to the ability to handle very large volumes of data.

Example: E-commerce platforms store millions of transactions daily.

Problems:

  • Slow processing
  • High storage cost
  • Difficulty in running algorithms on full data

Solutions:

  • Distributed systems (e.g., Hadoop, Spark)
  • Sampling techniques
  • Efficient algorithms
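
Two ideas from the list above can be sketched on a single machine: sampling from a stream that is too large to hold in memory (reservoir sampling), and computing a statistic in one pass without storing the data. Both functions are illustrative assumptions, not the method of any particular platform; distributed systems such as Hadoop or Spark are not shown.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k / (i + 1)
    return sample

def running_mean(stream):
    """Compute the mean in one pass with constant memory (0.0 for an empty stream)."""
    count, current = 0, 0.0
    for x in stream:
        count += 1
        current += (x - current) / count
    return current

# A generator stands in for a transaction feed that never fits in memory at once.
transactions = (amount for amount in range(1, 101))
print(running_mean(transactions))
print(reservoir_sample(range(1000), k=5))
```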

Missing Values

Real-world data often contains blank or incomplete values.

Examples:

  • Customer didn't enter phone number
  • Sensor failed to record temperature
  • Interview form incomplete

Problems:

  • Bias in analysis
  • Incorrect model predictions
  • Loss of important information

Solutions:

  • Replace with mean/median/mode
  • Use predictive models to fill missing data
  • Remove rows or columns with too many missing values
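
The first and third solutions can be sketched as below: drop rows with too many blanks, then fill the remaining gaps with the median (numeric column) or mode (categorical column). The threshold of one missing value per row, the helper names, and the sample rows are assumptions.

```python
from statistics import median, mode

def drop_sparse_rows(rows, max_missing=1):
    """Remove rows that have more than `max_missing` blank (None) fields."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

def impute(rows, column, strategy):
    """Fill blanks in one column with the median or the mode of its observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

# Hypothetical customer records; None marks a blank field.
customers = [
    {"age": 30, "city": "Pune"},
    {"age": None, "city": "Pune"},
    {"age": 40, "city": None},
    {"age": None, "city": None},     # too incomplete to keep
    {"age": 50, "city": "Delhi"},
]
usable = drop_sparse_rows(customers)
usable = impute(usable, "age", "median")
usable = impute(usable, "city", "mode")
print(usable)
```

Filling with a typical value keeps the row count up but can hide real variation, so the share of imputed values is worth reporting alongside any result.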

Final Summary 

Topic               | Meaning                  | Why Important?
Data Summarization  | Quick numeric overview   | Helps understand data quickly
Visualization       | Graphs & charts          | Shows trends and patterns
High Dimensionality | Too many attributes      | Hard to analyze, slows down models
Scalability         | Very large datasets      | Requires efficient tools
Missing Values      | Blank or incomplete data | Causes errors and bias