Unit 3: Data Preprocessing and Exploration
Before applying data mining or analytics techniques, raw data must be cleaned and prepared. This step ensures accuracy, consistency, and better decision-making.
Data Preparation Techniques
A. Data Cleaning
Data cleaning means fixing problems in raw data so that analysis becomes reliable. It includes (see the sketch after this list):
- Handling missing values. Example: replacing blank entries with average values or removing incomplete rows.
- Removing noise (errors). Example: correcting misspelt names or unrealistic values.
- Identifying outliers. Example: a salary of ₹10 crore in a dataset of freshers.
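A minimal pandas sketch of these cleaning steps on a small, made-up freshers dataset (names, figures, and column names are all illustrative; the median is used for filling because it is robust to the outlier):

```python
import pandas as pd

# Hypothetical freshers dataset with one blank salary and one unrealistic value
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena", "John", "Sara", "Amit"],
    "salary": [300000, 350000, None, 420000, 380000, 100000000],  # ₹10 crore outlier
})

# Handle missing values: fill the blank with the median (robust to the outlier)
df["salary"] = df["salary"].fillna(df["salary"].median())

# Identify outliers with a simple IQR rule and keep only plausible rows
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
plausible = df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[plausible])
```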
B. Data Integration
Combining data from multiple sources into one unified view.
Example: sales data from CRM + customer data from ERP + website data from Google Analytics.
Challenges:
- Different formats
- Conflicting values
- Duplicate records
Solution: Use metadata and integration rules.
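One possible integration sketch in pandas, assuming hypothetical CRM and ERP extracts that share a customer_id key (all names and values are illustrative):

```python
import pandas as pd

# Hypothetical extracts from two systems
crm_sales = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [500, 750, 300]})
erp_customers = pd.DataFrame({"customer_id": [1, 2, 4], "city": ["Pune", "Delhi", "Chennai"]})

# Integrate on the shared key; a NaN city exposes a record missing from the ERP feed
unified = crm_sales.merge(erp_customers, on="customer_id", how="left")

# Remove duplicate records that appear in both feeds
unified = unified.drop_duplicates()
print(unified)
```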
C. Data Transformation
Changing data into a suitable form for analysis (a combined sketch follows this list):
- Normalization - Scaling values to a similar range (e.g., 0 – 1)
- Aggregation - Summarizing: daily sales → monthly sales
- Encoding - Converting text categories into numbers (e.g., Male = 1, Female = 2)
- Smoothing - Removing noise by applying averages.
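A combined sketch of these transformations in pandas, using a small hypothetical sales table (column names and category codes are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-02-01"]),
    "gender": ["Male", "Female", "Male"],
    "sales": [200.0, 450.0, 300.0],
})

# Normalization: min-max scaling of sales into the 0-1 range
df["sales_scaled"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())

# Aggregation: daily sales rolled up to monthly totals
monthly = df.groupby(df["date"].dt.to_period("M"))["sales"].sum()

# Encoding: map text categories to numbers (e.g., Male = 1, Female = 2)
df["gender_code"] = df["gender"].map({"Male": 1, "Female": 2})

# Smoothing: a 2-point moving average to reduce noise
df["sales_smooth"] = df["sales"].rolling(window=2, min_periods=1).mean()
print(df, monthly, sep="\n")
```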
D. Data Reduction
Reducing the size of data while keeping the important information (see the sketch after this list):
- Dimensionality reduction - Removing unnecessary attributes.
- Sampling - Using a smaller part of large datasets for quick analysis.
- Compression - Reducing storage size.
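A rough pandas sketch of these reduction ideas; randomly generated data stands in for a large, wide transaction table, and all column and file names are illustrative:

```python
import numpy as np
import pandas as pd

# Stand-in for a large, wide transaction table
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100_000, 50)),
                  columns=[f"attr_{i}" for i in range(50)])

# Sampling: work on a 10% random sample for quick analysis
sample = df.sample(frac=0.1, random_state=42)

# Dimensionality reduction (simple form): keep only the attributes needed for the task
reduced = sample[["attr_0", "attr_1", "attr_2"]]

# Compression: store the result in a compressed format to cut storage size
reduced.to_csv("transactions_reduced.csv.gz", index=False, compression="gzip")
```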
E. Discretization
Converting continuous data into discrete categories; a short sketch follows the examples.
Example:
- Age → {Youth, Adult, Senior}
- Income → {Low, Medium, High}
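A short discretization sketch with pandas, using pd.cut for fixed age bands and pd.qcut for equal-sized income bands (the cut-off values are illustrative):

```python
import pandas as pd

ages = pd.Series([19, 24, 37, 45, 63, 71])
incomes = pd.Series([250000, 480000, 900000, 1500000, 320000, 2200000])

# Age -> {Youth, Adult, Senior} using fixed boundaries
age_groups = pd.cut(ages, bins=[0, 25, 60, 120], labels=["Youth", "Adult", "Senior"])

# Income -> {Low, Medium, High} using equal-sized quantile bins
income_bands = pd.qcut(incomes, q=3, labels=["Low", "Medium", "High"])

print(age_groups.value_counts())
print(income_bands.value_counts())
```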
F. Concept Hierarchy
Organizing data attributes into levels, from detailed (low-level) concepts to general (high-level) concepts.
Example:
- City → State → Country
- Product model → Product category → Product family
Concept hierarchies are used in OLAP for roll-up and drill-down, as sketched below.
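A small pandas sketch of rolling up a hypothetical sales table along a city → state → country hierarchy (all values are illustrative):

```python
import pandas as pd

# Hypothetical sales rows tagged with a city -> state -> country hierarchy
sales = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Delhi"],
    "state": ["Maharashtra", "Maharashtra", "Delhi"],
    "country": ["India", "India", "India"],
    "amount": [100, 200, 150],
})

# Drill-down view: totals at the most detailed level (city)
by_city = sales.groupby(["country", "state", "city"])["amount"].sum()

# Roll-up: move one level up the hierarchy (state), then to the top (country)
by_state = sales.groupby(["country", "state"])["amount"].sum()
by_country = sales.groupby("country")["amount"].sum()
print(by_state)
```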
Data Exploration
Before mining, analysts explore data using:
- Descriptive statistics (mean, median, mode)
- Visualization (charts, histograms)
- Correlation analysis (relationship between variables)
- Distribution analysis (normal, skewed)
This exploration helps identify (see the sketch below):
- Trends
- Patterns
- Anomalies
- Data errors
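A brief exploration sketch in pandas on a tiny illustrative advertising-versus-sales table (the histogram assumes matplotlib is installed):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales": [110, 190, 310, 390, 520],
})

# Descriptive statistics: count, mean, std, min, quartiles, max in one call
print(df.describe())

# Correlation analysis: relationship between advertising spend and sales
print(df.corr())

# Distribution analysis: a quick histogram of the sales values
df["sales"].plot(kind="hist", bins=5, title="Sales distribution")
plt.show()
```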
Feature Engineering
Feature Engineering means creating useful input variables (features) that improve model performance.
A. Feature Extraction
Deriving new variables from raw data; see the sketch after these examples.
Examples:
- Extracting “Month” or “Day” from a date
- Extracting “Frequency of Purchase” from transaction logs
- Extracting text features using TF-IDF (for sentiment analysis)
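A sketch of these extraction ideas, assuming a hypothetical order log and a few sample review texts; the TF-IDF step uses scikit-learn:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-01-20"]),
})

# Date parts as new features
orders["month"] = orders["order_date"].dt.month
orders["day"] = orders["order_date"].dt.day_name()

# Purchase frequency per customer derived from the transaction log
freq = orders.groupby("customer_id").size().rename("purchase_count")

# Text features with TF-IDF (e.g., as input to sentiment analysis)
reviews = ["great product", "poor quality", "great value, great service"]
tfidf = TfidfVectorizer().fit_transform(reviews)
print(freq, tfidf.shape, sep="\n")
```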
B. Feature Transformation
Changing existing features to improve model performance; see the sketch after these examples.
Examples:
- Scaling numerical values
- Log transformation to reduce skewness
- Encoding categories
- Binning values (Income → Low/Medium/High)
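A short sketch of these transformations on a hypothetical income column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [250000, 480000, 900000, 5000000]})

# Scaling: standardize to zero mean and unit variance
df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Log transformation to reduce right skew
df["income_log"] = np.log1p(df["income"])

# Binning into Low / Medium / High bands
df["income_band"] = pd.qcut(df["income"], q=3, labels=["Low", "Medium", "High"])
print(df)
```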
C. Feature Selection
Choosing only the most important features and removing irrelevant ones; a short sketch follows the benefits below.
Benefits:
- Faster model training
- Higher accuracy
- Less complexity
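One possible selection sketch using scikit-learn's SelectKBest, with hypothetical features and a made-up churn label:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical features and target (all names and values are illustrative)
X = pd.DataFrame({
    "age": [25, 40, 31, 52, 46, 29],
    "monthly_spend": [500, 1500, 700, 2500, 2100, 600],
    "random_noise": [3, 1, 4, 1, 5, 9],
})
y = [0, 1, 0, 1, 1, 0]  # churned or not

# Keep the 2 features most associated with the target
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = X.columns[selector.get_support()]
print(list(selected))
```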
Final Summary
- Data preprocessing cleans and prepares data.
- Data exploration helps understand patterns before mining.
- Feature engineering creates better variables for mining and analytics.
These steps are essential in building accurate business intelligence and data mining solutions.
Visualization and Statistical Summaries
Before performing analytics, businesses need to understand their data. Two major methods are:
1. Data Summarization
Data summarization means describing the dataset in a simple, understandable form.
Types of Summaries:
A. Descriptive Statistics
- Mean – average value
- Median – middle value
- Mode – most frequent value
- Range – difference between maximum & minimum
- Standard deviation – how spread out the data is
Example: a company's average monthly sales are ₹10 lakh; this gives managers a quick insight.
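A quick sketch of these statistics in pandas, using hypothetical monthly sales figures (in ₹ lakh) whose mean works out to 10:

```python
import pandas as pd

monthly_sales_lakh = pd.Series([8, 9, 10, 12, 11, 10])  # hypothetical figures in ₹ lakh

print("Mean:", monthly_sales_lakh.mean())
print("Median:", monthly_sales_lakh.median())
print("Mode:", monthly_sales_lakh.mode().tolist())
print("Range:", monthly_sales_lakh.max() - monthly_sales_lakh.min())
print("Std dev:", monthly_sales_lakh.std())
```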
B. Frequency Tables
A frequency table shows how often each value appears. Example: the number of customers in each age group (18–25, 26–35, etc.)
C. Cross-Tabulation
Cross-tabulation is used to examine the relationship between two variables. Example: Gender vs Product Preference.
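A small pandas sketch covering both a frequency table and a cross-tabulation on a hypothetical customer table:

```python
import pandas as pd

customers = pd.DataFrame({
    "age": [19, 24, 31, 28, 45, 52],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "product": ["A", "B", "A", "A", "B", "B"],
})

# Frequency table: customers per age group
age_groups = pd.cut(customers["age"], bins=[18, 25, 35, 60],
                    labels=["18-25", "26-35", "36-60"])
print(age_groups.value_counts())

# Cross-tabulation: gender vs product preference
print(pd.crosstab(customers["gender"], customers["product"]))
```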
2. Data Visualization for Business
Visualization means showing data using charts and graphs to see trends quickly.
Common Charts Used in Business Analytics (a plotting sketch follows this list):
- Bar Chart: Compare categories (sales by region)
- Pie Chart: Show percentage shares (market share)
- Line Chart: Show trends over time (monthly sales)
- Histogram: Show distribution of values (customer ages)
- Scatter Plot: Show relationship between variables (advertising spend vs sales)
- Heatmaps: Show intensity patterns (website user activity)
- Dashboards: Real-time KPIs for business managers
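A minimal matplotlib sketch of three of these chart types, using made-up figures:

```python
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
region_sales = [120, 95, 60, 80]          # hypothetical sales in ₹ lakh
months = ["Jan", "Feb", "Mar", "Apr"]
monthly_sales = [30, 35, 28, 40]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(regions, region_sales)        # bar chart: compare categories
axes[0].set_title("Sales by region")
axes[1].plot(months, monthly_sales)       # line chart: trend over time
axes[1].set_title("Monthly sales")
axes[2].scatter([10, 20, 30, 40], [110, 190, 310, 390])  # scatter: ad spend vs sales
axes[2].set_title("Ad spend vs sales")
plt.tight_layout()
plt.show()
```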
Why is visualization important?
- Makes complex data easy to understand
- Helps quick decision-making
- Highlights patterns, trends, and anomalies
- Communicates insights clearly to management
Issues and Challenges in Data Preprocessing
Working with real-world data is difficult because of several challenges:
High Dimensionality
High dimensionality means data has too many variables (columns).
Example: Customer dataset having 10,000 attributes like age, income, behaviour, clicks, etc.
Problems:
- Hard to visualize
- Slow analytics
- Models become complex and inaccurate (curse of dimensionality)
Solutions:
- Feature selection
- Dimensionality reduction (PCA; see the sketch below)
- Remove irrelevant or duplicate features
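A short sketch of the PCA solution using scikit-learn, with randomly generated data standing in for a high-dimensional customer matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for a high-dimensional customer matrix: 1,000 rows x 200 attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))

# Standardize, then project onto the top 10 principal components
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
```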
Scalability
Scalability refers to the ability to handle very large volumes of data.
Example: E-commerce platforms store millions of transactions daily.
Problems:
- Slow processing
- High storage cost
- Difficulty in running algorithms on full data
Solutions (see the sketch after this list):
- Distributed systems (e.g., Hadoop, Spark)
- Sampling techniques
- Efficient algorithms
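On a single machine, chunked processing is one simple scalability tactic; the sketch below first generates a stand-in transaction file and then aggregates it chunk by chunk without loading it all into memory (file and column names are illustrative):

```python
import numpy as np
import pandas as pd

# Create a stand-in transaction file (in practice this data would already exist)
demo = pd.DataFrame({"amount": np.random.default_rng(2).uniform(10, 500, size=1_000_000)})
demo.to_csv("transactions.csv", index=False)

# Chunked processing: aggregate 100,000 rows at a time
total_revenue = 0.0
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    total_revenue += chunk["amount"].sum()
print("Total revenue:", round(total_revenue, 2))
```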
Missing Values
Real-world data often contains blank or incomplete values.
Examples:
- Customer didn't enter phone number
- Sensor failed to record temperature
- Interview form incomplete
Problems:
- Bias in analysis
- Incorrect model predictions
- Loss of important information
Solutions (an imputation sketch follows this list):
- Replace with mean/median/mode
- Use predictive models to fill missing data
- Remove rows or columns with too many missing values
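A small imputation sketch using scikit-learn's SimpleImputer on a hypothetical sensor table; the last step shows the drop-column alternative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "temperature": [21.0, np.nan, 23.5, 22.0, np.nan],
    "humidity": [60, 58, np.nan, 55, 57],
})

# Replace missing numeric values with the column median
imputer = SimpleImputer(strategy="median")
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Alternatively, keep only columns with more than half their values present
df_trimmed = df.dropna(axis=1, thresh=len(df) // 2 + 1)
print(df_filled)
```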
Final Summary
| Topic | Meaning | Why Important? |
|---|---|---|
| Data Summarization | Quick numeric overview | Helps understand data quickly |
| Visualization | Graphs & charts | Shows trends and patterns |
| High Dimensionality | Too many attributes | Hard to analyze, slows down models |
| Scalability | Very large datasets | Requires efficient tools |
| Missing Values | Blank or incomplete data | Causes errors and bias |