Data Mining
Overview of Data Mining
In today’s digital world, organizations store huge amounts of data in databases, data warehouses, and cloud systems. However, raw data alone is not useful unless we extract hidden patterns, trends, and knowledge from it.
Data Mining is the process of discovering useful information from large datasets.
In simple words: Data Mining = finding valuable patterns in large volumes of data
Motivation for Data Mining
Why do we need data mining?
| Reason | Explanation |
|---|---|
| Huge data growth | Manual analysis is impossible |
| Business competition | Better decisions give advantage |
| Automation | Computers find patterns faster |
| Hidden knowledge | Patterns not visible to humans |
| Prediction | Forecast future trends |
Real-Life Examples
- Banks detect fraudulent transactions
- E-commerce recommends products
- Colleges analyze student performance
- Hospitals predict disease risks
Definition of Data Mining
Data Mining is the process of extracting interesting, non-trivial, previously unknown, and useful patterns or knowledge from large datasets.
Key Terms Explained
| Term | Meaning |
|---|---|
| Interesting | Useful for decision making |
| Non-trivial | Not obvious |
| Large datasets | Huge volume of data |
| Knowledge | Patterns, rules, predictions |
Functionalities of Data Mining
These are the main tasks performed in data mining.
| Functionality | Description |
|---|---|
| Characterization | Summarizes data features |
| Discrimination | Compares different classes |
| Association | Finds relationships (e.g., market basket) |
| Classification | Assigns data to predefined classes |
| Prediction | Forecasts future values |
| Clustering | Groups similar data |
| Outlier Analysis | Finds abnormal data |
| Evolution Analysis | Studies trends over time |
Data Processing in Data Mining
Data processing is a step-by-step procedure to convert raw data into useful information.
Steps of Data Processing
| Step | Description |
|---|---|
| Data Collection | Gather data from sources |
| Data Cleaning | Remove errors & noise |
| Data Integration | Combine multiple sources |
| Data Transformation | Normalize or aggregate |
| Data Reduction | Reduce size without losing meaning |
| Data Mining | Apply mining algorithms |
| Evaluation | Validate patterns |
| Presentation | Show results using graphs/reports |
Data Pre-processing
Data Pre-processing prepares raw data for mining.
“Garbage in, Garbage out”
Poor data quality leads to poor mining results.
Forms of Data Pre-processing
| Form | Purpose |
|---|---|
| Data Cleaning | Remove errors & noise |
| Data Integration | Merge multiple datasets |
| Data Transformation | Normalize or scale data |
| Data Reduction | Reduce data size |
| Data Discretization | Convert continuous to discrete |
Data Cleaning
Data Cleaning removes incorrect, incomplete, or inconsistent data.
Common Data Problems
- Missing values
- Noisy data
- Duplicate records
- Inconsistent formats
Handling Missing Values
Missing values occur when data is not recorded properly.
Methods to Handle Missing Values
| Method | Explanation |
|---|---|
| Ignore the record | Remove data row |
| Manual filling | Fill by expert |
| Mean/Median | Replace with average |
| Most frequent | Replace with common value |
| Prediction | Use regression or ML |
Exam Tip: The mean method is the most commonly used.
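A minimal sketch of mean imputation with pandas; the dataset and column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up student records with missing marks (NaN)
df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena", "John"],
                   "marks": [78.0, np.nan, 91.0, np.nan]})

# Mean imputation: replace each missing value with the column average
df["marks"] = df["marks"].fillna(df["marks"].mean())
print(df["marks"].tolist())  # [78.0, 84.5, 91.0, 84.5]
```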
Noisy Data
Noisy data contains random errors or incorrect values.
Example: Age = 250 years
Methods to Handle Noisy Data
Binning
- Sort data
- Divide into bins
- Bin mean
- Bin median
- Bin boundaries
Advantage: Simple and effective
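A minimal sketch of smoothing by bin means, assuming equal-frequency bins; the values are invented for illustration:

```python
import numpy as np

# Made-up values, sorted first as the binning steps require
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26]))

# Divide into 3 equal-frequency bins, then replace each value by its bin's mean
bins = np.array_split(data, 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 7.  7.  7. 19. 19. 19. 25. 25. 25.]
```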
Clustering
- Groups similar data points
- Outliers are treated as noise
Advantage: Works well for large datasets
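One way to treat outliers as noise via clustering, sketched with scikit-learn's KMeans; the readings and the "near-empty cluster" rule are illustrative assumptions, not a fixed recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up ages; 250 is clearly abnormal
x = np.array([22.0, 25, 24, 23, 250, 26, 21]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x)
labels, counts = np.unique(km.labels_, return_counts=True)

# Points that land in a near-empty cluster are flagged as noise
tiny_clusters = labels[counts <= 1]
print(x[np.isin(km.labels_, tiny_clusters)].ravel())  # [250.]
```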
Regression
- Fits the data to a mathematical function
- Smooths noisy values toward the fitted curve
Example: Sales prediction using linear regression
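A minimal sketch of regression-based smoothing using NumPy's polyfit; the sales figures are made up:

```python
import numpy as np

# Made-up monthly sales with one noisy spike (30)
months = np.arange(1, 9)
sales = np.array([10, 13, 11, 30, 15, 17, 16, 19])

# Fit a straight line (linear regression), then replace values with fitted ones
slope, intercept = np.polyfit(months, sales, deg=1)
smoothed = slope * months + intercept
print(np.round(smoothed, 1))
```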
Computer Inspection
- Automated programs detect noise
- Uses rules and algorithms
Used when: datasets are large
Human Inspection
- Experts manually examine data
- Time-consuming but accurate
Used when: datasets are critical
Comparison Table (Quick Revision)
| Technique | Used For |
|---|---|
| Binning | Smooth noisy data |
| Clustering | Identify outliers |
| Regression | Predict & smooth |
| Computer Inspection | Automated cleaning |
| Human Inspection | Manual verification |
Exam-Friendly Summary
- Data Mining extracts useful knowledge
- Motivated by huge data growth
- Data preprocessing is essential
- Data cleaning improves accuracy
- Missing values and noisy data must be handled properly
- Binning, clustering, and regression are key noise-handling techniques
Inconsistent Data
Inconsistent data occurs when the same data item has different values or formats in different places.
Examples
- Gender stored as M/F in one table and Male/Female in another
- Date formats: DD-MM-YYYY vs MM-DD-YYYY
- Sales total ≠ sum of item-wise sales
Causes
| Cause | Explanation |
|---|---|
| Multiple data sources | Different systems follow different rules |
| Data entry errors | Manual mistakes |
| Update anomalies | Partial updates |
| Different standards | Units, codes, formats |
Handling Inconsistent Data
| Method | Description |
|---|---|
| Standardization | Use common formats and codes |
| Constraint checking | Apply domain rules |
| Data validation | Verify against reference data |
| Data reconciliation | Resolve conflicts using rules |
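A small standardization sketch in pandas, assuming a made-up gender column that mixes the two encodings from the example above:

```python
import pandas as pd

# Made-up records mixing M/F codes with Male/Female labels
df = pd.DataFrame({"gender": ["M", "Female", "F", "Male"]})

# Standardization: map every variant onto one agreed format
df["gender"] = df["gender"].map({"M": "Male", "F": "Female",
                                 "Male": "Male", "Female": "Female"})
print(df["gender"].tolist())  # ['Male', 'Female', 'Female', 'Male']
```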
Data Integration
Data Integration combines data from multiple heterogeneous sources into a single, unified dataset.
Issues in Data Integration
| Issue | Meaning |
|---|---|
| Schema integration | Different attribute names/types |
| Redundancy | Duplicate attributes or records |
| Entity identification | Same entity with different IDs |
| Value conflicts | Different values for same attribute |
Solutions
- Metadata management
- Data matching & deduplication
- Conflict resolution rules (source priority)
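A minimal integration sketch with pandas; the two source tables, their column names, and the rename rule are all invented for illustration:

```python
import pandas as pd

# Two made-up sources describing the same customers with different schemas
crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"customer": [1, 2, 2], "city": ["Pune", "Delhi", "Delhi"]})

# Schema integration: align attribute names; redundancy: drop duplicate records
billing = billing.rename(columns={"customer": "cust_id"}).drop_duplicates()

# Merge on the shared key into one unified dataset
unified = crm.merge(billing, on="cust_id", how="outer")
print(unified)
```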
Data Transformation
Data Transformation converts data into a suitable format for mining.
Common Transformation Techniques
| Technique | Purpose |
|---|---|
| Normalization | Scale values to a common range (e.g., 0–1) |
| Aggregation | Summarize data (daily → monthly) |
| Attribute construction | Create new attributes |
| Encoding | Convert categorical to numeric |
| Smoothing | Remove noise |
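For example, min–max normalization rescales each value with (v − min) / (max − min). A minimal sketch with made-up marks:

```python
import numpy as np

# Min-max normalization: scale values into the range [0, 1]
marks = np.array([35.0, 60, 78, 92])
normalized = (marks - marks.min()) / (marks.max() - marks.min())
print(np.round(normalized, 2))  # [0.   0.44 0.75 1.  ]
```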
Data Reduction (Overview)
Data Reduction reduces data size without losing important information, improving speed and efficiency.
Data Cube Aggregation
Data Cube Aggregation summarizes data across dimensions.
Example: Sales by Day → Month → Year
| Level | Example |
|---|---|
| Low | Daily sales |
| Medium | Monthly sales |
| High | Yearly sales |
Benefit: Faster OLAP queries
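A tiny roll-up sketch using pandas; the dates and figures are made up, and only the Day → Month step is shown:

```python
import pandas as pd

# Made-up daily sales
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "sales": [100, 150, 120],
})

# Roll-up: Day -> Month (Month -> Year works the same way)
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)  # 2024-01: 250, 2024-02: 120
```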
Dimensionality Reduction
Reduces the number of attributes (features).
Techniques
| Method | Explanation |
|---|---|
| Attribute selection | Remove irrelevant attributes |
| PCA (Principal Component Analysis) | Combine correlated attributes |
| Feature extraction | Create compact features |
Benefit: Less storage, faster mining, less noise
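A minimal PCA sketch with scikit-learn, assuming two artificially correlated attributes:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: two nearly redundant attributes (height in cm and in inches)
rng = np.random.default_rng(0)
cm = rng.normal(170, 10, size=100)
X = np.column_stack([cm, cm / 2.54 + rng.normal(0, 0.1, size=100)])

# PCA folds the correlated columns into a single principal component
X_reduced = PCA(n_components=1).fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # (100, 2) -> (100, 1)
```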
Data Compression
Data Compression encodes data into a smaller representation.
| Type | Description |
|---|---|
| Lossless | No data loss (e.g., Run-Length Encoding) |
| Lossy | Some loss allowed (rare in mining) |
Benefit: Saves storage, faster I/O
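A minimal lossless Run-Length Encoding sketch in plain Python; the input string is illustrative:

```python
from itertools import groupby

def rle_encode(s):
    # Each run of repeated symbols becomes a (symbol, count) pair
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

encoded = rle_encode("AAAABBBCC")
print(encoded)                             # [('A', 4), ('B', 3), ('C', 2)]
assert rle_decode(encoded) == "AAAABBBCC"  # lossless: fully recoverable
```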
Numerosity Reduction
Represents data using smaller models instead of raw data.
Methods
| Method | Description |
|---|---|
| Parametric | Regression models |
| Non-parametric | Histograms, clustering |
| Sampling | Random or stratified samples |
Benefit: Efficient for very large datasets
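A small sampling sketch with pandas; the table and the 10% fraction are made-up choices:

```python
import pandas as pd

# Made-up table of 1,000 rows
df = pd.DataFrame({"grade": ["A", "B"] * 500, "marks": range(1000)})

simple = df.sample(frac=0.1, random_state=0)  # simple random sample
stratified = df.groupby("grade", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))  # same fraction per grade
print(len(simple), len(stratified))  # 100 100
```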
Discretization
Discretization converts continuous values into intervals.
Example: Age → {0–18, 19–35, 36–60, 60+}
Methods
| Method | Explanation |
|---|---|
| Equal-width | Same interval size |
| Equal-frequency | Same number of values per bin |
| Entropy-based | Uses information gain |
Benefit: Simplifies mining and improves interpretability
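A minimal discretization sketch with pandas, reusing the age intervals from the example above (the ages themselves are made up):

```python
import pandas as pd

ages = pd.Series([5, 17, 22, 34, 45, 59, 63, 71])

# Interval-based binning using the cut points from the example above
groups = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                labels=["0-18", "19-35", "36-60", "60+"])
print(groups.tolist())

# Equal-frequency: roughly the same number of values per bin
print(pd.qcut(ages, q=4).value_counts().tolist())  # [2, 2, 2, 2]
```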
Concept Hierarchy Generation
A Concept Hierarchy organizes data from low-level detail to high-level abstraction.
Example
City → State → Country
Day → Month → Year
Generation Methods
| Method | Description |
|---|---|
| Schema-based | Defined by database schema |
| Set-grouping | Group similar values |
| Rule-based | Uses business rules |
Used in: Roll-up and Drill-down operations
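A toy schema-based hierarchy in plain Python; the cities and the two-level lookup are invented for illustration:

```python
# City -> State -> Country, defined from the schema (made-up values)
hierarchy = {
    "Mumbai": ("Maharashtra", "India"),
    "Pune": ("Maharashtra", "India"),
    "Austin": ("Texas", "USA"),
}

def roll_up(city, level):
    state, country = hierarchy[city]
    return state if level == "State" else country

print(roll_up("Pune", "State"))    # Maharashtra
print(roll_up("Pune", "Country"))  # India (highest abstraction)
```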
Decision Tree
A Decision Tree is a classification and prediction model represented as a tree.
Structure
| Component | Meaning |
|---|---|
| Root node | First decision |
| Internal node | Attribute test |
| Leaf node | Class label |
How It Works
- Select best attribute (Information Gain / Gini Index)
- Split data recursively
- Stop when nodes are pure or a stopping criterion is met
Advantages
| Advantage | Reason |
|---|---|
| Easy to understand | Visual and rule-based |
| Fast | Simple computations |
| Minimal preprocessing | Handles mixed data |
Disadvantages
| Issue | Reason |
|---|---|
| Overfitting | Too many splits |
| Instability | Small data change affects tree |
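A minimal decision-tree sketch with scikit-learn on the classic iris dataset; the shallow max_depth is an illustrative guard against the overfitting noted above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Shallow tree: limiting depth is one simple way to curb overfitting
# (scikit-learn splits on the Gini index by default)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The fitted tree reads as human-readable if/else rules
print(export_text(tree, feature_names=list(iris.feature_names)))
```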
Quick Revision Table (Exam-Ready)
| Topic | Key Point |
|---|---|
| Inconsistent Data | Conflicting values/formats |
| Data Integration | Combine multiple sources |
| Data Transformation | Convert to mining-ready form |
| Data Cube Aggregation | Summarize multidimensional data |
| Dimensionality Reduction | Reduce attributes |
| Data Compression | Reduce storage |
| Numerosity Reduction | Model-based reduction |
| Discretization | Continuous → discrete |
| Concept Hierarchy | Detail → abstraction |
| Decision Tree | Classification model |
Exam-Friendly Conclusion
- Clean, integrated, and reduced data is essential for accurate mining
- Data reduction improves performance without losing meaning
- Discretization and concept hierarchies simplify analysis
- Decision trees are powerful and interpretable classification tools