Data Mining



Overview of Data Mining

In today’s digital world, organizations store huge amounts of data in databases, data warehouses, and cloud systems. However, raw data alone is not useful unless we extract hidden patterns, trends, and knowledge from it.

Data Mining is the process of discovering useful information from large datasets.

In simple words: Data Mining = Finding valuable patterns from large data

Motivation for Data Mining

Why do we need data mining?

  • Huge data growth: manual analysis is impossible
  • Business competition: better decisions give an advantage
  • Automation: computers find patterns faster
  • Hidden knowledge: patterns are not visible to humans
  • Prediction: forecast future trends

Real-Life Examples

  • Banks detect fraud transactions
  • E-commerce recommends products
  • Colleges analyze student performance
  • Hospitals predict disease risks

Definition of Data Mining

Data Mining is the process of extracting interesting, non-trivial, previously unknown, and useful patterns or knowledge from large datasets.

Key Terms Explained

  • Interesting: useful for decision making
  • Non-trivial: not obvious
  • Large datasets: huge volume of data
  • Knowledge: patterns, rules, predictions

Functionalities of Data Mining

These are the main tasks performed in data mining.

  • Characterization: summarizes the features of a class of data
  • Discrimination: compares different classes
  • Association: finds relationships (e.g., market basket analysis)
  • Classification: assigns data to predefined classes
  • Prediction: forecasts future values
  • Clustering: groups similar data
  • Outlier Analysis: finds abnormal data
  • Evolution Analysis: studies trends over time

Data Processing in Data Mining

Data processing is a step-by-step procedure to convert raw data into useful information.

Steps of Data Processing

  • Data Collection: gather data from sources
  • Data Cleaning: remove errors and noise
  • Data Integration: combine multiple sources
  • Data Transformation: normalize or aggregate
  • Data Reduction: reduce size without losing meaning
  • Data Mining: apply mining algorithms
  • Evaluation: validate the discovered patterns
  • Presentation: show results using graphs and reports

Data Pre-processing

Data Pre-processing prepares raw data for mining.

“Garbage in, Garbage out”
Poor data quality leads to poor mining results.

Forms of Data Pre-processing

  • Data Cleaning: remove errors and noise
  • Data Integration: merge multiple datasets
  • Data Transformation: normalize or scale data
  • Data Reduction: reduce data size
  • Data Discretization: convert continuous values to discrete intervals

Data Cleaning

Data Cleaning removes incorrect, incomplete, or inconsistent data.

Common Data Problems

  • Missing values
  • Noisy data
  • Duplicate records
  • Inconsistent formats

Handling Missing Values

Missing values occur when data is not recorded properly.

Methods to Handle Missing Values

  • Ignore the record: remove the data row entirely
  • Manual filling: value filled in by a domain expert
  • Mean/Median: replace with the average (or middle) value of the attribute
  • Most frequent: replace with the most common value
  • Prediction: estimate the missing value using regression or machine learning

Exam Tip: Replacing missing values with the attribute mean is the most commonly used method.
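The mean-imputation method above can be sketched in a few lines of Python (the `ages` list and the use of `None` for a missing value are illustrative assumptions):

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

ages = [25, None, 30, 35, None]
print(impute_mean(ages))  # both None entries become 30, the mean of 25, 30, 35
```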

Noisy Data

Noisy data contains random errors or incorrect values.

Example: Age = 250 years 

Methods to Handle Noisy Data

Binning

  • Sort the data
  • Divide it into bins
  • Replace the values in each bin using:
      • Bin mean
      • Bin median
      • Bin boundaries

Advantage: Simple and effective
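The binning steps above can be sketched in Python. This sketch assumes equal-frequency bins and smoothing by bin means; the price list is an illustrative example:

```python
def smooth_by_bin_means(values, n_bins):
    """Sort values, split into equal-frequency bins, replace each value by its bin mean."""
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        start = i * size
        # the last bin absorbs any leftover values
        end = start + size if i < n_bins - 1 else len(data)
        bin_vals = data[start:end]
        m = sum(bin_vals) / len(bin_vals)
        smoothed.extend([m] * len(bin_vals))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 3))  # first bin (4, 8, 9, 15) becomes 9.0 four times
```

Smoothing by bin medians or bin boundaries works the same way; only the value chosen to represent each bin changes.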

Clustering

  • Groups similar data points
  • Outliers are treated as noise

Advantage: Works well for large datasets

Regression

  • Fits data into a mathematical function
  • Smooths noisy values

Example: Sales prediction using linear regression
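Regression-based smoothing fits a line through the data and replaces the noisy observations with the fitted values. A minimal ordinary-least-squares sketch (the month and sales numbers are made up for illustration):

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Noisy monthly sales; values on the fitted line smooth out the noise.
months = [1, 2, 3, 4, 5]
sales = [10, 13, 11, 16, 15]
a, b = linear_fit(months, sales)
smoothed = [a + b * x for x in months]
print([round(v, 1) for v in smoothed])
```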

Computer Inspection

  • Automated programs detect noise
  • Uses rules and algorithms

Used when: Large datasets

Human Inspection

  • Experts manually examine data
  • Time-consuming but accurate

Used when: Critical datasets

Comparison Table (Quick Revision)

  • Binning: smooth noisy data
  • Clustering: identify outliers
  • Regression: predict and smooth values
  • Computer Inspection: automated cleaning
  • Human Inspection: manual verification

Exam-Friendly Summary

  • Data Mining extracts useful knowledge
  • Motivated by huge data growth
  • Data preprocessing is essential
  • Data cleaning improves accuracy
  • Missing values and noisy data must be handled properly
  • Binning, clustering, and regression are key noise-handling techniques

Inconsistent Data

Inconsistent data occurs when the same data item has different values or formats in different places.

Examples

  • Gender stored as M/F in one table and Male/Female in another
  • Date formats: DD-MM-YYYY vs MM-DD-YYYY
  • Sales total ≠ sum of item-wise sales

Causes

  • Multiple data sources: different systems follow different rules
  • Data entry errors: manual mistakes
  • Update anomalies: partial updates leave conflicting copies
  • Different standards: units, codes, and formats vary

Handling Inconsistent Data

  • Standardization: use common formats and codes
  • Constraint checking: apply domain rules
  • Data validation: verify values against reference data
  • Data reconciliation: resolve conflicts using rules
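Standardization can be as simple as a mapping table that rewrites every variant to one canonical code. This sketch uses the M/F vs. Male/Female example from above; the `GENDER_MAP` name and record layout are illustrative assumptions:

```python
# Hypothetical mapping table: unify gender codes seen in two different sources.
GENDER_MAP = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}

def standardize_gender(records):
    """Rewrite each record's gender field to a single canonical code."""
    return [{**r, "gender": GENDER_MAP[r["gender"]]} for r in records]

rows = [{"id": 1, "gender": "M"}, {"id": 2, "gender": "Female"}]
print(standardize_gender(rows))  # every record now uses Male/Female
```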

Data Integration

Data Integration combines data from multiple heterogeneous sources into a single, unified dataset.

Issues in Data Integration

  • Schema integration: different attribute names and types across sources
  • Redundancy: duplicate attributes or records
  • Entity identification: the same real-world entity appears with different IDs
  • Value conflicts: different values for the same attribute

Solutions

  • Metadata management
  • Data matching & deduplication
  • Conflict resolution rules (source priority)

Data Transformation

Data Transformation converts data into a suitable format for mining.

Common Transformation Techniques

  • Normalization: scale values to a range (e.g., 0–1)
  • Aggregation: summarize data (daily → monthly)
  • Attribute construction: create new attributes from existing ones
  • Encoding: convert categorical values to numeric codes
  • Smoothing: remove noise
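Min-max normalization, the scaling technique listed above, maps values linearly into a target range. A minimal sketch (the sample values are illustrative):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

print(min_max_normalize([20, 30, 40, 50]))  # smallest value maps to 0.0, largest to 1.0
```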

Data Reduction (Overview)

Data Reduction reduces data size without losing important information, improving speed and efficiency.

Data Cube Aggregation

Data Cube Aggregation summarizes data across dimensions.

Example

  • Sales by Day → Month → Year

  • Low level: daily sales
  • Medium level: monthly sales
  • High level: yearly sales

Benefit: Faster OLAP queries
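The Day → Month roll-up above can be sketched as a simple aggregation, assuming dates are stored as 'YYYY-MM-DD' strings (the sales records are illustrative):

```python
from collections import defaultdict

def roll_up(daily_sales):
    """Aggregate (date, amount) records from the day level up to the month level."""
    monthly = defaultdict(float)
    for date, amount in daily_sales:
        monthly[date[:7]] += amount  # 'YYYY-MM-DD' -> 'YYYY-MM'
    return dict(monthly)

sales = [("2024-01-05", 100), ("2024-01-20", 150), ("2024-02-03", 90)]
print(roll_up(sales))  # {'2024-01': 250.0, '2024-02': 90.0}
```

Rolling up again from 'YYYY-MM' to 'YYYY' gives the yearly level of the cube.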

Dimensionality Reduction

Reduces the number of attributes (features).

Techniques

  • Attribute selection: remove irrelevant attributes
  • PCA (Principal Component Analysis): combine correlated attributes into fewer components
  • Feature extraction: create compact features

Benefit: Less storage, faster mining, less noise

Data Compression

Data Compression stores data in compressed form.

  • Lossless: no data loss (e.g., Run-Length Encoding)
  • Lossy: some loss allowed (rare in mining)

Benefit: Saves storage, faster I/O
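Run-Length Encoding, the lossless example above, stores each run of repeated symbols as a (symbol, count) pair. A minimal sketch:

```python
def rle_encode(s):
    """Run-length encode a string into (char, count) pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)  # extend the current run
        else:
            out.append((ch, 1))             # start a new run
    return out

def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * n for ch, n in pairs)

encoded = rle_encode("AAAABBBCC")
print(encoded)  # [('A', 4), ('B', 3), ('C', 2)]
assert rle_decode(encoded) == "AAAABBBCC"  # lossless: the round trip is exact
```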

Numerosity Reduction

Represents data using smaller models instead of raw data.

Methods

  • Parametric: fit a model such as regression and store only its parameters
  • Non-parametric: histograms, clustering
  • Sampling: random or stratified samples

Benefit: Efficient for very large datasets
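Stratified sampling, listed above, draws the same fraction from every group so that small groups stay represented. A minimal sketch (the `dept` attribute and the 50% fraction are illustrative assumptions):

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=42):
    """Draw the same fraction from every stratum (group) of the data."""
    random.seed(seed)  # fixed seed for a reproducible sample
    strata = defaultdict(list)
    for r in records:
        strata[r[key]].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one record per stratum
        sample.extend(random.sample(group, k))
    return sample

data = [{"dept": "A"}] * 8 + [{"dept": "B"}] * 2
print(len(stratified_sample(data, "dept", 0.5)))  # 5: four from A, one from B
```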

Discretization

Discretization converts continuous values into intervals.

Example: Age → {0–18, 19–35, 36–60, 60+}

Methods

  • Equal-width: every interval has the same size
  • Equal-frequency: every bin holds the same number of values
  • Entropy-based: chooses split points using information gain

Benefit: Simplifies mining and improves interpretability
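Equal-width discretization can be sketched as follows: the range of the data is split into intervals of identical width, and each value is assigned the index of its interval (the age values and the choice of three bins are illustrative):

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-indexed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        labels.append(idx)
    return labels

ages = [3, 17, 25, 44, 60, 75]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 1, 2, 2]
```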

Concept Hierarchy Generation

A Concept Hierarchy organizes data from low-level detail to high-level abstraction.

Example

City → State → Country
Day → Month → Year

Generation Methods

  • Schema-based: defined by the database schema
  • Set-grouping: group similar values
  • Rule-based: uses business rules

Used in: Roll-up and Drill-down operations

Decision Tree

A Decision Tree is a classification and prediction model represented as a tree.

Structure

  • Root node: the first decision
  • Internal node: a test on an attribute
  • Leaf node: a class label

How It Works

  • Select best attribute (Information Gain / Gini Index)
  • Split data recursively
  • Stop when nodes are pure or criteria met
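Attribute selection with the Gini index, mentioned above, picks the attribute whose split produces the purest child nodes. A minimal sketch of the impurity calculation (the small weather-style dataset and its attribute names are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(rows, attr, target):
    """Weighted Gini impurity after splitting rows on attr; lower means a better split."""
    n = len(rows)
    total = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        total += len(subset) / n * gini(subset)
    return total

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain",  "play": "yes"},
    {"outlook": "rain",  "play": "yes"},
]
print(gini_of_split(rows, "outlook", "play"))  # 0.0: a perfect split
```

A tree builder computes this for every candidate attribute, splits on the one with the lowest weighted impurity, and recurses on each child.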

Advantages

  • Easy to understand: visual and rule-based
  • Fast: simple computations
  • Minimal preprocessing: handles mixed data types

Disadvantages

  • Overfitting: too many splits
  • Instability: a small change in the data can change the whole tree

Quick Revision Table (Exam-Ready)

  • Inconsistent Data: conflicting values or formats
  • Data Integration: combine multiple sources
  • Data Transformation: convert data to a mining-ready form
  • Data Cube Aggregation: summarize multidimensional data
  • Dimensionality Reduction: reduce the number of attributes
  • Data Compression: reduce storage
  • Numerosity Reduction: model-based size reduction
  • Discretization: continuous → discrete
  • Concept Hierarchy: detail → abstraction
  • Decision Tree: classification model

Exam-Friendly Conclusion

  • Clean, integrated, and reduced data is essential for accurate mining
  • Data reduction improves performance without losing meaning
  • Discretization and concept hierarchies simplify analysis
  • Decision trees are powerful and interpretable classification tools