Introduction to Compiler



Introduction to Compiler

A compiler is a system software that translates a program written in a high-level language (HLL) such as C, C++, or Java into machine-level or intermediate code that a computer can execute.

Key Functions of a Compiler

  • Translates source code into target code
  • Detects and reports errors
  • Optimizes the code for better performance

Compiler vs Interpreter

FeatureCompilerInterpreter
TranslationEntire program at onceLine-by-line
SpeedFaster executionSlower
Error reportingAfter compilationImmediate
ExampleC, C++Python, JavaScript

Phases of a Compiler

A compiler works in several sequential phases, each performing a specific task.


Compiler Phases

PhaseDescriptionOutput
Lexical AnalysisConverts characters into tokensTokens
Syntax AnalysisChecks grammar using parse treesParse Tree
Semantic AnalysisChecks meaning, types, declarationsAnnotated Tree
Intermediate Code GenerationProduces intermediate representationIR Code
Code OptimizationImproves performanceOptimized IR
Code GenerationProduces machine codeTarget Code

Passes of a Compiler

A pass refers to one complete traversal of the source code.

Type of CompilerDescription
Single-pass CompilerCode scanned once (fast, less optimization)
Multi-pass CompilerCode scanned multiple times (better optimization)

Example:
Most modern compilers like GCC are multi-pass compilers.

Bootstrapping of Compiler

Bootstrapping is the process of developing a compiler using the same language it compiles.

Steps in Bootstrapping

StepDescription
Step 1Write compiler in an existing language
Step 2Compile it using old compiler
Step 3Use newly generated compiler
Step 4Achieve self-hosting compiler

Advantages

  • Improves efficiency
  • Language becomes self-sustaining

Finite State Machines (FSM)

A Finite State Machine is a mathematical model used to recognize patterns.

Components of FSM

  • States
  • Input alphabet
  • Transition function
  • Start state
  • Final (accepting) state

FSM in Lexical Analysis

FSMs are used to identify tokens such as:

  • Identifiers
  • Keywords
  • Numbers

Regular Expressions (RE)

A Regular Expression defines a set of strings using specific rules.

Common Regular Expressions

PatternMeaning
a*Zero or more a’s
a+bOne or more a’s followed by b
[a-z]Lowercase letters
[0-9]+Numbers

Application in Lexical Analysis

  • Tokens are defined using regular expressions
  • Converted into NFA → DFA

Optimization of DFA-Based Pattern Matchers

Why DFA Optimization?

  • Reduces number of states
  • Improves execution speed
  • Saves memory

DFA Minimization Steps

StepDescription
Remove unreachable statesDelete unused states
Merge equivalent statesCombine similar behavior states
Create minimized DFAEfficient automaton

Implementation of Lexical Analyzer

A Lexical Analyzer (Scanner) reads the source code and converts it into tokens.

Functions

  • Removes white spaces
  • Recognizes lexemes
  • Produces tokens
  • Reports lexical errors

Token Structure

FieldDescription
Token NameIdentifier, Keyword
LexemeActual string
AttributePointer to symbol table

Lexical Analyzer Generator

A Lexical Analyzer Generator automatically generates a scanner from token definitions.

Examples

  • LEX
  • Flex

Benefits

FeatureAdvantage
Automatic generationSaves time
DFA-basedFast scanning
Error handlingBuilt-in

LEX Compiler (LEX Tool)

LEX is a lexical analyzer generator used with YACC.

Structure of LEX Program

SectionDescription
DefinitionHeader files, variables
RulesRegular expressions + actions
User CodeC functions

Example (Conceptual)

RegexToken
[a-zA-Z]+Identifier
[0-9]+Number

Working of LEX

  • User writes LEX specification
  • LEX generates lex.yy.c
  • Compiled using C compiler
  • Produces lexical analyzer

Exam-Oriented Summary (Quick Revision)

TopicKey Points
CompilerTranslates HLL to machine code
PhasesLexical → Syntax → Semantic → Code
PassesSingle or Multiple scans
BootstrappingCompiler written in its own language
FSM & REUsed to define tokens
DFA OptimizationImproves speed & memory
Lexical AnalyzerConverts input to tokens
LEXTool to generate lexical analyzer

Formal Grammars and Their Application to Syntax Analysis

A formal grammar is a mathematical model used to define the syntax (structure) of a programming language. It specifies how valid statements are formed.

Why Formal Grammars Are Needed

  • Define legal program structures
  • Help in syntax analysis (parsing)
  • Used by compiler design tools (YACC, ANTLR)

Application in Syntax Analysis

  • Grammar rules are used to check whether a sequence of tokens follows language rules
  • Parser constructs parse trees using grammar productions


Components of a Formal Grammar

A grammar G is defined as:

G = (V, T, P, S)

SymbolMeaning
VSet of non-terminals
TSet of terminals
PProduction rules
SStart symbol

BNF (Backus–Naur Form) Notation

BNF is a formal notation used to express grammar rules of programming languages.

Structure of BNF Rule

<non-terminal> ::= expression

Example (Arithmetic Expression)

<expr> ::= <expr> + <term> | <term> <term> ::= <term> * <factor> | <factor> <factor> ::= ( <expr> ) | id

BNF Symbols

SymbolMeaning
::=Definition
``
< >Non-terminal

Advantages of BNF

  • Simple and precise
  • Language independent
  • Used in language specifications

Ambiguity in Grammar

A grammar is said to be ambiguous if one string can have more than one parse tree.

Example of Ambiguous Grammar

EE + E | E * E | id

Expression: id + id * id

  • Can be evaluated in two different ways
  • Leads to confusion in parsing

Problems Caused by Ambiguity

  • Incorrect interpretation
  • Compiler confusion
  • Logical errors

Removing Ambiguity

  • Rewrite grammar with operator precedence
  • Use separate non-terminals

Context-Free Grammars (CFG)

A Context-Free Grammar is a grammar where each production has a single non-terminal on the left side.

General Form

A → α

Where:

  • A is a non-terminal
  • α is a string of terminals and non-terminals

Why CFG Is Important

  • Used to define syntax of programming languages
  • Suitable for nested structures
  • Used by most parsers

Derivation in CFG

Derivation shows how a string is generated from the start symbol.

Types of Derivation

TypeDescription
Leftmost DerivationLeftmost non-terminal replaced first
Rightmost DerivationRightmost non-terminal replaced first

Example

Grammar:

S → aSb | ab

Leftmost Derivation:

S ⇒ aSb ⇒ aab b

Parse Trees

A parse tree is a hierarchical representation of derivation.

Features

  • Root = Start symbol
  • Internal nodes = Non-terminals
  • Leaf nodes = Terminals

Importance

  • Visual understanding of syntax
  • Used in syntax-directed translation

Capabilities of Context-Free Grammars

CFGs are powerful enough to describe most programming constructs.

CFG Can Represent

FeatureExample
Nested expressions(a+b)*(c+d)
Conditional statementsif–else
Loop structureswhile, for
Arithmetic expressionsprecedence & associativity

CFG Limitations

  • Cannot check variable declaration before use
  • Cannot enforce type matching

YACC (Yet Another Compiler Compiler)

YACC is a parser generator that works with LEX.

Purpose

  • Generates syntax analyzer
  • Uses CFG rules as input

Structure of YACC Program

SectionDescription
DeclarationsTokens, precedence
Grammar RulesCFG productions
User CodeSupporting C code

Working of YACC

  • Grammar rules written in YACC
  • Generates y.tab.c
  • Compiled with LEX output
  • Produces parser

Syntax Specification of Programming Languages

Programming languages use CFG-based syntax specifications.

Example (If Statement)

if_stmt → if ( expr ) stmt

Why CFG for Syntax Specification

  • Clear structure
  • Machine readable
  • Standardized

Difference Between Lexical & Syntax Analysis

FeatureLexical AnalysisSyntax Analysis
InputCharactersTokens
OutputTokensParse tree
ToolLEXYACC
GrammarRegular ExpressionsCFG

Exam-Oriented Summary Table

TopicKey Points
Formal GrammarDefines language structure
BNFGrammar notation
AmbiguityMultiple parse trees
CFGCore of syntax analysis
DerivationString generation
Parse TreeHierarchical syntax
YACCParser generator