Introduction to Compiler
Introduction to Compiler
A compiler is a system software that translates a program written in a high-level language (HLL) such as C, C++, or Java into machine-level or intermediate code that a computer can execute.
Key Functions of a Compiler
- Translates source code into target code
- Detects and reports errors
- Optimizes the code for better performance
Compiler vs Interpreter
| Feature | Compiler | Interpreter |
|---|---|---|
| Translation | Entire program at once | Line-by-line |
| Speed | Faster execution | Slower |
| Error reporting | After compilation | Immediate |
| Example | C, C++ | Python, JavaScript |
Phases of a Compiler
A compiler works in several sequential phases, each performing a specific task.
Compiler Phases
| Phase | Description | Output |
|---|---|---|
| Lexical Analysis | Converts characters into tokens | Tokens |
| Syntax Analysis | Checks grammar using parse trees | Parse Tree |
| Semantic Analysis | Checks meaning, types, declarations | Annotated Tree |
| Intermediate Code Generation | Produces intermediate representation | IR Code |
| Code Optimization | Improves performance | Optimized IR |
| Code Generation | Produces machine code | Target Code |
Passes of a Compiler
A pass refers to one complete traversal of the source code.
| Type of Compiler | Description |
|---|---|
| Single-pass Compiler | Code scanned once (fast, less optimization) |
| Multi-pass Compiler | Code scanned multiple times (better optimization) |
Example:
Most modern compilers like GCC are multi-pass compilers.
Bootstrapping of Compiler
Bootstrapping is the process of developing a compiler using the same language it compiles.
Steps in Bootstrapping
| Step | Description |
|---|---|
| Step 1 | Write compiler in an existing language |
| Step 2 | Compile it using old compiler |
| Step 3 | Use newly generated compiler |
| Step 4 | Achieve self-hosting compiler |
Advantages
- Improves efficiency
- Language becomes self-sustaining
Finite State Machines (FSM)
A Finite State Machine is a mathematical model used to recognize patterns.
Components of FSM
- States
- Input alphabet
- Transition function
- Start state
- Final (accepting) state
FSM in Lexical Analysis
FSMs are used to identify tokens such as:
- Identifiers
- Keywords
- Numbers
Regular Expressions (RE)
A Regular Expression defines a set of strings using specific rules.
Common Regular Expressions
| Pattern | Meaning |
|---|---|
a* | Zero or more a’s |
a+b | One or more a’s followed by b |
[a-z] | Lowercase letters |
[0-9]+ | Numbers |
Application in Lexical Analysis
- Tokens are defined using regular expressions
- Converted into NFA → DFA
Optimization of DFA-Based Pattern Matchers
Why DFA Optimization?
- Reduces number of states
- Improves execution speed
- Saves memory
DFA Minimization Steps
| Step | Description |
|---|---|
| Remove unreachable states | Delete unused states |
| Merge equivalent states | Combine similar behavior states |
| Create minimized DFA | Efficient automaton |
Implementation of Lexical Analyzer
A Lexical Analyzer (Scanner) reads the source code and converts it into tokens.
Functions
- Removes white spaces
- Recognizes lexemes
- Produces tokens
- Reports lexical errors
Token Structure
| Field | Description |
|---|---|
| Token Name | Identifier, Keyword |
| Lexeme | Actual string |
| Attribute | Pointer to symbol table |
Lexical Analyzer Generator
A Lexical Analyzer Generator automatically generates a scanner from token definitions.
Examples
- LEX
- Flex
Benefits
| Feature | Advantage |
|---|---|
| Automatic generation | Saves time |
| DFA-based | Fast scanning |
| Error handling | Built-in |
LEX Compiler (LEX Tool)
LEX is a lexical analyzer generator used with YACC.
Structure of LEX Program
| Section | Description |
|---|---|
| Definition | Header files, variables |
| Rules | Regular expressions + actions |
| User Code | C functions |
Example (Conceptual)
| Regex | Token |
|---|---|
[a-zA-Z]+ | Identifier |
[0-9]+ | Number |
Working of LEX
- User writes LEX specification
- LEX generates
lex.yy.c - Compiled using C compiler
- Produces lexical analyzer
Exam-Oriented Summary (Quick Revision)
| Topic | Key Points |
|---|---|
| Compiler | Translates HLL to machine code |
| Phases | Lexical → Syntax → Semantic → Code |
| Passes | Single or Multiple scans |
| Bootstrapping | Compiler written in its own language |
| FSM & RE | Used to define tokens |
| DFA Optimization | Improves speed & memory |
| Lexical Analyzer | Converts input to tokens |
| LEX | Tool to generate lexical analyzer |
Formal Grammars and Their Application to Syntax Analysis
A formal grammar is a mathematical model used to define the syntax (structure) of a programming language. It specifies how valid statements are formed.
Why Formal Grammars Are Needed
- Define legal program structures
- Help in syntax analysis (parsing)
- Used by compiler design tools (YACC, ANTLR)
Application in Syntax Analysis
- Grammar rules are used to check whether a sequence of tokens follows language rules
- Parser constructs parse trees using grammar productions
Components of a Formal Grammar
A grammar G is defined as:
G = (V, T, P, S)
| Symbol | Meaning |
|---|---|
| V | Set of non-terminals |
| T | Set of terminals |
| P | Production rules |
| S | Start symbol |
BNF (Backus–Naur Form) Notation
BNF is a formal notation used to express grammar rules of programming languages.
Structure of BNF Rule
Example (Arithmetic Expression)
BNF Symbols
| Symbol | Meaning |
|---|---|
::= | Definition |
| ` | ` |
< > | Non-terminal |
Advantages of BNF
- Simple and precise
- Language independent
- Used in language specifications
Ambiguity in Grammar
A grammar is said to be ambiguous if one string can have more than one parse tree.
Example of Ambiguous Grammar
Expression: id + id * id
- Can be evaluated in two different ways
- Leads to confusion in parsing
Problems Caused by Ambiguity
- Incorrect interpretation
- Compiler confusion
- Logical errors
Removing Ambiguity
- Rewrite grammar with operator precedence
- Use separate non-terminals
Context-Free Grammars (CFG)
A Context-Free Grammar is a grammar where each production has a single non-terminal on the left side.
General Form
Where:
Ais a non-terminalαis a string of terminals and non-terminals
Why CFG Is Important
- Used to define syntax of programming languages
- Suitable for nested structures
- Used by most parsers
Derivation in CFG
Derivation shows how a string is generated from the start symbol.
Types of Derivation
| Type | Description |
|---|---|
| Leftmost Derivation | Leftmost non-terminal replaced first |
| Rightmost Derivation | Rightmost non-terminal replaced first |
Example
Grammar:
Leftmost Derivation:
Parse Trees
A parse tree is a hierarchical representation of derivation.
Features
- Root = Start symbol
- Internal nodes = Non-terminals
- Leaf nodes = Terminals
Importance
- Visual understanding of syntax
- Used in syntax-directed translation
Capabilities of Context-Free Grammars
CFGs are powerful enough to describe most programming constructs.
CFG Can Represent
| Feature | Example |
|---|---|
| Nested expressions | (a+b)*(c+d) |
| Conditional statements | if–else |
| Loop structures | while, for |
| Arithmetic expressions | precedence & associativity |
CFG Limitations
- Cannot check variable declaration before use
- Cannot enforce type matching
YACC (Yet Another Compiler Compiler)
YACC is a parser generator that works with LEX.
Purpose
- Generates syntax analyzer
- Uses CFG rules as input
Structure of YACC Program
| Section | Description |
|---|---|
| Declarations | Tokens, precedence |
| Grammar Rules | CFG productions |
| User Code | Supporting C code |
Working of YACC
- Grammar rules written in YACC
- Generates
y.tab.c - Compiled with LEX output
- Produces parser
Syntax Specification of Programming Languages
Programming languages use CFG-based syntax specifications.
Example (If Statement)
Why CFG for Syntax Specification
- Clear structure
- Machine readable
- Standardized
Difference Between Lexical & Syntax Analysis
| Feature | Lexical Analysis | Syntax Analysis |
|---|---|---|
| Input | Characters | Tokens |
| Output | Tokens | Parse tree |
| Tool | LEX | YACC |
| Grammar | Regular Expressions | CFG |
Exam-Oriented Summary Table
| Topic | Key Points |
|---|---|
| Formal Grammar | Defines language structure |
| BNF | Grammar notation |
| Ambiguity | Multiple parse trees |
| CFG | Core of syntax analysis |
| Derivation | String generation |
| Parse Tree | Hierarchical syntax |
| YACC | Parser generator |