Okay, so you have written some cool program in a language and now it needs to run. Overall, a compiler “looks” at the code, generates some “intermediate representation”, and then converts that into code for a target machine.
Overview: Program → Frontend → IR → Backend → Run
Lexical Analysis
Very cool word. A lexer takes in a stream of characters from the program and generates tokens. Example: let a = 10; would become: let, a, =, 10, ; after it passes through the lexer.
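For a rough feel, here is a minimal lexer sketch in Python; the tokenize function, the token kinds, and the regexes are all made up just for illustration, not any real compiler's API:

```python
import re

# Each pattern maps to a token kind; order matters (keywords before identifiers).
TOKEN_SPEC = [
    ("KEYWORD", r"\blet\b"),
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("EQUALS", r"="),
    ("SEMI", r";"),
    ("SKIP", r"\s+"),  # whitespace is dropped, not turned into a token
]

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        for kind, pattern in TOKEN_SPEC:
            match = re.match(pattern, source[pos:])
            if match:
                if kind != "SKIP":
                    tokens.append((kind, match.group()))
                pos += match.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
    return tokens

print(tokenize("let a = 10;"))
# [('KEYWORD', 'let'), ('IDENT', 'a'), ('EQUALS', '='), ('NUMBER', '10'), ('SEMI', ';')]
```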
Parsing
After lexing, we want to make sure that the tokens that were generated are valid and part of a “grammar”. A parser takes in this stream of tokens and generates a tree that mirrors the rules of the language’s grammar.
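Here is a tiny recursive-descent-style sketch for just that one let statement, assuming the token list produced by the lexer above; the grammar rule and the node shape are made up for illustration:

```python
def parse_let(tokens):
    """Grammar (made up): let_stmt -> 'let' IDENT '=' NUMBER ';'"""
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        value = tokens[pos][1]
        pos += 1
        return value

    expect("KEYWORD")
    name = expect("IDENT")
    expect("EQUALS")
    value = expect("NUMBER")
    expect("SEMI")
    # The resulting tree mirrors the grammar rule above.
    return {"node": "LetStatement", "name": name,
            "value": {"node": "NumberLiteral", "value": int(value)}}

tokens = [("KEYWORD", "let"), ("IDENT", "a"), ("EQUALS", "="),
          ("NUMBER", "10"), ("SEMI", ";")]
print(parse_let(tokens))
# {'node': 'LetStatement', 'name': 'a', 'value': {'node': 'NumberLiteral', 'value': 10}}
```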
Static Analysis
After we know the syntactic structure of the code and that it is valid, we need to figure out what it actually means.
First step: Binding. If c = a + b is valid syntax, then what is c? What about a and b? Are they local or global? What are their data types? (Probably won’t worry about type checking currently.)
Now, for each of these “identifiers”, we find where it is defined and map each use to its definition.
Now, we need someplace to “store” all this analysis that we did. We can use:
- Just use the syntax tree, storing the information as attributes (extra fields on the nodes)
- Use a symbol table: the keys are the identifiers and the values are what those identifiers refer to
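A rough sketch of the symbol table option, assuming simple nested scopes; the class and its methods are my own illustration, not any particular compiler's API:

```python
class SymbolTable:
    """Maps identifiers to whatever we learned about them, with nested scopes."""

    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def define(self, name, info):
        self.symbols[name] = info

    def resolve(self, name):
        # Walk outward from the innermost scope until the name is found.
        if name in self.symbols:
            return self.symbols[name]
        if self.parent is not None:
            return self.parent.resolve(name)
        raise NameError(f"undefined identifier: {name}")

globals_scope = SymbolTable()
globals_scope.define("a", {"kind": "global", "type": "int"})
locals_scope = SymbolTable(parent=globals_scope)
locals_scope.define("b", {"kind": "local", "type": "int"})

# Binding c = a + b: look up where a and b come from.
print(locals_scope.resolve("a"))  # found in the enclosing (global) scope
print(locals_scope.resolve("b"))  # found in the local scope
```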
This was the frontend.
Intermediate Representations
The frontend organizes everything we need to understand the code, and the backend is concerned with the target the code runs on; the IR bridges the two, acting as an interface between them.
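One common flavor of IR is three-address code, where every instruction does at most one operation and writes one destination. Here is a made-up lowering pass from a small expression tree into that form; the node shapes and tuple format are just for illustration:

```python
# Lower a nested expression tree into three-address-code tuples.
def lower(node, instructions, counter):
    if node["node"] == "NumberLiteral":
        return node["value"]
    if node["node"] == "Variable":
        return node["name"]
    # Binary operation: lower both operands first, then emit one instruction
    # that stores the result in a fresh temporary.
    left = lower(node["left"], instructions, counter)
    right = lower(node["right"], instructions, counter)
    counter[0] += 1
    temp = f"t{counter[0]}"
    instructions.append((node["op"], temp, left, right))
    return temp

# c = a + b * 2
tree = {"node": "binary", "op": "add",
        "left": {"node": "Variable", "name": "a"},
        "right": {"node": "binary", "op": "mul",
                  "left": {"node": "Variable", "name": "b"},
                  "right": {"node": "NumberLiteral", "value": 2}}}

instructions = []
result = lower(tree, instructions, counter=[0])
instructions.append(("copy", "c", result, None))
for ins in instructions:
    print(ins)
# ('mul', 't1', 'b', 2)
# ('add', 't2', 'a', 't1')
# ('copy', 'c', 't2', None)
```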
Optimizations
Once we understand the program, we can keep the same “meaning” but rewrite it to be more efficient.
Some examples:
- Constant Folding: evaluate constant expressions at compile time (a minimal folding pass is sketched right after this list).
  a = 3 * 5 * 6 * 7 + 4 - 2
  can be rewritten as
  a = 632
- Constant Propagation: replace variables with known constant values.
  x = 5
  y = x + 3
  can be rewritten as
  y = 5 + 3
- Common Subexpression Elimination (CSE): eliminate redundant subexpressions.
  a = b * c + d
  x = b * c + e
  can be rewritten as
  t = b * c
  a = t + d
  x = t + e
- Loop-Invariant Code Motion: move computations that don’t change across iterations out of the loop.
  for i = 0..n:
      x = a * b
      a[i] = something
  can be rewritten as
  x = a * b
  for i = 0..n:
      a[i] = something
- Global Value Numbering (catches redundancy that CSE might not): identifies expressions guaranteed to compute the same value.
  w = 3
  x = 3
  y = x + 4
  z = w + 4
  can be rewritten as
  w = 3
  x = w
  y = w + 4
  z = y
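As a rough sketch of what one of these passes actually does, here is a tiny constant-folding pass over a nested expression tree; the node format and the helpers are made up for illustration:

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def fold(node):
    """Recursively replace operations on constant operands with their results."""
    if node["node"] != "binary":
        return node
    left = fold(node["left"])
    right = fold(node["right"])
    if left["node"] == "NumberLiteral" and right["node"] == "NumberLiteral":
        value = OPS[node["op"]](left["value"], right["value"])
        return {"node": "NumberLiteral", "value": value}
    return {"node": "binary", "op": node["op"], "left": left, "right": right}

# Helpers to build the tree for: a = 3 * 5 * 6 * 7 + 4 - 2
def num(n):
    return {"node": "NumberLiteral", "value": n}

def binop(op, left, right):
    return {"node": "binary", "op": op, "left": left, "right": right}

product = binop("mul", binop("mul", binop("mul", num(3), num(5)), num(6)), num(7))
tree = binop("sub", binop("add", product, num(4)), num(2))
print(fold(tree))  # {'node': 'NumberLiteral', 'value': 632}
```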
Code Generation
After everything is done and we have optimized code, we need to convert it into a form that the machine can actually run.
We can generate code to run on a native machine, but then the compiler is tied to a particular architecture.
Instead of instructions for some real chip, we can produce code for a hypothetical machine, generally called bytecode since each instruction is typically a single byte long.
The basic principle here is that the farther down the pipeline you push the architecture-specific work, the more of the earlier phases you can share across architectures.
Running it: If we compiled it to machine code, we simply tell the operating system to load the executable. If we compiled it to bytecode, we need to start up the VM and load the program into that.
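As a toy sketch of that second path, here is a tiny stack-based VM with a made-up instruction set, running hand-assembled bytecode for (3 + 4) * 2:

```python
# Made-up opcodes for a tiny stack machine.
PUSH, ADD, MUL, PRINT, HALT = range(5)

# Hand-assembled "bytecode" for: print((3 + 4) * 2)
program = [PUSH, 3, PUSH, 4, ADD, PUSH, 2, MUL, PRINT, HALT]

def run(code):
    stack = []
    pc = 0  # program counter
    while True:
        op = code[pc]
        pc += 1
        if op == PUSH:
            stack.append(code[pc])  # operand follows the opcode
            pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == PRINT:
            print(stack.pop())
        elif op == HALT:
            return

run(program)  # prints 14
```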