Okay, so you have written some cool program in some language and now it needs to run. Overall, a compiler “looks” at the code, generates an “intermediate representation” (IR), and then converts that into code for a target machine.

Overview: Program → Frontend → IR → Backend → Run

Lexical Analysis

Very cool word. A lexer takes in a stream of characters from the program and generates tokens. Example: let a = 10; would become let, a, =, 10, ; after it passes through the lexer.
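
As a rough sketch of the idea (the token spec and the tokenize function below are made up just for illustration, not taken from any real compiler), a tiny lexer for statements like let a = 10; could be built around a handful of regexes:

    import re

    # Token kinds and the regexes that match them, tried in order
    # (keywords before identifiers so "let" is not lexed as a name).
    TOKEN_SPEC = [
        ("LET",    r"let\b"),
        ("NUMBER", r"\d+"),
        ("IDENT",  r"[A-Za-z_]\w*"),
        ("EQUALS", r"="),
        ("SEMI",   r";"),
        ("SKIP",   r"\s+"),   # whitespace is matched but thrown away
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def tokenize(source):
        """Turn a stream of characters into a list of (kind, text) tokens."""
        tokens = []
        for match in MASTER.finditer(source):
            if match.lastgroup != "SKIP":
                tokens.append((match.lastgroup, match.group()))
        return tokens

    print(tokenize("let a = 10;"))
    # [('LET', 'let'), ('IDENT', 'a'), ('EQUALS', '='), ('NUMBER', '10'), ('SEMI', ';')]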

Parsing

After lexing, we want to make sure that the tokens we generated are valid and form part of a “grammar”. A parser takes in this stream of tokens and generates a tree that mirrors the rules of the language’s grammar.
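
A hedged continuation of the same sketch (parse_let and the tuple-shaped tree node are inventions for this example, and the only rule in the “grammar” here is the let statement): a recursive-descent parser walks the token stream and builds the tree.

    def parse_let(tokens):
        """Parse LET IDENT EQUALS NUMBER SEMI into a ('let', name, value) node."""
        def expect(kind):
            tok_kind, text = tokens.pop(0)
            if tok_kind != kind:
                raise SyntaxError(f"expected {kind}, got {tok_kind}")
            return text

        expect("LET")
        name = expect("IDENT")
        expect("EQUALS")
        value = int(expect("NUMBER"))
        expect("SEMI")
        return ("let", name, value)   # a tiny syntax-tree node

    tokens = [("LET", "let"), ("IDENT", "a"), ("EQUALS", "="),
              ("NUMBER", "10"), ("SEMI", ";")]
    print(parse_let(tokens))   # ('let', 'a', 10)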

Static Analysis

At this point we know the syntactic structure of the code and that it is valid, but not much about what it actually means.

First step: binding. If c = a + b is valid syntax, then what is c? What are a and b? Are they local or global? What are their data types? (probably won’t worry about type checking currently)

Now, for each of these “identifiers”, we find where it is defined and map each use to its definition.

Now, we need someplace to “store” all this analysis that we did. We can use:

  • Just use the syntax tree, storing the results as extra data on its nodes (called attributes)
  • Use a symbol table, where the keys are identifiers and the values are what those identifiers refer to (a minimal sketch of this option follows the list)
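
A minimal sketch of the symbol-table option (the Scope class and its method names are assumptions made up here), with nested scopes so a lookup in a local scope can fall back to an enclosing one:

    class Scope:
        """Maps identifier names to whatever the analysis learned about them."""
        def __init__(self, parent=None):
            self.parent = parent
            self.symbols = {}

        def define(self, name, info):
            self.symbols[name] = info

        def resolve(self, name):
            # Walk outward from the innermost scope until the name is found.
            scope = self
            while scope is not None:
                if name in scope.symbols:
                    return scope.symbols[name]
                scope = scope.parent
            raise NameError(f"undefined identifier: {name}")

    global_scope = Scope()
    global_scope.define("a", {"kind": "global", "type": "int"})
    local_scope = Scope(parent=global_scope)
    local_scope.define("b", {"kind": "local", "type": "int"})
    print(local_scope.resolve("a"))   # falls back to the global scope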

This was the frontend.

Intermediate Representations

The frontend organizes all the things we need to know in order to run the code, and the backend is concerned with the target on which the code runs; the IR bridges the two and acts as an interface between them.
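
To make that concrete, here is a hedged illustration (the instruction names and tuple layout are invented, not from any real compiler) of lowering a statement like a = b + c * d into a flat, three-address-code style IR:

    # Each IR instruction is a simple tuple; t1 and t2 are compiler-made temporaries.
    ir = [
        ("mul",   "t1", "c", "d"),    # t1 = c * d
        ("add",   "t2", "b", "t1"),   # t2 = b + t1
        ("store", "a",  "t2"),        # a  = t2
    ]
    for instruction in ir:
        print(instruction)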

Optimizations

Once we understand the program, we can keep the same “meaning” but rewrite it to run more efficiently.

Some examples:

  • Constant Folding
    Evaluate constant expressions at compile time (a tiny folding pass is sketched after this list).
    Example:
    a = 3 * 5 * 6 * 7 + 4 - 2
    can be rewritten as
    a = 632

  • Constant Propagation
    Replace variables with known constant values.
    Example:
    x = 5
    y = x + 3
    can be rewritten as
    y = 5 + 3

  • Common Subexpression Elimination (CSE)
    Eliminate redundant subexpressions.
    Example:
    a = b * c + d
    x = b * c + e
    can be rewritten as
    t = b * c
    a = t + d
    x = t + e

  • Loop-Invariant Code Motion
    Move computations that do not change between iterations out of the loop.
    Example:
    for i = 0..n:
        x = a * b
        do something with a[i] and x
    can be rewritten as
    x = a * b
    for i = 0..n:
        do something with a[i] and x

  • Global Value Numbering (GVN)
    Identifies equivalent expressions, eliminating redundancy that CSE might miss.
    Example:
    w = 3
    x = 3
    y = x + 4
    z = w + 4
    can be rewritten as
    w = 3
    x = w
    y = w + 4
    z = y
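
To show what a pass like constant folding actually does, here is a hedged sketch (the fold function and the tuple-shaped expression tree are assumptions of this example): it walks the tree and evaluates any operator whose operands are already constants.

    import operator

    OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def fold(node):
        """Recursively fold constant subexpressions in a tuple-shaped tree."""
        if not isinstance(node, tuple):
            return node                      # a literal number or a variable name
        op, left, right = node
        left, right = fold(left), fold(right)
        if isinstance(left, int) and isinstance(right, int):
            return OPS[op](left, right)      # both sides known: compute at compile time
        return (op, left, right)

    # a = 3 * 5 * 6 * 7 + 4 - 2, built as a nested expression tree
    expr = ("-", ("+", ("*", ("*", ("*", 3, 5), 6), 7), 4), 2)
    print(fold(expr))   # 632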

Code Generation

After all of that is done, we have optimized code; now we need to convert it into a form that the machine can actually run.

We can generate code to run on a native machine, but then the compiler is tied to a particular architecture.

Instead of instructions for some real chip, we can produce code for a hypothetical machine. This is generally called bytecode, since each instruction typically has a one-byte operation code.
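
To get a feel for it, here is a hedged sketch of a tiny stack-based VM (the opcode names and encoding are made up, not any real VM’s): it reads one-byte opcodes, pushes and pops an operand stack, and that is the whole “machine”.

    # Opcodes for a made-up stack machine.
    OP_PUSH, OP_ADD, OP_MUL, OP_PRINT = 0, 1, 2, 3

    def run(code):
        """Interpret a flat list of bytecode, using a stack for operands."""
        stack, pc = [], 0
        while pc < len(code):
            op = code[pc]
            pc += 1
            if op == OP_PUSH:
                stack.append(code[pc]); pc += 1   # operand follows the opcode
            elif op == OP_ADD:
                b, a = stack.pop(), stack.pop(); stack.append(a + b)
            elif op == OP_MUL:
                b, a = stack.pop(), stack.pop(); stack.append(a * b)
            elif op == OP_PRINT:
                print(stack.pop())

    # Compute and print (2 + 3) * 4
    run([OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_PRINT])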

The basic principle here is that the farther down the pipeline you push the architecture-specific work, the more of the earlier phases you can share across architectures.

Running it: If we compiled it to machine code, we simply tell the operating system to load the executable. If we compiled it to bytecode, we need to start up the VM and load the program into that.