Given some program, can I identify which lexemes (or tokens) are present in it? For example, consider this program in elden (yes, that is what I am calling it):
main() = {
    if x == 4 {
        print "x is equal to 4";
    } else {
        print "x is not equal to 4";
    }
}
We need to identify that main is a keyword, that the parentheses contain the arguments, that the left curly brace starts the function body, and so on.
We make an enum called Token to identify these “parts” of the program:
pub enum Token {
    Delimiter(Delimiter),
    Operator(Operator),
    Literal(Type),
    Keyword(Keyword),
}
I have abstracted away the implementation details of each variant; here is the overview:
- A delimiter includes braces, brackets, and the semicolon. A slightly quirky delimiter is the double quote: whatever sits between a pair of quotes is a string literal, which we need to process as soon as we encounter the opening quote. Otherwise, we won’t know whether the text is a variable or a literal.
- An operator includes add, subtract, multiply, equality, and less than. Something to keep in mind when lexing: some operators are two characters long, like == or !=, so a single = cannot be emitted before checking the character after it.
- A literal is any identifier (variable), string, or number that is used (hard-coded) in the program.
- Keywords are strings reserved to have a specific meaning and usage, such as if, else, and print.
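To make the overview concrete, here is a minimal sketch of what the inner enums could look like. The variant names are read off the example lexer output at the end of this section; the derives, the i64 number type, and the fact that only these variants are listed are my assumptions, not necessarily how elden defines them.

```rust
// Hypothetical definitions: variant names come from the example lexer output,
// everything else (derives, field types) is an assumption for illustration.

#[derive(Debug, Clone, PartialEq)]
pub enum Delimiter {
    LeftParen,  // (
    RightParen, // )
    LeftBrace,  // {
    RightBrace, // }
    SemiColon,  // ;
}

#[derive(Debug, Clone, PartialEq)]
pub enum Operator {
    Equal,      // =
    EqualEqual, // ==
    // add, subtract, multiply, less than, != ... would be listed here too
}

#[derive(Debug, Clone, PartialEq)]
pub enum Type {
    Identifier(String), // a variable name such as x
    Number(i64),        // assumed to be an integer
    String(String),     // the text between a pair of double quotes
}

#[derive(Debug, Clone, PartialEq)]
pub enum Keyword {
    Main,
    If,
    Else,
    Print,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Delimiter(Delimiter),
    Operator(Operator),
    Literal(Type),
    Keyword(Keyword),
}
```

Deriving Debug is what produces output in the Keyword(Main) or Literal(Identifier("x")) shape when a token is printed with {:?}.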
How it works:
- We scan characters of one type until we hit a character of a different type. For example, when lexing “let x = 5”: scan let (stop since ’ ’ is encountered), scan ’ ’ (stop since a letter is encountered), scan x (stop since ’ ’ is encountered), scan ’ ’ (stop since = is encountered), …
- Once a token is scanned, the function returns (Token, remaining). We then call it again on the remaining input. So:
Lexer("let x = 5") -> (Token(let), "x = 5")
Lexer("x = 5") -> (Token(x), "= 5")
Lexer("= 5") -> (Token(=), "5")
Lexer("5") -> (Token(5), "")
Collecting the tokens gives: Token(let), Token(x), Token(=), Token(5)
Example output of tokens from running the lexer on the elden program above:
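The scan-and-return loop described above can be sketched as follows. This is a rough illustration under stated assumptions, not the actual elden lexer: the names take_while, lex_one, and lex are hypothetical, Keyword::Let is assumed from the “let x = 5” example, only = and == are handled as operators, and whitespace is skipped at the start of each call instead of being scanned as its own run.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Delimiter { LeftParen, RightParen, LeftBrace, RightBrace, SemiColon }
#[derive(Debug, Clone, PartialEq)]
pub enum Operator { Equal, EqualEqual }
#[derive(Debug, Clone, PartialEq)]
pub enum Type { Identifier(String), Number(i64), String(String) }
#[derive(Debug, Clone, PartialEq)]
pub enum Keyword { Main, If, Else, Print, Let } // Let is an assumption
#[derive(Debug, Clone, PartialEq)]
pub enum Token { Delimiter(Delimiter), Operator(Operator), Literal(Type), Keyword(Keyword) }

/// Splits `input` at the first character that fails `pred`.
fn take_while(input: &str, pred: impl Fn(char) -> bool) -> (&str, &str) {
    let end = input.find(|c: char| !pred(c)).unwrap_or(input.len());
    input.split_at(end)
}

/// Scans one token and returns (token, remaining input).
/// Returns None at end of input or on a character this sketch does not know.
pub fn lex_one(input: &str) -> Option<(Token, &str)> {
    let input = input.trim_start(); // whitespace is skipped, never emitted
    let first = input.chars().next()?;

    // A run of letters/digits/underscores is a keyword or an identifier.
    if first.is_ascii_alphabetic() || first == '_' {
        let (word, rest) = take_while(input, |c| c.is_ascii_alphanumeric() || c == '_');
        let token = match word {
            "main" => Token::Keyword(Keyword::Main),
            "if" => Token::Keyword(Keyword::If),
            "else" => Token::Keyword(Keyword::Else),
            "print" => Token::Keyword(Keyword::Print),
            "let" => Token::Keyword(Keyword::Let),
            _ => Token::Literal(Type::Identifier(word.to_string())),
        };
        return Some((token, rest));
    }
    // A run of digits is a number literal.
    if first.is_ascii_digit() {
        let (digits, rest) = take_while(input, |c| c.is_ascii_digit());
        let n = digits.parse::<i64>().ok()?;
        return Some((Token::Literal(Type::Number(n)), rest));
    }
    // An opening quote: everything up to the closing quote is a string literal.
    if first == '"' {
        let body = &input[1..];
        let end = body.find('"')?; // None if the string is never closed
        let text = body[..end].to_string();
        return Some((Token::Literal(Type::String(text)), &body[end + 1..]));
    }
    // Two-character operators must be tried before their one-character prefix.
    if let Some(rest) = input.strip_prefix("==") {
        return Some((Token::Operator(Operator::EqualEqual), rest));
    }
    let token = match first {
        '=' => Token::Operator(Operator::Equal),
        '(' => Token::Delimiter(Delimiter::LeftParen),
        ')' => Token::Delimiter(Delimiter::RightParen),
        '{' => Token::Delimiter(Delimiter::LeftBrace),
        '}' => Token::Delimiter(Delimiter::RightBrace),
        ';' => Token::Delimiter(Delimiter::SemiColon),
        _ => return None,
    };
    Some((token, &input[first.len_utf8()..]))
}

/// Calls lex_one on the remaining input until nothing more can be scanned.
pub fn lex(mut input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    while let Some((token, rest)) = lex_one(input) {
        tokens.push(token);
        input = rest;
    }
    tokens
}
```

Calling lex("let x = 5") walks the same steps as the trace above, one lex_one call per token. Note that this sketch stops silently at the first character it does not recognise; a real lexer would more likely return an error there.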
[Keyword(Main), Delimiter(LeftParen), Delimiter(RightParen), Operator(Equal),
Delimiter(LeftBrace), Keyword(If), Literal(Identifier("x")), Operator(EqualEqual),
Literal(Number(4)), Delimiter(LeftBrace), Keyword(Print),
Literal(String("x is equal to 4")), Delimiter(SemiColon), Delimiter(RightBrace),
Keyword(Else), Delimiter(LeftBrace), Keyword(Print),
Literal(String("x is not equal to 4")), Delimiter(SemiColon),
Delimiter(RightBrace), Delimiter(RightBrace)]