Lexical Analysis

Parsing a source text, \( src_{txt} \), into a structure normally consists of two steps: first we split \( src_{txt} \) into tokens, and then we parse these tokens to obtain a structure, normally a tree. Here the parsing procedure differs, but the lexical analysis is almost the same. Let's have a look. For example, given the C program

int main(){ printf("Hello, world!\n"); }

we would probably split it into something like

int main ( ) { printf ( "Hello, world!\n" ) ; }

Here each whitespace-separated item is a token. Since the resulting parser is limited in how many distinct tokens it can represent, we need some kind of type for each token. It could look something like

@ident @ident ( ) { @ident ( @string ) ; }

@ident stands for identifier. Normally we would also have to store the string value (the name of the type or the variable) since we need it later. With this parser, however, we only care about the type of the next token, so we can discard all of that information.

For this parser the following tokens have been chosen:

The lexer can be found in lexer.py.
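
To make this concrete, here is a minimal sketch of such a type-only lexer in Python. It is not the actual lexer.py; the regular expressions, the punctuation set, and the internal names are illustrative assumptions that merely reproduce the token stream shown above.

import re

# Each entry is (token type, regex). String values and identifier names are
# matched but their spelling is thrown away, exactly as described above.
TOKEN_SPEC = [
    ("@string", r'"(?:\\.|[^"\\])*"'),  # string literal
    ("@ident",  r"[A-Za-z_]\w*"),       # identifier (types included, see below)
    ("@punct",  r"[(){}\[\];,]"),       # punctuation, one token type per symbol
    ("_skip",   r"\s+"),                # whitespace, never emitted
]

LEXER_RE = re.compile(
    "|".join(f"(?P<T{i}>{pattern})" for i, (_, pattern) in enumerate(TOKEN_SPEC))
)

def lex(src_txt):
    """Yield the type of each token in src_txt; the token text is discarded."""
    for m in LEXER_RE.finditer(src_txt):
        kind = TOKEN_SPEC[int(m.lastgroup[1:])][0]
        if kind == "_skip":
            continue
        # Punctuation uses its own symbol as the token type, e.g. "(" or ";".
        yield m.group() if kind == "@punct" else kind

print(" ".join(lex('int main(){ printf("Hello, world!\\n"); }')))
# prints: @ident @ident ( ) { @ident ( @string ) ; }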

Lexical Granularity

It would have been nice to distinguish types from identifiers. On the one hand the network would then have more token types to learn; on the other hand, the extra structure might make learning easier. But we can't make that distinction here, since in C types and other identifiers are told apart only by their syntactic position. In a language like Haskell or F#, types start with a capital letter, so there the lexer itself could label them.
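
As a rough illustration, a lexer for such a language could label names purely from their spelling. The token names @type and @ident below are an assumption, kept in the style of the earlier examples.

def label_name(name):
    # Capitalised names are type constructors in Haskell/F#-style syntax.
    return "@type" if name[:1].isupper() else "@ident"

print([label_name(n) for n in ["Maybe", "map", "Either", "x"]])
# prints: ['@type', '@ident', '@type', '@ident']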
