Lexical Analysis

02.02.2022

Parsing a source text, $src_{txt}$, into a structure normally consists of two steps. First we split $src_{txt}$ into tokens, and then we parse these tokens to obtain a structure, normally a tree. Here the parsing procedure differs, but the lexical analysis is almost the same. Let's have a look. For example, given the C program

int main(){ print("Hello, world!\n"); }

We would probably split it into something like

int
main
(
)
{
print
(
"Hello, world!\n"
)
;
}

Here each new line is a new token. Since the resulting parser can only represent a limited number of distinct tokens, we need some kind of type for each token. It could look something like

int @ident ( ) { @ident ( @string ) ; }

@ident stands for identifier. Normally we would have to store the string value, the name of the type, or the variable identifier, since we need those later. With this parser, however, we only care about the type of the next token, hence we can discard all this information.

For this parser the following tokens have been chosen:

• Keywords are left as is. For example, we have the token int.
• @string for strings, that is anything in double quotes.
• @char for chars, that is anything in single quotes.
• @croc-dir for crocodile directives, that is anything in angle brackets, as in includes with <file.txt>.
• @num for numbers, that is both decimal and whole numbers.
• #macro for macros, that is preprocessor directives. We leave them as is; for example, include is turned into the token #include.
• Operators and parentheses are left as is.

The lexer can be found in lexer.py.
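
As a rough illustration of the token scheme above, here is a minimal sketch in Python. It is not the contents of lexer.py: the keyword subset, the regular expressions, and the names KEYWORDS, TOKEN_SPEC, MASTER, CROC_DIR and lex are all assumptions made for this example, and comments, hexadecimal literals and number suffixes are not handled.

```python
import re

# Assumed subset of C keywords; a real lexer would list all of them.
KEYWORDS = {"int", "char", "float", "double", "void", "return",
            "if", "else", "for", "while", "struct"}

# (name, pattern) pairs, tried in this order at each position.
TOKEN_SPEC = [
    ("space",  r"\s+"),
    ("macro",  r"#\s*[A-Za-z_]\w*"),
    ("string", r'"(?:\\.|[^"\\])*"'),
    ("char",   r"'(?:\\.|[^'\\])*'"),
    ("num",    r"\d+\.\d*|\.\d+|\d+"),
    ("word",   r"[A-Za-z_]\w*"),
    ("op",     r"->|\+\+|--|<<|>>|[<>=!]=|&&|\|\||[-+*/%=<>!&|^~?:.,;(){}\[\]]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))
CROC_DIR = re.compile(r"\s*<[^>\n]+>")


def lex(src):
    """Turn C source text into a list of token types, discarding values."""
    tokens = []
    pos = 0
    while pos < len(src):
        # An angle-bracket file name is only recognised right after #include,
        # so that '<' elsewhere stays an ordinary operator.
        if tokens and tokens[-1] == "#include":
            m = CROC_DIR.match(src, pos)
            if m:
                tokens.append("@croc-dir")
                pos = m.end()
                continue
        m = MASTER.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character {src[pos]!r} at {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind == "space":
            continue                      # whitespace carries no token
        if kind == "macro":
            tokens.append("#" + text[1:].strip())
        elif kind == "word":
            tokens.append(text if text in KEYWORDS else "@ident")
        elif kind == "op":
            tokens.append(text)           # operators and parentheses as is
        else:
            tokens.append("@" + kind)     # @string, @char, @num
    return tokens


print(" ".join(lex(r'int main(){ print("Hello, world!\n"); }')))
# prints: int @ident ( ) { @ident ( @string ) ; }
```

Because int is in the assumed keyword set, it is kept as is, while main and print both collapse to @ident. The check for a preceding #include is one simple way to keep <file.txt> from being read as a comparison; other designs are possible.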

Lexical Granularity

It would have been nice to distinguish between types and other identifiers. With such a distinction the network would have more to learn, but the learning process might become easier because there would be more structure to rely on. However, we can't make that distinction in the lexer, since in C types and other identifiers differ only by syntactic position: nothing in the spelling of a name like size_t marks it as a type. In a language like Haskell (and, by convention, F#), type names start with a capital letter, so there the lexer can label identifiers.
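
To make the contrast concrete, here is a small sketch of how a lexer for a language with the capital-letter rule could label identifiers. The function name label_identifier and the token @type are hypothetical; @type is not one of the tokens chosen for the C lexer.

```python
def label_identifier(name):
    """Label an identifier by its first letter (Haskell-style rule)."""
    # Assumption: '@type' is a made-up token name used only in this sketch.
    # An empty name or one starting with '_' or a lowercase letter is '@ident'.
    return "@type" if name[:1].isupper() else "@ident"


print(label_identifier("Maybe"), label_identifier("map"))
# prints: @type @ident
```

Even in Haskell this is only an approximation, since data constructors such as Just also start with a capital letter and would receive the same label.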