Introduction

02.02.2022

Contents/Index

@Introduction
Lexical Analysis
Dataset
Neural Network
Complete Me

In this project: Let's use PyTorch to build a neural network that learns the parser of the C programming language. Introduce: Cnn Parser. Normally a parser is build using a generator. The input of such a generator is a somewhat formal syntax, the output is a parser. Normally the resulting parser is table driven. It takes an input program, a source text, and by looking up values in this table it either outputs some kind of structure, or it fails with a more or less specific syntax error.

Instead of building a parser this way we can build a neural network that learns the grammar of some language, here C. An intriguing thing about neural networks is that we do not really have to understand the deeper aspects of what we are learning. We let the computer figure out the connections. Hence this parser we build is more general. If you want to build a parser for a different language, you can just train the same parser on a different data set. On the downside this kind of parser is empiric - it can only say whether some next token is the most likely given what the parser has been trained on.

Normally a parser takes a source text as input and returns a structure (syntax tree) or an error. However we are about to learn the syntax instead. This lets us predict the next token given an unfinished source text. In computer science we have the field of program synthesis. Here we are concerned with the automation of programming tasks. Probably the most widespread subject within program synthesis is that of auto complete. Every one who has used an IDE, knows how it can assist the programmer by suggestions of the next piece of code in the program. An auto complete that blindly suggest any of the tokens you have used before in the program you are writing, does not sound very helpful. Instead the suggestion can be based on the syntax of the underlying programming language, type theory, statistics of tokens already used and so on. So Cnn Parser can actually be used for something. I'm not quite sure what the state of the art within auto completion is. It is probably a mixture of different algorithms. I have for long wanted to hack a parser generator in order to make it suggest possible classes for the next part of a given program. Instead we can just learn the grammar.

For those who wants to jump right ahead: the git repository.