Neural Network

Neural networks (NNs) have been used for, among other things, natural language processing. One task within this field is to construct a language model: for each word in an input text, we predict the most likely next word. Several types of neural network can be used for this. For now we concentrate on recurrent neural networks, and of these we use an LSTM (Long Short-Term Memory) network. Next-word prediction is exactly what we want to achieve, so we are simply building an LSTM language model, here using PyTorch.

Training

An LSTM has a dynamic memory that covers both the short term and the long term. It lives in a set of hidden states, which we can reset whenever we see fit. For a natural-language model you would probably reset after a sentence or a short text, maybe a paragraph, depending on what is possible. Here I have chosen to reset after each sample, that is, after each C program. It might be better to reset after each statement, but this is not possible in C without some parsing, since simple lexical analysis has a hard time keeping track of nested statements.
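
As a rough illustration of resetting per sample, the sketch below passes None as the hidden state at the start of every sample, which makes PyTorch's nn.LSTM start from zero states. The layer sizes and the random samples are placeholders chosen for the example, not the values or data used in this project.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only; they are not the sizes used in the model.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2)

# Two hypothetical samples of different length, shape (seq_len, batch=1, features).
samples = [torch.randn(5, 1, 8), torch.randn(3, 1, 8)]

final_states = []
for sample in samples:
    hidden = None  # reset: nn.LSTM starts from zero states when given None
    output, hidden = lstm(sample, hidden)
    final_states.append(hidden)

# Each final state is a pair (h_n, c_n) of shape (num_layers, batch, hidden_size).
h_n, c_n = final_states[0]
```

Carrying memory across samples instead would mean passing the previous hidden pair into the next call rather than None.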

The vocabulary has a size, namely the number of distinct tokens. Some special tokens have been added, for example the eof token. For the network to scale we also need an unk token, which stands in for tokens outside the vocabulary. A few more special tokens, for example pad, have been added but are not needed for now.
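
A minimal sketch of such a vocabulary is shown below. The spellings of the special tokens, the helper names build_vocab and encode, and the min_count parameter are all assumptions made for the example; the project's own code may differ.

```python
from collections import Counter

# Hypothetical spellings; the text only names the tokens eof, unk and pad.
PAD, UNK, EOF = "<pad>", "<unk>", "<eof>"


def build_vocab(token_lists, min_count=1):
    """Map each token seen at least min_count times to an integer id."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = [PAD, UNK, EOF]  # special tokens get the first ids
    itos += sorted(t for t, c in counts.items() if c >= min_count and t not in itos)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


def encode(tokens, stoi):
    """Turn tokens into ids, append eof, and fall back to unk for unseen tokens."""
    unk_id = stoi[UNK]
    return [stoi.get(tok, unk_id) for tok in tokens] + [stoi[EOF]]


samples = [["int", "x", "=", "0", ";"], ["int", "y", ";"]]
stoi, itos = build_vocab(samples)
ids = encode(["int", "z", ";"], stoi)  # "z" was never seen, so it maps to unk
```

Here the vocabulary size is len(itos), and any identifier that did not occur in the training samples is encoded as unk instead of causing a lookup error.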

Each token is then transformed into an embedding - the number of distinct embeddings equals the size of the vocabulary.

For now the batch size is set to 1, so we do not have to pad sequences. I have not constructed the network so that the batch size can be altered; this would require some tinkering. A single epoch takes 50 or so seconds, so I haven't bothered looking into batches.

The network is built from the following layers:

The input is sent through an embedding layer of dim 128. That is, the input is transformed from a dimension of vocab size into dim 128. Embedding layers are useful when tokens may be similar in some way. For example ( and { have something in common. An embedding layer tries to learn such commonalities across tokens.
The embedding layer output is sent through two LSTM layers of dim 256 with a dropout of 0.2.
A linear transformation then maps the output of the last LSTM layer back into a dimension matching the vocab size.
Lastly, log_softmax is applied to the output of this linear transformation. Note that the result is a vector whose length is the vocab size. The index of the largest value in this vector is the most likely next token, the index of the second largest is the second most likely, and so on.
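
The layers described above could be sketched in PyTorch roughly as follows. The class name TokenLSTM, the vocabulary size of 50 and the random input are assumptions for the example. Passing the dropout of 0.2 to nn.LSTM, where it acts between the two LSTM layers, is also an assumption about where the dropout sits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenLSTM(nn.Module):
    """Embedding -> 2-layer LSTM -> linear -> log_softmax."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 num_layers=2, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Assumption: dropout is applied between the two stacked LSTM layers.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=num_layers, dropout=dropout)
        self.linear = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, hidden=None):
        # token_ids: (seq_len, batch) tensor of integer token ids
        embedded = self.embedding(token_ids)          # (seq_len, batch, 128)
        output, hidden = self.lstm(embedded, hidden)  # (seq_len, batch, 256)
        logits = self.linear(output)                  # (seq_len, batch, vocab_size)
        return F.log_softmax(logits, dim=-1), hidden


vocab_size = 50  # hypothetical vocabulary size
model = TokenLSTM(vocab_size)
model.eval()  # disable dropout for a deterministic forward pass

token_ids = torch.randint(0, vocab_size, (7, 1))  # one sample of 7 tokens, batch size 1
with torch.no_grad():
    log_probs, _ = model(token_ids)

# Index of the largest log-probability after the last input token.
best_next = log_probs[-1, 0].argmax().item()
```

Because the output is a log-probability for every token in the vocabulary, sorting one row of log_probs gives a ranked list of suggestions rather than a single guess.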

Validation

We do validation after each training run. We use a cross-validation split of size 10. We calculate accuracy: the ratio of correct predictions to the total number of predictions. In addition, a 3-accuracy has been added. Here we check, for each correct \( y \), whether it is among the top 3 predictions - remember that the model returns a vector of scores over all possible next tokens, so we can sort it to obtain the \( n \) most likely.
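
The two measures can be illustrated with a small sketch in plain Python. The function name top_k_accuracy and the toy scores are made up for the example; with k=1 it gives the ordinary accuracy and with k=3 the 3-accuracy.

```python
def top_k_accuracy(scores, targets, k=3):
    """Fraction of positions whose target id is among the k highest-scoring ids."""
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


# Hypothetical log-probabilities over a 4-token vocabulary for three positions.
scores = [
    [-0.2, -2.0, -3.0, -4.0],  # ranking of token ids: 0, 1, 2, 3
    [-3.0, -0.5, -1.5, -2.5],  # ranking of token ids: 1, 2, 3, 0
    [-1.0, -1.2, -3.5, -0.9],  # ranking of token ids: 3, 0, 1, 2
]
targets = [0, 2, 2]  # the correct next token id at each position

accuracy = top_k_accuracy(scores, targets, k=1)   # only the first position is a hit
accuracy3 = top_k_accuracy(scores, targets, k=3)  # the first two positions are hits
```

In this toy case the accuracy is 1/3 and the 3-accuracy is 2/3, since the second target is ranked second and therefore only counts when the top 3 are considered.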

The reason for 3-accuracy is that, depending on the style of the C coder, samples can vary a lot in length. A C function can be short or long. Are functions with few statements more common than those with many? We can answer that for the dataset by inspecting it, but can we generalize? So it can be hard to tell the exact next token. Instead we opt for suggestions.

The program

You can do runs using the file run_model.py, for example by running python3 run_model.py in a Linux terminal. The script then asks whether it should train, test or validate.

The current training status is found in the file cnn_model5.acc.txt. A run on the test set currently gives:

********eval on test set
test set size : 77
accuracy : 0.6686517073170731
accuracy3 : 0.9092058536585366
took : 33.67 seconds