Dataset

The dataset consists of C programs obtained from the GNU GCC repo (I'm somewhat sure) and the id Software repo. The programs vary in length. We call a tokenized C program a sample. A limit of 5000 tokens per program has been imposed: longer programs are discarded for now, as are empty programs. The stats for the dataset are:

data processed, stats:
  nr prgs       : 387
  avg len       : 1634
  var len       : 1430
  max len       : 4947
  min len       : 9
  discarded len : 144
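
The filtering and the statistics above could be computed roughly as in the sketch below. It is not the code from process_data.py: the function name filter_and_summarize is hypothetical, a sample is assumed to be a plain list of token strings, and it is an assumption that a program of exactly 5000 tokens is kept. The source does not say which spread measure "var len" reports, so the sketch uses the population variance as a placeholder. It also assumes that at least one sample survives the filter.

```python
from statistics import mean, pvariance

MAX_TOKENS = 5000  # token limit stated in the text


def filter_and_summarize(samples, max_tokens=MAX_TOKENS):
    """Drop empty and over-long samples, then summarize the remaining lengths.

    `samples` is assumed to be a list of token lists.
    """
    # Keep non-empty samples with at most `max_tokens` tokens (assumed inclusive).
    kept = [s for s in samples if 0 < len(s) <= max_tokens]
    lengths = [len(s) for s in kept]
    stats = {
        "nr prgs": len(kept),
        "avg len": mean(lengths),
        # Placeholder spread measure: population variance of the lengths.
        "var len": pvariance(lengths),
        "max len": max(lengths),
        "min len": min(lengths),
        "discarded": len(samples) - len(kept),
    }
    return kept, stats
```

Calling filter_and_summarize on the loaded token lists would return both the surviving samples and a dictionary with the same kind of figures as the listing above.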

Data Preparation

We tokenize each C program, that is, we transform it into a list of tokens; the result is a sample. All samples are stored in a JSON file. This is done in process_data.py. We then load the samples and split them into a training set and a test set. This is supervised learning: each training example is a pair consisting of the original sample and the same sample rotated by one position (the first token is dropped, so each target token is the token that follows the corresponding input token) and padded with an <eof> token at the end. For the program described on the previous page, the original is:

@ident
@ident
(
)
{
@ident
(
@string
)
;
}

The rotated and padded version is:

@ident
(
)
{
@ident
(
@string
)
;
}
<eof>

We treat the original sample as a data point \( x \), that is, the input to the network. We treat the rotated and padded version as the target \( y \): the true value that the prediction should match.
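
The construction of a pair \( (x, y) \) from one sample can be sketched as follows. The helper name make_pair is hypothetical and the sample is assumed to be a list of token strings; the actual code may represent tokens differently (for example as integer ids).

```python
EOF = "<eof>"  # padding token named in the text


def make_pair(sample):
    """Return (x, y): x is the sample, y is x without its first token plus <eof>."""
    x = list(sample)
    # Drop the first token and pad at the end, so that y[i] is the token after x[i].
    y = x[1:] + [EOF]
    return x, y


tokens = ["@ident", "@ident", "(", ")", "{",
          "@ident", "(", "@string", ")", ";", "}"]
x, y = make_pair(tokens)
```

Because one token is dropped and one is appended, \( x \) and \( y \) always have the same length, and at every position the target is the next token of the input.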

The stats for the train/test data follow. Here \( vocab \) is simply the set of tokens seen in the C files; for example, we most likely have \( @num \in vocab \).

vocab size      : 118
dataset size    : 387
train-set size  : 310
test-set size   : 77
train-set ratio : 0.8010335917312662
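
A minimal sketch of how the vocabulary and the train/test split could be produced is given below. The names build_vocab and train_test_split are hypothetical, as are the shuffling and the fixed seed; the source does not describe how the split is drawn. Rounding the training size to the nearest integer is likewise an assumption, chosen because it reproduces the 310/77 split for 387 samples.

```python
import random


def build_vocab(samples):
    """Collect the set of distinct tokens seen across all samples."""
    return {token for sample in samples for token in sample}


def train_test_split(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into a training and a test part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = round(len(shuffled) * train_ratio)  # assumed rounding rule
    return shuffled[:n_train], shuffled[n_train:]
```

With 387 samples and a ratio of 0.8 this yields 310 training samples and 77 test samples, so the effective ratio 310/387 is slightly above 0.8.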

Cross-validation has been used: the training set is divided into \( k = 10 \) folds, with 9/10 used for training and 1/10 for validation. The split sizes for each epoch are:

train_split length : 279
val_split length   : 31
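
The fold sizes can be checked with a small sketch of a k-fold split. The generator k_fold_splits is a hypothetical name, and the choice of contiguous folds is an assumption. If the number of samples is not divisible by k, this sketch leaves the remainder in the training part of every split.

```python
def k_fold_splits(samples, k=10):
    """Yield (train_split, val_split) pairs, one per fold."""
    fold_size = len(samples) // k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        val_split = samples[start:stop]                  # the held-out fold
        train_split = samples[:start] + samples[stop:]   # everything else
        yield train_split, val_split


train_set = list(range(310))  # stand-in for the 310 training samples
sizes = [(len(t), len(v)) for t, v in k_fold_splits(train_set)]
```

For 310 training samples and \( k = 10 \), every split has 279 training samples and 31 validation samples, which matches the sizes listed above.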

A note on coding style

One thing to note about this parser: it is bound to learn the coding style of the programmers, at least to some degree. Looking through the original programs, I noticed many one-line functions that contain a single return statement and nothing else. Furthermore, many #include and #define statements appear at the top of each program. This may affect the parser: it may develop a belief that functions are most likely one-liners. This is left as a note. A dataset with a different coding style would yield a different parser.