Dataset
The dataset consists of C programs obtained from the GNU GCC repo (I'm fairly sure) and the id Software repo. The programs vary in length. We call a tokenized C program a sample. A limit of 5000 tokens per program has been imposed: we discard longer programs for now, as well as empty programs. The stats for the dataset are:
data processed, stats:
nr prgs : 387
avg len : 1634
var len : 1430
max len : 4947
min len : 9
discarded len : 144
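As a rough illustration of how such stats could be produced, here is a minimal sketch. It is not the code from process_data.py. The function name `filter_and_summarize` is hypothetical, the limit is assumed to be inclusive, and the spread is computed as a population standard deviation, which may differ from the measure reported as "var len" above.

```python
import statistics

MAX_TOKENS = 5000  # length limit stated in the text (assumed inclusive)


def filter_and_summarize(samples, max_tokens=MAX_TOKENS):
    """Drop empty and over-long samples, then summarize the remaining lengths.

    `samples` is a list of token lists. Hypothetical helper, not the
    original implementation.
    """
    kept = [s for s in samples if 0 < len(s) <= max_tokens]
    lengths = [len(s) for s in kept]
    stats = {
        "nr prgs": len(kept),
        "avg len": statistics.mean(lengths),
        # Assumption: spread measured as population standard deviation.
        "std len": statistics.pstdev(lengths),
        "max len": max(lengths),
        "min len": min(lengths),
        "discarded": len(samples) - len(kept),
    }
    return kept, stats
```

The sketch counts both empty and over-long programs as discarded; whether the original count does the same is not stated.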
Data Preparation
We tokenize each C program, that is, we transform it into a list of tokens; the result is a sample. We store all samples in a JSON file. This is done in process_data.py. Then we load them and split them into a training set and a test set. This is supervised learning: each sample is a pair consisting of the original input and the same sequence rotated one position to the left, with the first token dropped and an <eof> token padded at the end. For the program described on the page before this, we have the original as:
@ident
@ident
(
)
{
@ident
(
@string
)
;
}
And we have rotated and padded as:
@ident
(
)
{
@ident
(
@string
)
;
}
<eof>
We treat the original sample as a data point \( x \), the input of the network, and the rotated and padded version as the target \( y \). The target is the true value that we want the prediction to match.
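The construction of such an \( (x, y) \) pair can be sketched in a few lines. This is an illustrative sketch only: the name `make_pair` is hypothetical, and the sketch assumes the target is the input with its first token dropped and a single <eof> appended, as in the example above.

```python
EOF = "<eof>"  # padding token named in the text


def make_pair(tokens):
    """Return (x, y): y is x with its first token dropped and <eof> appended.

    Hypothetical helper; y[i] is the token that follows x[i].
    """
    x = list(tokens)
    y = x[1:] + [EOF]
    return x, y


# The example program from the text.
x, y = make_pair(
    ["@ident", "@ident", "(", ")", "{", "@ident", "(", "@string", ")", ";", "}"]
)
```

Both sequences have the same length, so at every position the network sees a token of \( x \) and is asked to predict the token that follows it.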
We have the following stats for the train/test data. Here \( vocab \) is simply the set of tokens that have been seen in the C files. For example, we most likely have \( @num \in vocab \). Now:
vocab size : 118
dataset size : 387
train-set size : 310
test-set size : 77
train-set ratio : 0.8010335917312662
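A minimal sketch of how the vocabulary and the train/test split could be computed follows. The helper names `build_vocab` and `train_test_split`, the shuffling, the seed, and the target ratio of 0.8 are all assumptions made for illustration; the original script may do this differently.

```python
import random


def build_vocab(samples):
    """Collect the set of distinct tokens seen across all samples."""
    return {token for sample in samples for token in sample}


def train_test_split(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and cut it at the rounded ratio.

    Hypothetical helper; the seed only makes the sketch reproducible.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]
```

With 387 samples and a target ratio of 0.8, this rounding yields 310 training and 77 test samples, which is consistent with the sizes reported above.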
Cross-validation has been used: the training set is divided into \( k = 10 \) folds, with 9/10 used for training and 1/10 for validation. We have the following split sizes for each epoch:
train_split length : 279
val_split length : 31
Make a note
One thing to note about this parser: it is bound to learn the coding style of the coder, at least to some degree. Looking through the original programs, I have noticed a lot of one-line functions. These contain one return statement and nothing else. Furthermore, a lot of #include and #define statements are at the top of each program. This might have an impact on the parser: it might develop a belief that functions are most likely one-liners. This is left as a note. One can use a dataset with a different coding style to obtain a different parser.