# Dataset

## 02.02.2022

### Contents/Index

- Introduction
- Lexical Analysis
- @Dataset
- Neural Network
- Complete Me

The dataset consists of C programs obtained from the GNU GCC repo (I'm somewhat sure) and the id Software repo. The programs vary in length. We call a tokenized C program a sample. A limit of 5000 tokens per program has been imposed: longer programs are discarded for now, as are empty programs. The stats for the dataset are:

data processed, stats:

| stat | value |
|---|---|
| nr prgs | 387 |
| avg len | 1634 |
| var len | 1430 |
| max len | 4947 |
| min len | 9 |
| discarded len | 144 |
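The filtering rule and the summary above can be sketched as follows. This is only a minimal sketch and not the code in process_data.py: the names `filter_and_summarize` and `MAX_TOKENS` are hypothetical, the limit is assumed to be inclusive, and since the source does not say whether "var len" is a variance or a standard deviation, the sketch reports a population standard deviation under its own label.

```python
from statistics import mean, pstdev

MAX_TOKENS = 5000  # program limit from the text; assumed to be inclusive


def filter_and_summarize(samples, max_tokens=MAX_TOKENS):
    """Drop empty and over-long samples, then summarize the remaining lengths."""
    kept = [s for s in samples if 0 < len(s) <= max_tokens]
    lengths = [len(s) for s in kept]
    stats = {
        "nr prgs": len(kept),
        "avg len": mean(lengths),
        # Assumption: spread reported as a population standard deviation.
        "std len": pstdev(lengths),
        "max len": max(lengths),
        "min len": min(lengths),
        "discarded": len(samples) - len(kept),
    }
    return kept, stats


# Toy input: four tokenized programs, one empty and one over the limit.
toy = [["@ident", "(", ")"], [], ["@num"] * 5001, ["@ident", ";"]]
kept, stats = filter_and_summarize(toy)
print(stats)
```

On the toy input, two programs are kept and two are discarded; the real numbers in the table come from the actual repositories, not from this sketch.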

## Data Preparation

We tokenize each C program, that is, we transform it into a list of tokens; the result is a sample. All samples are stored in a JSON file. This is done in process_data.py. We then load the samples and split them into a training set and a test set. This is supervised learning: each sample becomes a pair consisting of the original input and the same sequence rotated by one position, so that the first token is dropped, and padded at the end with an `<eof>` token. For the program described on the previous page, the original is:

`@ident @ident ( ) { @ident ( @string ) ; }`

And the rotated and padded version is:

`@ident ( ) { @ident ( @string ) ; } <eof>`

We treat the original sample as a data point $x$, the input to the network. The rotated and padded version is the target $y$: the true value that we want the prediction to match.
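The construction of a pair $(x, y)$ can be sketched in a few lines. The helper name `make_pair` and the constant `EOF` are hypothetical and only serve the illustration; the actual code in process_data.py may differ.

```python
EOF = "<eof>"  # padding token appended to the target


def make_pair(tokens):
    """Return (x, y), where y is x without its first token plus a trailing <eof>."""
    x = list(tokens)
    y = x[1:] + [EOF]
    return x, y


# The example program from the text, already tokenized.
sample = "@ident @ident ( ) { @ident ( @string ) ; }".split()
x, y = make_pair(sample)
print(" ".join(x))
print(" ".join(y))
```

With this construction $x$ and $y$ have the same length, and the target at position $i$ is the input token at position $i + 1$, so the network is trained to predict the next token.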

The train/test data has the following stats. Here $vocab$ is simply the set of tokens that have been seen in the C files. For example, we most likely have $@num \in vocab$.

| stat | value |
|---|---|
| vocab size | 118 |
| dataset size | 387 |
| train-set size | 310 |
| test-set size | 77 |
| train-set ratio | 0.8010335917312662 |
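As a rough illustration of how such numbers can arise, the sketch below builds a vocabulary and performs an 80/20 split. The function names, the shuffling with a fixed seed, and rounding the training size up are all assumptions; the rounding rule was chosen only because it reproduces the 310/77 split for 387 samples.

```python
import math
import random


def build_vocab(samples):
    """Collect the set of distinct tokens seen across all samples."""
    return sorted({tok for sample in samples for tok in sample})


def train_test_split(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and cut it at ceil(train_ratio * n)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = math.ceil(train_ratio * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Dummy data with the same size as the real dataset (387 samples).
dummy = [["@ident", "=", "@num", ";"] for _ in range(387)]
vocab = build_vocab(dummy)
train_set, test_set = train_test_split(dummy)
ratio = len(train_set) / len(dummy)
print(len(vocab), len(train_set), len(test_set), ratio)
```

The ratio is slightly above 0.8 because 387 is not divisible by 5: 0.8 × 387 = 309.6, which is rounded to 310 training samples.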

Cross-validation has been used: the training set is divided into $k = 10$ folds, with 9/10 used for training and 1/10 for validation. We have the following split sizes for each epoch:

| split | length |
|---|---|
| train_split | 279 |
| val_split | 31 |
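The split sizes follow directly from the training-set size: 310 / 10 = 31 samples for validation and 310 - 31 = 279 for training. A minimal sketch of such a $k$-fold split is given below; the generator `k_fold_splits` is a hypothetical helper using contiguous folds, and how the actual code selects folds is not specified in these notes.

```python
def k_fold_splits(train_set, k=10):
    """Yield (train_split, val_split) pairs, one for each of the k folds."""
    n = len(train_set)
    fold = n // k
    for i in range(k):
        start = i * fold
        # The last fold absorbs any remainder when n is not divisible by k.
        stop = start + fold if i < k - 1 else n
        val_split = train_set[start:stop]
        train_split = train_set[:start] + train_set[stop:]
        yield train_split, val_split


# Stand-in for the 310 training samples: plain indices.
splits = list(k_fold_splits(list(range(310)), k=10))
for train_split, val_split in splits[:2]:
    print(len(train_split), len(val_split))
```

Each sample appears in exactly one validation split, so over the ten folds every training sample is used for validation once.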

## Make a note

One thing to note about this parser: it is bound to learn the coding style of the coder, at least to some degree. Looking through the original programs, I have noticed a lot of one-line functions. These contain one return statement and nothing else. Furthermore, a lot of #include and #define statements appear at the top of each program. This might have an impact on the parser: it might develop a belief that functions are most likely one-liners. This is left as a note. One can use a different dataset with a different coding style to obtain a different parser.