The model is based on Lewis Carroll's masterpiece Alice's Adventures in Wonderland from 1865. The text is available on the project's GitHub page. It is in the public domain, meaning it is free to use as one sees fit. If you haven't read it, go do so!
We are creating an $n$-gram model. The $n$ stands for how many consecutive words we take into account as the basis for the resulting statistical model. The linked guide uses 50; we lower it to 40 in order to increase accuracy. As in the guide, we aim for an accuracy of roughly 50%, give or take. To create the model, the text must first be parsed into sequences of 41 words (not 40, since we add the current word). This is done by traversing the text from word 41 to the last one. For each word, we take the preceding 40 words and append the word itself, making a sequence of 41 words. A text of $N$ words therefore yields $N - 40$ sequences.
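The sliding window described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from create_seqs.py: the function name `make_sequences` and its `context` parameter are hypothetical, and the input is assumed to be a list of already tokenized words.

```python
def make_sequences(tokens, context=40):
    """Return every run of context + 1 consecutive tokens.

    Hypothetical helper for illustration; `tokens` is assumed to be
    a list of words that has already been tokenized.
    """
    sequences = []
    # Start at the first word that has `context` words before it
    # (word 41 when context is 40) and walk to the last word.
    for i in range(context, len(tokens)):
        # The preceding `context` words plus the current word.
        sequences.append(tokens[i - context:i + 1])
    return sequences
```

With a context of 3 and a text of 10 words, this produces 7 sequences of 4 words each, the first covering words 1 to 4 and the last covering words 7 to 10. A text shorter than `context + 1` words produces no sequences at all.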
Before we can create sequences, we need to tokenize the text. This is done using regular expressions. We concatenate hyphenated words like rain-coat into a single word. We also simply remove symbols within words, such as the apostrophe in Martin's jacket.
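A tokenizer along these lines might look like the sketch below. The exact patterns used in create_seqs.py are not shown here, so everything in this block is an assumption: the function name `tokenize` is hypothetical, hyphenated words are assumed to become one word with the hyphen dropped, and a token is assumed to be any run of ASCII letters, with punctuation and digits discarded.

```python
import re


def tokenize(text):
    """Split text into word tokens (illustrative sketch only)."""
    # Join hyphenated words: rain-coat -> raincoat (assumed behavior).
    text = re.sub(r"(?<=[A-Za-z])-(?=[A-Za-z])", "", text)
    # Remove apostrophes inside words: Martin's -> Martins.
    # Both the straight and the curly apostrophe are handled.
    text = re.sub(r"(?<=[A-Za-z])['’](?=[A-Za-z])", "", text)
    # Assumption: a token is a run of ASCII letters; everything
    # else (punctuation, digits, whitespace) acts as a separator.
    return re.findall(r"[A-Za-z]+", text)
```

For example, `tokenize("Martin's rain-coat, please!")` returns `["Martins", "raincoat", "please"]`. Case is left untouched in this sketch; whether to lowercase the text is a separate design choice that affects vocabulary size.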
We separate the resulting sequences with newlines and save them to a file. The whole sequencing program can be found in create_seqs.py. The saved sequences are found in models/alice_botta_40_seqs.txt.
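As a rough sketch of the saving step, the snippet below writes one sequence per line and reads the file back. The helper names `save_sequences` and `load_sequences` are hypothetical, and joining the words of each sequence with a single space is an assumption about the file layout, not something taken from create_seqs.py.

```python
from pathlib import Path


def save_sequences(sequences, path):
    """Write one sequence per line (hypothetical helper).

    Assumption: the words of a sequence are joined by single spaces.
    """
    path = Path(path)
    # Create the target directory (for example models/) if needed.
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [" ".join(seq) for seq in sequences]
    # Sequences are separated by newlines.
    path.write_text("\n".join(lines), encoding="utf-8")


def load_sequences(path):
    """Read the file back into a list of word lists."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.split() for line in text.splitlines()]
```

Keeping one sequence per line makes the file easy to inspect by hand and lets a later training step read it back with a plain line split.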