# Basic N-Gram Model in Python

## 18.05.2020

I have had several encounters with language models in the past few months. So why not get into the idea behind them? Here are the basics together with a Python implementation.

### The theory

We are looking at language models of the n-gram type, specifically word-based models (another type is character-based). Given a sequence of $n - 1$ words we consider what the next word can be, that is the probability $$p(w_i | w_{i - (n - 1)} ... w_{i - 1})$$ In other words, given a sequence of words $w_{i - (n - 1)} ... w_{i - 1}$ we want to know which word follows with what probability. This already contains a simplification: instead of looking at the text as a whole, for each word we are only interested in the preceding $n - 1$ words. This is called the Markov property. It can be seen as a state machine where, history-wise, we only keep a limited number of preceding states.
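As a quick illustration of what the Markov property buys us: by the chain rule the probability of a whole text factors into conditionals on the full history, and an n-gram model (here $n = 3$) approximates each factor by conditioning only on the last two words,

$$p(w_1 \ldots w_m) = \prod_{i=1}^{m} p(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{m} p(w_i \mid w_{i-2} \, w_{i-1})$$

where the first factors simply condition on whatever shorter history is available.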

### Corpus

We use Alice in Wonderland. The text can be found here (alice.txt file). We need to turn the file into tokens and then into sequences of $n$ words.

```python
import re

# load file
def load_corpus(fname):
    f = open("corpus/" + fname, "r")
    res = f.read()
    f.close()
    return res

# create a list of tokens
def create_tokens(txt):
    # replace special chars
    txt = txt.replace("&", "and")
    # join words into one word if separated by one of the characters below
    txt = txt.replace("-", "")
    txt = txt.replace("'", "")
    txt = txt.replace("’", "")
    # allowed alphabet for words
    rx = re.compile("[a-zA-Z0-9]+")
    t = rx.findall(txt)
    return [word.lower() for word in t]

# create a list with lists of sequences
def create_seqs(n, tokens):
    res = []
    for i in range(n - 1, len(tokens)):
        s0 = []
        for j in range(i + 1 - n, i + 1):
            s0.append(tokens[j])
        res.append(s0)
    return res
```

Try that on alice.txt. Each sequence consists of a word together with the $n - 1$ words preceding it. For the text "one two three four" and $n = 2$ we get

"one","two" "two","three" "three","four"

### Creating the model

To make things more concrete, consider $n = 3$; we will use this when building the model. Given a sequence of two words $s_1 = w_1 . w_2$ we want to know the probability that a word $w_3$ follows $s_1$. This is simply estimated as $$\frac{count(w_1 . w_2 . w_3)}{count(w_1 . w_2)}$$ where $count$ counts the number of occurrences of the given sequence in the text. Now we build a model: a Python dictionary that has the tuple $(w_1,w_2)$ as key and another dictionary as value. That inner dictionary has $w_3$ as key and the frequency of the sequence $w_1 . w_2 . w_3$ as value. Lastly we turn these frequencies into probabilities by dividing by the frequency of the sequence $w_1 . w_2$:

```python
def create_model3(tokens):
    tris = create_seqs(3, tokens)
    res = {}
    # count the number of w3 given w1.w2
    for w1, w2, w3 in tris:
        if (w1, w2) not in res:
            res[(w1, w2)] = {}
        if w3 not in res[(w1, w2)]:
            res[(w1, w2)][w3] = 0
        res[(w1, w2)][w3] += 1
    # turn into probabilities
    for (seq, v) in res.items():
        seq_freq = float(sum(v.values()))
        for word in v.keys():
            res[seq][word] /= seq_freq
    return res
```
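As an illustration with made-up counts: if the corpus contained $count(w_1 . w_2) = 19$ occurrences of the pair and $count(w_1 . w_2 . w_3) = 2$ occurrences of the triple, the model would store $$p(w_3 \mid w_1 . w_2) = \frac{2}{19} \approx 0.105$$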

Note that $count(w_1 . w_2) = \sum_{w_i} count(w_1 . w_2 . w_i)$, where the sum runs over all words $w_i$ that follow $w_1 . w_2$ in the text; this is why summing the inner counts gives the right denominator. And that is it. We can run the model with the following function

```python
def compute_next_word(prev_words, model):
    res = []
    for (next_word, prob) in model[prev_words].items():
        res.append((next_word, prob))
    return res
```
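As an optional sanity check (my own addition, not part of the original walkthrough), the probabilities stored for each context should sum to 1, which is exactly the identity noted above:

```python
# optional sanity check: for every (w1, w2) context the probabilities over w3
# should sum to (roughly) 1, up to floating point error
def check_model(model, eps=1e-6):
    for seq, dist in model.items():
        total = sum(dist.values())
        assert abs(total - 1.0) < eps, (seq, total)
```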

Say we want to list the probabilities of the words following a two-word sequence; below we define a few test sequences and pick ("must","be"). We do this as follows

```python
# load corpus, create tokens and model
txt = load_corpus("alice.txt")
tokens = create_tokens(txt)
m = create_model3(tokens)

# test sequences
wseq1 = ("beginning", "to")
wseq2 = ("nor", "did")
wseq3 = ("making", "a")
wseq4 = ("must", "be")
wseq = wseq4

# compute w3 with probabilities
next_word_probs = compute_next_word(wseq, m)

# print computed probs
for (next_word, prob) in next_word_probs:
    print(next_word + " : " + str(prob))
```

For the selected sequence we get the following result

```
getting : 0.10526315789473684
shutting : 0.05263157894736842
kind : 0.05263157894736842
mabel : 0.05263157894736842
growing : 0.05263157894736842
a : 0.10526315789473684
the : 0.10526315789473684
really : 0.05263157894736842
thought : 0.05263157894736842
on : 0.05263157894736842
said : 0.05263157894736842
removed : 0.05263157894736842
off : 0.05263157894736842
more : 0.05263157894736842
collected : 0.05263157894736842
what : 0.05263157894736842
```

We can extend this by repeatedly picking one of the suggested words and using it as the second word of the next sequence. That is, if we pick "collected", we can do another run with ("be","collected"), and so on.
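A small sketch of that idea as code; note that the sampling strategy (random.choices weighted by the model probabilities) and the generate helper are my own additions, not part of the original walkthrough:

```python
import random

# a minimal text generator built on the trigram model above (sketch);
# it samples the next word from the stored probabilities and shifts the window
def generate(model, seed, length=20):
    w1, w2 = seed
    out = [w1, w2]
    for _ in range(length):
        dist = model.get((w1, w2))
        if not dist:
            break  # unseen context, nothing to sample from
        words = list(dist.keys())
        probs = list(dist.values())
        next_word = random.choices(words, weights=probs, k=1)[0]
        out.append(next_word)
        w1, w2 = w2, next_word
    return " ".join(out)

print(generate(m, ("must", "be")))
```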