
Basic N-Gram Model in Python

18.05.2020

I have had several encounters with language models in the past few months, so why not look into the idea behind them. Here are the basics along with a Python implementation.

The theory

We are looking at language models of the n-gram type, specifically word based models (another type is character based). Given a sequence of $n - 1$ words we consider what the next word can be, that is the probability $$ p(w_i | w_{i - (n - 1)} ... w_{i - 1}) $$ So given a sequence of words $w_{i - (n - 1)} ... w_{i - 1}$ we want to know which words can follow and with what probability. Furthermore we simplify the problem: instead of looking at the text as a whole, for each word we are only interested in the preceding $n - 1$ words. This is called the Markov property. The model can be seen as a state machine where, history wise, we only keep track of a limited number of preceding states.
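For example, with $n = 3$ (the trigram case used below) the Markov property says that the full history is approximated by just the two preceding words: $$ p(w_i | w_1 ... w_{i - 1}) \approx p(w_i | w_{i - 2} \, w_{i - 1}) $$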

Corpus

We use Alice in Wonderland. The text can be found here (the alice.txt file). We need to turn the file into tokens and then into sequences of $n$ words.

import re

# load file
def load_corpus(fname):
    f = open("corpus/" + fname, "r")
    res = f.read()
    f.close()
    return res

# create a list of tokens
def create_tokens(txt):
    # replace special chars
    txt = txt.replace("&", "and")
    # join words into one word if separated by one of the chars below
    txt = txt.replace("-", "")
    txt = txt.replace("'", "")
    txt = txt.replace("’", "")
    # allowed alphabet for words
    rx = re.compile("[a-zA-Z0-9]+")
    t = rx.findall(txt)
    return [word.lower() for word in t]

# create a list with lists of sequences
def create_seqs(n, tokens):
    res = []
    for i in range(n - 1, len(tokens)):
        s0 = []
        for j in range(i + 1 - n, i + 1):
            s0.append(tokens[j])
        res.append(s0)
    return res

Try that on alice.txt. Each sequence consists of a word together with the $n - 1$ words preceding it. For the text "one two three four" and $n = 2$ we get

"one","two" "two","three" "three","four"

Creating the model

To make things more concrete consider $n = 3$, which is what we will use when building the model. Given a sequence of two words $s_1 = w_1 . w_2$ we want to know the probability that a word $w_3$ follows $s_1$. This is simply estimated as $$ \frac{count(w_1 . w_2 . w_3)}{count(w_1 . w_2)} $$ where $count$ counts the number of occurrences of the given sequence in the text. We now build the model as a Python dictionary that has the tuple $(w_1,w_2)$ as key and another dictionary as value; this inner dictionary has $w_3$ as key and the frequency of the sequence $w_1 . w_2 . w_3$ as value. Lastly we turn these frequencies into probabilities by dividing by the frequency of the sequence $w_1 . w_2$:

def create_model3(tokens):
    tris = create_seqs(3, tokens)
    res = {}
    # count the number of w3 given w1.w2
    for w1, w2, w3 in tris:
        if not (w1, w2) in res:
            res[(w1, w2)] = {}
        if not w3 in res[(w1, w2)]:
            res[(w1, w2)][w3] = 0
        res[(w1, w2)][w3] += 1
    # turn into probabilities
    for (seq, v) in res.items():
        seq_freq = float(sum(v.values()))
        for word in v.keys():
            res[seq][word] /= seq_freq
    return res
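To see the structure of the resulting dictionary, here is a small sketch on a made-up token list (the toy tokens are just for illustration, not from the corpus):

# toy token list where ("one","two") is followed twice by "three" and once by "four"
toy_tokens = ["one", "two", "three", "one", "two", "three", "one", "two", "four"]
m_toy = create_model3(toy_tokens)
print(m_toy[("one", "two")])
# prints {'three': 0.6666666666666666, 'four': 0.3333333333333333}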

Note that $count(w_1 . w_2) = \sum_{w_i} count(w_1 . w_2 . w_i)$ where the sum runs over all possible $w_i$, so summing the inner counts gives exactly the denominator we need. And that is it. We can run the model with the following function

def compute_next_word(prev_words, model):
    res = []
    for (next_word, prob) in model[prev_words].items():
        res.append((next_word, prob))
    return res

Say we want to list the probabilities of words following a two-word sequence such as ("beginning","to"). Below we set up a few test sequences and select wseq4, that is ("must","be"):

# load corpus, create tokens and model
txt = load_corpus("alice.txt")
tokens = create_tokens(txt)
m = create_model3(tokens)

# test sequences
wseq1 = ("beginning", "to")
wseq2 = ("nor", "did")
wseq3 = ("making", "a")
wseq4 = ("must", "be")
wseq = wseq4

# compute w3 with probabilities
next_word_probs = compute_next_word(wseq, m)

# print computed probs
for (next_word, prob) in next_word_probs:
    print(next_word + " : " + str(prob))

For the selected sequence we get the following result

getting : 0.10526315789473684
shutting : 0.05263157894736842
kind : 0.05263157894736842
mabel : 0.05263157894736842
growing : 0.05263157894736842
a : 0.10526315789473684
the : 0.10526315789473684
really : 0.05263157894736842
thought : 0.05263157894736842
on : 0.05263157894736842
said : 0.05263157894736842
removed : 0.05263157894736842
off : 0.05263157894736842
more : 0.05263157894736842
collected : 0.05263157894736842
what : 0.05263157894736842

We can extend this by repeatedly choosing one of the predicted words as the second word of the next sequence. That is, we can do another run with ("be","collected"), and so on.
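One way to sketch this idea is to sample each next word according to the computed probabilities. The generate function below is just one possible sketch using random.choices, reusing the model m built above:

import random

# generate n_words words by repeatedly sampling from the trigram model,
# starting from a two-word seed such as ("must", "be")
def generate(seed, model, n_words=20):
    w1, w2 = seed
    out = [w1, w2]
    for _ in range(n_words):
        if (w1, w2) not in model:
            break  # dead end: the bigram never occurs in the corpus
        words = list(model[(w1, w2)].keys())
        probs = list(model[(w1, w2)].values())
        next_word = random.choices(words, weights=probs)[0]
        out.append(next_word)
        w1, w2 = w2, next_word
    return " ".join(out)

print(generate(("must", "be"), m))

Since random.choices samples in proportion to the weights, frequent continuations show up more often in the generated text.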
