
Pretrained Vector Approach

07.01.2021 | Processing/Language Models/PyTorch

Contents/Index

@1. Pretrained Vector Approach
2. Learning Vector Approach

Let's build a language model with PyTorch. The model is a quite simple feed-forward network. The idea is the same as for POS-tagging: we train a neural network built from linear transformations, where the final layer is activated by a log softmax in order to obtain a distribution. However, language models are based on a window of $n$ words (what is called an n-gram). So instead of labeling each word, we label $n$ words at a time. Given the text:

I have three white mice in my left shoe

and given $n = 3$, we obtain the n-grams:

I have three
have three white
three white mice
white mice in
mice in my
...
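A minimal sketch of producing these windows from a list of words (the variable names are only illustrative):

words = "I have three white mice in my left shoe".split()
n = 3
windows = [words[i:i + n] for i in range(len(words) - n + 1)]
# [['I', 'have', 'three'], ['have', 'three', 'white'], ['three', 'white', 'mice'], ...]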

As with the POS-tag problem we need to encode each word. In order for the model to generalize well, these encodings have to model similarities between words. We have two choices: we can use pretrained vectors, or we can add a layer for training embeddings. In this article we use GloVe pretrained vectors; in the next article we add the embedding layer.

For some $n$ we use a window of $n$ words as the input and the next word in the text as the target/label. In the above example we have

I have three -> white

So the idea is to encode the $n$ words on the left of the arrow (here three words), concatenate the encodings, run the concatenation through two hidden layers/linear transformations, compute the loss, and use it during training.

First we import data and create the n-grams:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)

from data_loader import tell_tale
from load_vecs import glove
import math
import numpy as np

data = np.array(tell_tale(False))
split_ix = math.floor(len(data) * 0.9)
train_data = data[:split_ix]
test_data = data[split_ix:]

vocab = list(set(data))
vocab_size = len(vocab)
window_size = 3

ngrams = [([train_data[j] for j in range(i, i + window_size)], train_data[i + window_size])
          for i in range(len(train_data) - window_size)]
ngrams_test = [([test_data[j] for j in range(i, i + window_size)], test_data[i + window_size])
               for i in range(len(test_data) - window_size)]

# The loss takes a distribution of vocab size as values
# and an index into this distribution as target.
# Predicting words needs to be based on indices,
# hence we need these mappings.
word_to_ix = {vocab[i]: i for i in range(vocab_size)}
ix_to_word = vocab

# emb_dim is the size of the embedding vectors
emb_dim = 50
w2vec = glove(emb_dim)

# words not present in the GloVe data set, mapped to synonyms that are
synonyms = {
    "causeless": "excessive",
    "unperceived": "anonymous",
    "scantlings": "girders"
}
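The glove function imported above comes from a small helper module that is not shown here. A minimal sketch, assuming the standard glove.6B.<dim>d.txt files from the GloVe project (one word per line followed by its vector components), could look like this:

def glove(dim, path_fmt="glove.6B.{}d.txt"):
    # Assumed file layout: each line is "<word> v1 v2 ... v<dim>".
    vecs = {}
    with open(path_fmt.format(dim), encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs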

Note a couple of things. The emb_dim is the size of the embedded words. The first linear transformation in the upcoming model therefore takes input vectors of dimension emb_dim * window_size = 50 * 3 = 150. These vectors are transformed twice, the last time into vectors the size of the vocabulary, denoted $|V|$. We apply a log softmax so that the resulting vector is the log of a probability distribution. Note also synonyms: the keys of this dictionary are words that are not present in the GloVe data set, so synonyms found online are used in their place.
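Putting the dimensions together, the upcoming model computes the following, where $x$ is the concatenated input, $W_1, b_1$ and $W_2, b_2$ denote the weights and biases of the two linear layers, and $128$ is the hidden size chosen below:

$$h = \mathrm{ReLU}(W_1 x + b_1), \qquad x \in \mathbb{R}^{3 \cdot 50}, \; W_1 \in \mathbb{R}^{128 \times 150}$$
$$u = W_2 h + b_2, \qquad W_2 \in \mathbb{R}^{|V| \times 128}$$
$$\hat{y} = \log \mathrm{softmax}(u) \in \mathbb{R}^{|V|}$$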

We define the model along with a helper function that concatenates the encodings of the words in a window, as follows:

class NGramLangModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, context_size):
        super(NGramLangModel, self).__init__()
        self.lin1 = nn.Linear(context_size * emb_dim, 128)
        self.lin2 = nn.Linear(128, vocab_size)

    def forward(self, embs):
        h = F.relu(self.lin1(embs))
        u = self.lin2(h)
        logits = F.log_softmax(u, dim=1)
        return logits

def concat_words(w_window):
    # Look up the vector of each word in the window (replacing words
    # missing from GloVe with their synonyms) and concatenate them.
    retval = []
    for w in w_window:
        if w in synonyms:
            w = synonyms[w]
        retval += w2vec[w]
    return torch.tensor(retval)

We have two linear transformations. The first takes the $n$ concatenated embeddings and outputs a vector of size $128$; this is just some number that has to match the input size of the second transformation, and we do not go into tuning this dimension right now. We activate the first transformation with ReLU and store the result as $h$. This result is transformed into $u$, which has log softmax applied to it. That is it.
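As a quick sanity check (a sketch, not part of the original code), we can run a random input of the right size through a freshly constructed model and confirm that the output is a log distribution over the vocabulary:

# Dummy forward pass: one concatenated context of size window_size * emb_dim
dummy = torch.randn(1, window_size * emb_dim)
check_model = NGramLangModel(vocab_size, emb_dim, window_size)
out = check_model(dummy)
print(out.shape)             # torch.Size([1, vocab_size])
print(out.exp().sum(dim=1))  # sums to ~1, since out is a log distribution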

We define the needed objects along with the hyper-parameters as follows:

loss_fun = nn.NLLLoss()
model = NGramLangModel(vocab_size, emb_dim, window_size)
optimizer = optim.SGD(model.parameters(), lr=0.001)

# hyperparams
n_epochs = 20

For the loss we use nn.NLLLoss(). This loss function matches the log softmax in the model; the two can be combined into a single step by using nn.CrossEntropyLoss on the raw scores instead. We have a learning rate of $\eta = 0.001$. And we are set to go.
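As a small aside (not part of the original code), the equivalence can be checked on a dummy score vector; the names here are only illustrative:

u = torch.randn(1, vocab_size)                     # raw scores (logits)
target = torch.tensor([0], dtype=torch.long)       # some target index
a = nn.NLLLoss()(F.log_softmax(u, dim=1), target)
b = nn.CrossEntropyLoss()(u, target)
print(torch.allclose(a, b))                        # True

The training is coded as: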

for epoch in range(n_epochs):
    total_loss = 0
    for ctx, target in ngrams:
        ctx_embs = concat_words(ctx).view((1, -1))
        model.zero_grad()
        log_probs = model(ctx_embs)
        loss = loss_fun(log_probs,
                        torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    n0 = len(ngrams)
    print(total_loss / n0)

Note that we print the average loss, averaged over the number of n-grams. We can print the performance with the following code:

correct = 0
fails = ""
for ctx, target in ngrams_test:
    ctx_embs = concat_words(ctx).view((1, -1))
    pred = model(ctx_embs)
    pred = ix_to_word[pred.argmax().item()]
    if pred == target:
        correct += 1
    else:
        fails += pred + " != " + target + "\n"
print(str(correct) + "/" + str(len(ngrams_test)))
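A small refinement (not in the original code) is to run the evaluation without gradient tracking; a sketch of the same loop wrapped in a function:

def evaluate(model, ngrams_test):
    # Switch to eval mode and disable gradient tracking while predicting
    model.eval()
    correct = 0
    with torch.no_grad():
        for ctx, target in ngrams_test:
            ctx_embs = concat_words(ctx).view((1, -1))
            pred_ix = model(ctx_embs).argmax().item()
            correct += int(ix_to_word[pred_ix] == target)
    model.train()
    return correct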

For me the last three losses along with the score are:

5.085335830067875
5.051957758111811
5.018320765671314
17/213

There is room for optimizing performance. We miss some n-grams; that is, the last word is only included once. In the above example we miss:

left shoe [pad]
shoe [pad] [pad]
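One way to include them (a sketch, not in the original code) is to pad the end of the data with a [pad] token before building the n-grams; the token would need an embedding of its own, here simply a zero vector:

# "[pad]" gets a zero vector as embedding. If it can occur as a target,
# it must also be added to the vocabulary before the model is built.
w2vec["[pad]"] = [0.0] * emb_dim
padded = list(train_data) + ["[pad]"] * window_size
ngrams_padded = [([padded[j] for j in range(i, i + window_size)], padded[i + window_size])
                 for i in range(len(padded) - window_size)]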

We can of course play with the encoding, as we will see in the next article. And lastly, the training set is quite sparse.
