One-Hot Vector Approach

21.12.2020 | Processing/POS Tagging/PyTorch

We want to build a model for part-of-speech (POS) tagging, using the Penn Treebank tagset as the target. We build the model with PyTorch. This approach extends the code found in Deep Learning With PyTorch. We train and test the POS tagger on The Tell-Tale Heart corpus.

We apply $log\ Softmax$ to the output of an affine layer in order to do (multinomial) logistic regression. First we tokenize the input data and transform each word into a one-hot vector. The output of the network then becomes $$ log\ Softmax(A x + b) $$ where $x$ is the one-hot encoded word. So let's go.
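As a small worked example of the formula above, the following sketch computes $log\ Softmax(A x + b)$ by hand in plain Python. The matrix `A`, the bias `b` and the helper `log_softmax` are made up for illustration and are not part of the tagger's code. Because $x$ is one-hot, $A x$ simply picks out one column of $A$.

```python
import math

def log_softmax(scores):
    # Subtract the maximum first for numerical stability;
    # this does not change the result.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

# Hypothetical example with 3 labels and 4 words (values chosen arbitrarily).
A = [[0.5, -1.0, 0.0, 2.0],
     [1.5, 0.0, -0.5, 0.0],
     [-1.0, 1.0, 1.0, 0.5]]
b = [0.1, -0.2, 0.0]
x = [0.0, 0.0, 1.0, 0.0]  # one-hot vector for the word with id 2

# Affine map A x + b, written out as explicit sums.
scores = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
          for row, b_i in zip(A, b)]

log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print(scores)       # column 2 of A plus b
print(log_probs)    # log of a probability distribution over the 3 labels
print(sum(probs))   # the probabilities sum to 1 (up to floating point error)
```

The label with the highest log-probability is also the label with the highest probability, so taking the argmax of the log softmax output is enough for prediction.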

First we import the needed libraries:

    import torch as ts
    import torch.autograd as autograd
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    ts.manual_seed(1)
    from data_loader import tell_tale
    import math
    import numpy as np

We load the data and split it into training and test sets by an 80%/20% ratio. Furthermore we create the vocabulary and the set of labels and obtain their sizes. set is often used for this: it removes duplicates by turning the list into a set.

    data = np.array(tell_tale(True))
    split_ix = math.floor(len(data) * 0.8)
    train_data = data[:split_ix]
    test_data = data[split_ix:]
    vocab = set(data[:,0])
    labels = set(data[:,1])
    vocab_size = len(vocab)
    labels_size = len(labels)
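To see the split and the deduplication in isolation, here is a minimal sketch on a handful of made-up (word, tag) pairs. The toy `data` list is an assumption standing in for the output of the corpus loader, which is not reproduced here.

```python
import math

# Hypothetical (word, tag) pairs standing in for the real corpus.
data = [("the", "DT"), ("old", "JJ"), ("man", "NN"), ("the", "DT"), ("eye", "NN")]

split_ix = math.floor(len(data) * 0.8)   # first 80% of the samples
train_data = data[:split_ix]
test_data = data[split_ix:]

vocab = set(word for word, _ in data)    # duplicates such as "the" collapse
labels = set(tag for _, tag in data)

print(len(train_data), len(test_data))
print(len(vocab), len(labels))
```

Note that the split is positional: the test set is simply the last fifth of the text, so it can contain words that never occur in the training part.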

Next we need some encoding functions. That is, we want to encode each word as a unique id, and do the same with each label. We also need to be able to obtain a label given its id.

    # the id's for a word and a label is an int
    # corresponding to its index
    word_to_ix = {}
    for word in vocab:
        word_to_ix[word] = len(word_to_ix)
    # here we just create a list
    ix_to_label = [x for x in labels]
    label_to_ix = {}
    for label in ix_to_label:
        label_to_ix[label] = len(label_to_ix)
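The same mapping can be sketched more compactly with enumerate. The toy label set below is made up, and sorting the labels first is an addition of this sketch, not something the code above does: the iteration order of a set of strings can differ between Python runs, so sorting gives ids that stay the same from run to run.

```python
# Hypothetical toy label set.
labels = {"NN", "DT", "JJ"}

# Sorting fixes the order, so each label gets the same id in every run.
ix_to_label = sorted(labels)
label_to_ix = {label: ix for ix, label in enumerate(ix_to_label)}

print(label_to_ix)
# Round trip: label -> id -> label
print(ix_to_label[label_to_ix["JJ"]])
```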

Next we define the class used for PyTorch training. Since we define the affine map within the class, PyTorch will take care of gradient calculations. nn.Linear in this case is an affine map that transforms an input of the vocabulary size into an output whose size is the number of unique labels. Remember the affine map is given as $$ y = A x + b $$ nn.Linear just initializes $A$ and $b$ randomly.

    class PosClassifier(nn.Module):
        def __init__(self,labels_size,vocab_size):
            # this is just std boiler plate
            # do it for all nn.Modules
            super(PosClassifier,self).__init__()
            # when we define the linear map here
            # torch will take care of grad calcs
            self.linear = nn.Linear(vocab_size,labels_size)

        def forward(self,one_vec):
            return F.log_softmax(self.linear(one_vec),dim=1)

The forward method defines the forward pass, that is, how the output is computed from the input. Here we just use the built-in log_softmax in order to obtain the logistic output. Note that softmax turns the input vector into a probability distribution. We take the log of this distribution.
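To check the shapes involved, the following sketch builds an nn.Linear with made-up toy sizes (6 words, 3 labels) and passes one one-hot vector through it. The sizes and the seed are assumptions chosen only for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, labels_size = 6, 3           # hypothetical toy sizes
linear = nn.Linear(vocab_size, labels_size)

# A has one row per label and one column per word; b has one entry per label.
print(linear.weight.shape)               # torch.Size([3, 6])
print(linear.bias.shape)                 # torch.Size([3])

x = torch.zeros(1, vocab_size)
x[0, 2] = 1.0                            # one-hot vector for word id 2

scores = linear(x)
log_probs = F.log_softmax(scores, dim=1)
print(log_probs.shape)                   # torch.Size([1, 3])
print(log_probs.exp().sum().item())      # approximately 1.0
```

With a one-hot input, the scores are just column 2 of the weight matrix plus the bias, which is why each word effectively owns one column of $A$.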

We define the input one-hot vectors along with the targets. The target is an int corresponding to the index of the target label.

    def make_1hot(word,word_to_ix):
        vec = ts.zeros(len(word_to_ix))
        vec[word_to_ix[word]] = 1
        # we need to enclose the one hot vector in another vector
        return vec.view(1,-1)

    def make_target(label,label_to_ix):
        return ts.LongTensor([label_to_ix[label]])

We define a function that can run the model. That is, given some sample word, we obtain a one-hot vector of this word. We run the vector through the model and obtain the log of a probability distribution over the possible labels. The element of this log-distribution with the highest score is the label our model predicts for the input.

    def run_model(model,sample):
        log_probs = None
        # use ts.no_grad since we do not need grads for this
        with ts.no_grad():
            one_vec = make_1hot(sample[0],word_to_ix)
            log_probs = model(one_vec)
        return log_probs

And we define the model.

    # define the model
    model = PosClassifier(labels_size,vocab_size)

We can now take some word and produce an output from the model. Since the model is in its initial state, the output is not very trustworthy, though it can be illustrative.

    test_run = run_model(model,test_data[0])
    print(test_data[0])
    print(test_run)
    print(ix_to_label[test_run.argmax()])

This results in the printout of

    ['my' 'PRP$']
    tensor([[-3.2335, -3.2217, -3.2651, -3.2513, -3.2672, -3.2094, -3.2222, -3.2647,
             -3.2076, -3.1755, -3.2294, -3.1778, -3.2845, -3.1849, -3.1992, -3.2128,
             -3.1833, -3.2517, -3.2057, -3.2179, -3.2059, -3.1813, -3.2523, -3.1847,
             -3.1948]])
    VB

So we have the word "my". Its true tag is "PRP$". The untrained model predicts the tag "VB" instead.
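The printed numbers can be read with a little arithmetic. The sketch below copies the log-probabilities from the printout above and compares them with the value a uniform guess over all labels would give. It uses only the standard library and is meant as a sanity check, not as part of the tagger.

```python
import math

# Log-probabilities copied from the printout above.
log_probs = [-3.2335, -3.2217, -3.2651, -3.2513, -3.2672, -3.2094, -3.2222,
             -3.2647, -3.2076, -3.1755, -3.2294, -3.1778, -3.2845, -3.1849,
             -3.1992, -3.2128, -3.1833, -3.2517, -3.2057, -3.2179, -3.2059,
             -3.1813, -3.2523, -3.1847, -3.1948]

n_labels = len(log_probs)
print(n_labels)                      # one entry per label
print(math.log(1 / n_labels))        # log-probability under a uniform guess

probs = [math.exp(lp) for lp in log_probs]
print(sum(probs))                    # close to 1, up to rounding in the printout

best_ix = max(range(n_labels), key=lambda i: log_probs[i])
print(best_ix, log_probs[best_ix])   # position and value of the highest score
```

All entries lie close to the uniform value, so the freshly initialized model is nearly indifferent between the labels, and the predicted tag is decided by small random differences in the initial weights.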

For training we need a loss function along with an optimizer. The loss function is found in the nn package and is key to building a model: we train by minimizing the loss over the training samples. Given the parameters $\theta$, the loss function $L$ and some point in time $t$, we calculate the parameters for the next time step as $$ \theta^{t + 1} = \theta^{t} - \eta \nabla_{\theta} L(\theta^{t}) $$ where $\eta$ is an appropriately chosen learning rate. For this task, and in general for classification, we use the cross entropy loss. Since the model already outputs log softmax values, we use the built-in nn.NLLLoss (negative log likelihood loss); applied to log softmax outputs it gives exactly the cross entropy. (nn.CrossEntropyLoss combines both steps and would instead expect the raw output $A x + b$.) The arguments of the loss function are the output vector of the model and the index of the true label. optim.SGD is an optimizer implementing stochastic gradient descent; the learning rate $\eta$ is passed as lr.

    loss_function = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)
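To make the update rule concrete, here is one gradient descent step carried out by hand in plain Python. It is a simplified sketch under stated assumptions: the only parameters are three bias values (as if $A x = 0$), and the helpers `log_softmax` and `nll` are written for this illustration rather than taken from PyTorch.

```python
import math

def log_softmax(scores):
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

def nll(scores, target):
    # Negative log likelihood of the target label under log softmax.
    return -log_softmax(scores)[target]

# Hypothetical parameters: only the bias b, for a fixed input with A x = 0.
theta = [0.0, 0.0, 0.0]
target = 1          # index of the true label
eta = 0.1           # learning rate

loss_before = nll(theta, target)

# For this loss the gradient w.r.t. the scores is softmax(scores) - one_hot(target).
probs = [math.exp(lp) for lp in log_softmax(theta)]
grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# One update: theta^{t+1} = theta^t - eta * grad
theta = [th - eta * g for th, g in zip(theta, grad)]
loss_after = nll(theta, target)

print(loss_before, loss_after)   # the loss decreases after the step
```

The step raises the score of the true label and lowers the others, which is what loss.backward() followed by optimizer.step() does for all parameters of the model.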

The training is done with a loop. We define a number of epochs, that is, the number of times training runs through the whole training set. For each epoch we loop through the training set. Before each sample we reset the gradient, since PyTorch accumulates gradients and we do not want the gradient history from past iterations. We maintain an average loss, averaging over the whole epoch. This is only used for printing. It does not affect the model.

    nr_epochs = 80
    for epoch in range(nr_epochs):
        avg_loss = 0
        for (word,label) in train_data:
            # reset grads
            model.zero_grad()
            # create one hot rep. of word
            one_vec = make_1hot(word,word_to_ix)
            # create target rep. of target label
            target = make_target(label,label_to_ix)
            # compute log softmax with current model state
            logits = model(one_vec)
            # compute loss of output against true label
            loss = loss_function(logits,target)
            # do gradient calcs and model update
            loss.backward()
            optimizer.step()
            # maintain avg loss
            avg_loss += loss.detach().numpy()
        print("avg-loss for epoch=" + str(epoch) + " : " + str(avg_loss / len(train_data)))
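Why the reset matters can be seen on a single scalar parameter. The sketch below is a stand-alone illustration with a made-up parameter `w`; it is not part of the training code.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3.0 * w).backward()
print(w.grad.item())     # 3.0, the gradient of 3*w w.r.t. w

(3.0 * w).backward()     # no reset: the new gradient is added to the old one
accumulated = w.grad.item()
print(accumulated)       # 6.0

w.grad.zero_()           # reset; plays the same role as model.zero_grad()
(3.0 * w).backward()
fresh = w.grad.item()
print(fresh)             # 3.0
```

Without the reset, each optimizer step would use the sum of the gradients of all samples seen so far instead of the gradient of the current sample.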

Lastly we define a tester. Again we use .argmax() to obtain the index of the element with the highest value. This is the predicted label.

    def test_the_thing(test_data):
        corrects = 0
        mis_preds = []
        for test_instance in test_data:
            res = run_model(model,test_instance)
            target0 = test_instance[1]
            sample0 = ix_to_label[res.argmax()]
            if target0 == sample0:
                corrects += 1
            else:
                mis_preds.append((test_instance[0],target0,sample0))
        print("----correct: " + str(corrects))
        print("----out of:" + str(len(test_data)))
        return mis_preds

We can run the test.

    print("##############################")
    mis_preds = test_the_thing(test_data)
    print(mis_preds)

A run on my system yields the following, with the last 10 epoch losses included:

    avg-loss for epoch=70 : 0.05913580874555902
    avg-loss for epoch=71 : 0.058163839813728066
    avg-loss for epoch=72 : 0.05722546411501006
    avg-loss for epoch=73 : 0.05631903097413809
    avg-loss for epoch=74 : 0.05544320834485118
    avg-loss for epoch=75 : 0.05459638538050818
    avg-loss for epoch=76 : 0.053777287565085284
    avg-loss for epoch=77 : 0.05298459170312727
    avg-loss for epoch=78 : 0.05221713833908471
    avg-loss for epoch=79 : 0.05147368836015671
    ##############################
    ----correct: 359
    ----out of:432
    [('visitors', 'NNS', 'NN'), ('led', 'VBN', 'NN'), ('showed', 'VBD', 'NN'),
     ('treasures', 'NNS', 'NN'), ('undisturbed', 'JJ', 'NN'), ('chairs', 'NNS', 'NN'),
     ('desired', 'VBN', 'NN'), ('here', 'RB', 'NN'), ('their', 'PRP$', 'NN'),
     ('fatigues', 'NNS', 'NN'), ('reposed', 'VBN', 'NN'), ('satisfied', 'JJ', 'NN'),
     ('convinced', 'VBN', 'NN'), ('singularly', 'RB', 'NN'), ('answered', 'VBN', 'NN'),
     ('cheerily', 'RB', 'NN'), ('chatted', 'VBN', 'NN'), ('familiar', 'JJ', 'NN'),
     ('ere', 'RB', 'NN'), ('getting', 'VBG', 'NN'), ('wished', 'VBN', 'NN'),
     ('gone', 'VBN', 'NN'), ('ached', 'VBN', 'NN'), ('fancied', 'VBN', 'NN'),
     ('ringing', 'VBG', 'NN'), ('chatted', 'VBN', 'NN'), ('ringing', 'VBG', 'NN'),
     ('became', 'VBD', 'NN'), ('continued', 'JJ', 'NN'), ('became', 'VBD', 'NN'),
     ('talked', 'VBD', 'NN'), ('freely', 'RB', 'NN'), ('get', 'VB', 'NN'),
     ('feeling', 'VBG', 'NN'), ('continued', 'JJ', 'NN'), ('gained', 'VBN', 'NN'),
     ('talked', 'VBD', 'NN'), ('fluently', 'RB', 'NN'), ('heightened', 'VBN', 'NN'),
     ('gasped', 'NNS', 'NN'), ('talked', 'VBD', 'NN'), ('vehemently', 'RB', 'NN'),
     ('argued', 'VBD', 'NN'), ('trifles', 'NNS', 'NN'), ('high', 'JJ', 'NN'),
     ('gesticulations', 'NNS', 'NN'), ('gone', 'VBN', 'NN'), ('paced', 'VBN', 'NN'),
     ('strides', 'NNS', 'NN'), ('observations', 'NNS', 'NN'), ('oh', 'UH', 'NN'),
     ('god', 'NNP', 'NN'), ('foamed', 'VBN', 'NN'), ('raved', 'VBN', 'NN'),
     ('grated', 'VBN', 'NN'), ('continually', 'RB', 'NN'), ('chatted', 'VBN', 'NN'),
     ('pleasantly', 'RB', 'NN'), ('was', 'NN', 'VBD'), ('possible', 'JJ', 'NN'),
     ('god', 'NNP', 'NN'), ('suspected', 'VBN', 'NN'), ('making', 'VBG', 'NN'),
     ('better', 'RBR', 'NN'), ('tolerable', 'JJ', 'NN'), ('those', 'DT', 'NN'),
     ('hypocritical', 'JJ', 'NN'), ('smiles', 'NNS', 'NN'), ('again', 'RB', 'NN'),
     ('villains', 'NNS', 'NN'), ('dissemble', 'JJ', 'NN'), ('here', 'RB', 'NN'),
     ('here', 'RB', 'NN')]
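The printed counts and the list of mispredictions can be summarized with a few lines of standard-library Python. The sketch below uses the counts from the run above and, to stay short, only the first six entries of the list; the variable names are chosen for this illustration.

```python
from collections import Counter

corrects, total = 359, 432            # counts from the run above
accuracy = corrects / total
print(round(accuracy, 3))             # 0.831

# First few (word, true tag, predicted tag) entries of the list above.
mis_preds = [('visitors', 'NNS', 'NN'), ('led', 'VBN', 'NN'),
             ('showed', 'VBD', 'NN'), ('treasures', 'NNS', 'NN'),
             ('undisturbed', 'JJ', 'NN'), ('chairs', 'NNS', 'NN')]

by_true_tag = Counter(true for _, true, _ in mis_preds)
by_predicted_tag = Counter(pred for _, _, pred in mis_preds)
print(by_true_tag.most_common())
print(by_predicted_tag.most_common())
```

Applied to the full list, the same tally shows that nearly every misprediction is the tag NN.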

I think I get the same ratio of 359/432 (about 83%) for as few as 60 epochs, maybe even fewer. The main point is that words like "visitors" are not present in the training data. For such a word the corresponding column of $A$ never receives a gradient, so no matter the number of epochs the model can only predict the right label by chance: the prediction depends on how that column happened to be initialized, together with the learned bias $b$.
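Which test words fall into this category can be checked directly. The following sketch does so on made-up (word, tag) pairs; with the real data one would pass the actual train and test splits instead.

```python
# Hypothetical (word, tag) pairs; the real data comes from the corpus loader.
train_data = [("the", "DT"), ("old", "JJ"), ("man", "NN"), ("was", "VBD")]
test_data = [("the", "DT"), ("visitors", "NNS"), ("was", "VBD"), ("chairs", "NNS")]

train_vocab = {word for word, _ in train_data}

# Test words whose one-hot position was never active during training.
unseen = [word for word, _ in test_data if word not in train_vocab]
print(unseen)
print(len(unseen) / len(test_data))   # share of test samples with unseen words
```

The share of unseen words gives a rough upper bound on how much of the error can be blamed on the one-hot representation rather than on too little training.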