PyTorch for learning the XOR Problem

16.11.2020 | Neural Networks/Simple Feedforward

We can build a neural network using the PyTorch library. This library supports several different types of neural networks. Let's start with the simplest one: a feedforward network. We want to build a neural network (from now on just nn) for the XOR problem. To do this we need to reformulate the Python NumPy solution into one that fits PyTorch.

We start with the imports, among them torch:

import torch as ts
import numpy as np

A PyTorch tensor is essentially a multi-dimensional array with some header information, such as its data type and the device it lives on. We can also see it as a NumPy vector/matrix; that is, we treat tensors as vectors/matrices we can do operations on. Compared with the approach used in the XOR-problem article, we can simplify: instead of one vector of boolean values, the program input is a matrix with all possible boolean combinations. The output is then a vector with one result for each of these combinations (there are 4). Hence the input is a matrix of dimension $4 \times 2$ (one row per combination), and the output is a vector of dimension $4$.

Let's start by defining the input and the output. The output holds the true values; we name them labels.

source = [[0,0],[0,1],[1,0],[1,1]]
source = [[float(x) for x in xs] for xs in source]
source = ts.tensor(source)
labels = [0,1,1,0]
labels = [float(x) for x in labels]
labels = ts.tensor(labels)
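A quick way to inspect what such a tensor holds is to look at its dtype and shape. The sketch below is only an illustration (the names `ints` and `floats` are not part of the program above). It shows that a list of Python ints becomes an integer tensor, while passing `dtype` yields a float tensor in one step:

```python
import torch as ts

# A list of Python ints is inferred as a 64-bit integer tensor.
ints = ts.tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
print(ints.dtype)    # torch.int64

# Passing dtype converts while constructing the tensor.
floats = ts.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=ts.float32)
print(floats.dtype)  # torch.float32
print(floats.shape)  # torch.Size([4, 2]): four input pairs, two values each
```

The list comprehension used above gives the same result; the `dtype` argument is just a shorter alternative.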

Tensors usually need to be float unless defined otherwise: the weights are floats, and the matrix multiplication below expects matching types. We simply convert the Python list of ints to a Python list of floats, and then create a tensor from this list. Next we need the do_xor function.

relu = ts.nn.ReLU()

def do_xor(xs, w, hbs, yws):
    layer_h = xs @ w + hbs
    out_h = relu(layer_h)
    layer_y = out_h @ yws
    out_y = relu(layer_y)
    return out_y

The @ operator is Python's matrix multiplication operator, which torch tensors support. We skip the bias for the output, that is yb. In the original problem it is set to 0 anyway, and we want to keep things simple here. As can be seen, $ReLU$ is part of torch. We can define the rest using the weights and biases from the previous approach:

def do_xor_ref(xs):
    # reference values
    hw0 = ts.tensor([[1.0,1.0],[1.0,1.0]])
    hb0 = ts.tensor([0.0,-1.0])
    yw0 = ts.tensor([1.0,-2.0])
    return do_xor(xs,hw0,hb0,yw0)

print(do_xor_ref(source).tolist())

And we have the list $[0,1,1,0]$ printed. The .tolist() method converts a tensor into a plain Python list.
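To see why the reference weights give exactly this list, the same forward pass can be traced without torch. The following is a minimal sketch in plain Python; the helper names `relu` and `forward` are hypothetical and only mirror the steps of do_xor for a single input pair:

```python
def relu(values):
    # Element-wise max(0, x) on a plain list.
    return [max(0.0, v) for v in values]

def forward(x, hw, hb, yw):
    # Hidden layer: x @ hw + hb, written out as explicit sums.
    pre_h = [sum(x[i] * hw[i][j] for i in range(len(x))) + hb[j]
             for j in range(len(hb))]
    out_h = relu(pre_h)
    # Output layer: dot product with yw, no bias, followed by ReLU.
    pre_y = sum(h * w for h, w in zip(out_h, yw))
    return max(0.0, pre_y)

hw = [[1.0, 1.0], [1.0, 1.0]]
hb = [0.0, -1.0]
yw = [1.0, -2.0]
inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

outputs = [forward(x, hw, hb, yw) for x in inputs]
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```

For the input $[1,1]$, for instance, the hidden layer gives $[2,1]$ and the output is $2 \cdot 1 + 1 \cdot (-2) = 0$.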

Now for training: given some $hw$ and $hb$, the weights and biases of the hidden layer, we want to use gradient descent to optimize their values. We do this by initializing them, using them to compute a prediction $\hat{y}$, and then comparing it to the true output, here $labels$. We could extend training to also include the output weights $yw$, and thereby train an extra layer. But we skip this for now in order to keep it simple and understandable.

We need a helper function for initializing values. We use random floating point samples from the half-open interval $[0,1)$ and multiply them by 2, so the initial values lie in $[0,2)$:

def init_val():
    retval = 2 * np.random.random_sample()
    return retval

The last thing we need besides the actual training is a loss function. We need to keep track of the size of the error between $\hat{y}$ and the true output $labels$. For this we use the mean squared error function:

def error(pred, target):
    return ((pred - target) ** 2).mean()
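As a small sanity check, the hand-written loss can be compared with the mean squared error that ships with torch. This is only a sketch: the prediction values are made up for illustration, and `F.mse_loss` is used with its default mean reduction.

```python
import torch as ts
import torch.nn.functional as F

def error(pred, target):
    return ((pred - target) ** 2).mean()

# Hypothetical prediction: only the first of the four outputs is off, by 0.5.
pred = ts.tensor([0.5, 1.0, 1.0, 0.0])
target = ts.tensor([0.0, 1.0, 1.0, 0.0])

manual = error(pred, target)
builtin = F.mse_loss(pred, target)

print(manual.item())   # 0.0625, i.e. 0.5**2 / 4
print(builtin.item())  # the same value
```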

Good. As stated, we train by gradient descent on the error between the predicted values and the true values. The training function is as follows:

def train(n_epochs,b_size,source,labels):
    # this is used for iterations
    s1 = range(source.shape[1])
    # factor with which we do grad_dec
    grad_dec_fact = 0.001
    hw = None
    hb = None
    yw = ts.tensor([1.0 for _ in s1],requires_grad=False)
    loss = None
    best_loss = 3000.0
    for i_epoch in range(n_epochs):
        hw0 = ts.tensor([[init_val() for _ in s1] for _ in s1],requires_grad=True)
        hb0 = ts.tensor([init_val() for _ in s1],requires_grad=True)
        yw0 = yw
        for _ in range(b_size):
            pred = do_xor(source, hw0, hb0, yw0)
            loss = error(pred, labels)
            loss.backward()
            hw0.data -= grad_dec_fact * hw0.grad.data
            hb0.data -= grad_dec_fact * hb0.grad.data
            hw0.grad.zero_()
            hb0.grad.zero_()
        if best_loss > loss:
            best_loss = loss
            hw = hw0
            hb = hb0
        print("loss[i_epch:" + str(i_epoch) + "] = " + str(loss.item()))
    return best_loss,hw,hb,yw

Both hw0 and hb0 are initialized randomly for each epoch. Since the initialization has a large impact on the final loss value, we have epochs: that is, we start all over with differently initialized values. The yw's are all 1 here. One can experiment with different values, but I found 1 to work quite well. Both $h$-layer values are defined as tensors with the flag requires_grad set to true. With this, the gradient is very easy to obtain using torch. After computing the loss we run .backward() on it, and every tensor that has the flag set and was used in computing the loss gets a .grad attribute holding its gradient. The only requirement I have found is that the loss is a scalar value.
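The mechanics of requires_grad, .backward() and .grad can be seen in isolation on a one-parameter toy problem. The sketch below is not part of the XOR program; the function $(2w - 4)^2$ and the variable names are chosen only for illustration:

```python
import torch as ts

w = ts.tensor(3.0, requires_grad=True)

# The loss is a scalar, so .backward() needs no arguments.
loss = (2.0 * w - 4.0) ** 2
loss.backward()
first = w.grad.item()
print(first)        # 8.0, since d/dw (2w - 4)^2 = 4 * (2w - 4)

# A second backward pass adds to the stored gradient instead of replacing it.
loss = (2.0 * w - 4.0) ** 2
loss.backward()
accumulated = w.grad.item()
print(accumulated)  # 16.0

# Resetting the gradient in place, as done in the training loop.
w.grad.zero_()
print(w.grad.item())  # 0.0
```

The accumulation in the second step is the reason the training loop has to zero the gradients after every update.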

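We obtain the gradient and do the descent step in the two statements that subtract grad_dec_fact times the gradient from hw0.data and hb0.data. We then clean up the gradients with the two .grad.zero_() calls, since torch otherwise accumulates gradients across calls to .backward(). For each epoch we compare the loss with the best loss so far and, if it is lower, update the return values accordingly. b_size is what I call the batch size, although it is really the number of gradient descent iterations we run for each initialization.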
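Updating through .data works, but a commonly used alternative is to wrap the update in `torch.no_grad()`, so that the update itself is not recorded for differentiation. The sketch below uses a hypothetical one-parameter problem, minimizing $(w - 3)^2$, rather than the XOR network; `lr` plays the role of grad_dec_fact:

```python
import torch as ts

w = ts.tensor(0.0, requires_grad=True)
lr = 0.1  # step size, chosen for this toy problem only

for _ in range(200):
    loss = (w - 3.0) ** 2
    loss.backward()
    with ts.no_grad():
        # In-place update of the parameter, invisible to autograd.
        w -= lr * w.grad
    # Clear the gradient before the next iteration.
    w.grad.zero_()

print(round(w.item(), 4))  # 3.0
```

The structure is the same as in the inner loop of train: compute the loss, call .backward(), take a step against the gradient, and reset the gradient.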

Next we define some hyperparameters and do the training:

# hyper parameters
b_size = 20000
n_epochs = 4

# do training
loss,hw,hb,yw = train(n_epochs,b_size,source,labels)

The actual model is contained in hw, hb and yw. Lastly we print the obtained values:

print("final loss = " + str(loss.item()))
print("------------------------------------------------")
print("test run:")
print(do_xor(source,hw,hb,yw).tolist())
print("ref run:")
print(do_xor_ref(source).tolist())

A run on my computer has the following result:

loss[i_epch:0] = 0.0005381641094572842
loss[i_epch:1] = 0.0003000512660946697
loss[i_epch:2] = 0.0004829175886698067
loss[i_epch:3] = 0.2500002086162567
final loss = 0.0003000512660946697
------------------------------------------------
test run:
[0.023878801614046097, 0.9773322939872742, 0.9916484355926514, 0.0067838989198207855]
ref run:
[0.0, 1.0, 1.0, 0.0]

As can be seen, the result is quite good. The model is of course not very general: it can only solve this one specific problem. But it does show how to train a nn.
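The printed test run can be checked with a few lines of plain Python. In the sketch below the values are copied from the run above; the threshold of 0.5 for turning outputs into truth values is my own assumption and not part of the program:

```python
preds = [0.023878801614046097, 0.9773322939872742,
         0.9916484355926514, 0.0067838989198207855]
labels = [0.0, 1.0, 1.0, 0.0]

# Threshold the continuous outputs to get XOR truth values.
bits = [1 if p >= 0.5 else 0 for p in preds]
print(bits)  # [0, 1, 1, 0]

# Mean squared error of the printed predictions.
mse = sum((p - t) ** 2 for p, t in zip(preds, labels)) / len(preds)
print(mse)   # close to the reported final loss of about 0.0003
```

The value is not exactly the reported final loss. One reason visible in the code is that the stored loss is computed before the last update of hw0 and hb0, while the test run uses the values after it.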

In this example PyTorch is only really used for computing gradients. Later I will show how the library can be used to further abstract away tedious details.