@2. Deriving the model

3. Multinomial Logistic Regression

Given a sample space $\mathcal{X}$ and a label space $\mathcal{Y} = \{0,1\}$ we want to learn a model that best as possible predicts $\hat{y} = \sigma(w \cdot x + b)$ for each $x$.

Here we go through the mathematical reasoning given the above example. Later we can instead use the label space $\mathcal{Y} = \{1,-1\}$. This approach is outlined at the end of this page.

We need a metric for expressing how close the output $\hat{y}$ is to the true value $y$. Denote this $$ L(\hat{y},y) $$

We can derive this loss function given the example spaces above. We have the outcome/label space of $\{0,1\}$, this is the Bernoulli distribution. Let $p(y|x)$ be the probability of the correct label. Remember that we have $$ p(y = 1 | x) = \hat{y} $$ and that we have $$ p(y = 0 | x) = 1 - \hat{y} $$ We can combine these to get $$ p(y | x) = \hat{y}^{y}(1 - \hat{y})^{1 - y} $$ Note here that if $y = 1$, then the probability $p(y | x)$ simplifies to $\hat{y}$, that is $$ \hat{y}^1 (1 - \hat{y})^{1 - 1} = \hat{y} $$ If $y = 0$, then $p(y | x)$ simplifies to $1 - \hat{y}$. We do log on both sides, this will help us later when doing manipulation to these values. We get $$ log\ p(y | x) = log [\hat{y}^{y}(1 - \hat{y})^{1 - y}] = y\ log(\hat{y}) + (1 - y) log(1 - \hat{y}) $$ This result describes a log likelihood that should be maximized. We turn it into a loss, a quantity that should be minimized, by multiplying with $-1$ on both sides, hence we have $$ - log\ p(y | x) = -(y\ log(\hat{y}) + (1 - y) log(1 - \hat{y})) $$ We paste in the definition of $\hat{y} = \sigma(w \cdot x + b)$ in order to obtain the cross entropy loss: $$ L_{CE}(w,b) = -(y\ log(\sigma(w \cdot x + b)) + (1 - y) log(1 - \sigma(w \cdot x + b))) $$ Note how we now have a loss function that depends on the parameters $w$ and $b$. $x$ and $y$ are values obtained from the training set. These are known.

Our goal is to minimize the value $L_{CE}(w,b)$ with respect to the parameters $w$ and $b$. That is minimize the average loss over all the inputs and the labels: $$ \hat{\theta} = argmin_{\theta} \frac{1}{m} \sum_{i = 1}^{m} L_{CE}(y_i,x_i; \theta) $$ Here $\theta$ is a composite of the two parameters, that is $\theta = w,b$. The derivative of $$ L_{CE}(w,b) = -(y\ log(\sigma(w \cdot x + b)) + (1 - y) log(1 - \sigma(w \cdot x + b))) $$ is given as $$ \frac{\partial L_{CE}(w,b)}{\partial w^j} = [\sigma(w \cdot x + b) - y] x^j $$ The proof of this is found below.

We do gradient descent by the following steps for each pair $(x_i,y_i)$ in the training set:

- Compute the gradient of the loss for $x_i,y_i$ with respect to $\theta$, this is given as $$ [\sigma(w \cdot x + b) - y] x^j $$ for $w^j$. For $b$ we compute the gradient as $$ \sigma(w \cdot x + b) - y $$ We can see this gradient for $b$ as extending the target function as $$ f(x) = w^1 x^1 + \dots + b x^b $$ where $x^b = 1$ is a constant. These are stored as $$ g = [(\sigma(w \cdot x + b) - y) x^1 , \dots , \sigma(w \cdot x + b) - y]^{\top} $$ Equivalently we can obtain $g$ as $$ g = [x^1 , \dots , x^n, 1]^{\top} (\sigma(w \cdot x + b) - y) $$ since $(\sigma(w \cdot x + b) - y)$ is a scalar.
- $g$ is used to compute a step in the right direction using some learning rate $\eta$, that is for time stamp $t + 1$ we have $$ \theta_{t + 1} = \theta_{t} - \eta \cdot g $$

We repeat these two steps until we converge, until some threshold for $g$ is reached, or until some maximum number of iterations is reached. We can also stop when $g$ is getting larger again.

We can implement the above in Python with the following example. We have a set of points in $2d$ space, hence 2 features. Labels are defined as if some point is within or not some distance of the point $(0,0)$. We import numpy, define some labeling function, create test and training sets, and define $\sigma$ and the gradient of the loss as follows:

import numpy as np # label points based on dist from origo def pt2label(pt): dist_orig = np.sqrt(pt.dot(pt)) if dist_orig <= 0.5: return 0 else: return 1 #create points where coords are in [0,1] x_train = np.random.rand(1000,2) y_train = [pt2label(x) for x in x_train] x_test = np.random.rand(1000,2) y_test = [pt2label(x) for x in x_test] # define sigma def sigma(x): return 1 / (1 + np.exp(-x)) # define the function that returns the gradient to the loss def delta_loss(x,y,ws): # note that we append 1 to the input point x0 = np.append(x,1) b = ws[2] w = ws[:2] # return a vector of dimension 3 return (sigma(w.dot(x) + b) - y) * x0Next we define the gradient descent function as follows:

def do_grad(): l_rate = 0.1 # initialize weights # these include b ws = np.array([0.0 for _ in range(3)]) # stop if max_iter is reached max_iter = 10000 # if the average length of g is smaller than eps, then stop eps = 0.055 for i in range(max_iter): avg_grad = 0.0 for (x,y) in zip(x_train,y_train): grad = delta_loss(x,y,ws) ws -= l_rate * grad # compute grad length and add it to avg_grad avg_grad += np.sqrt(grad.dot(grad)) avg_grad /= x_train.shape[0] if i % 100 == 0: print(str(i) + ":" + str(avg_grad)) if avg_grad <= eps: print("has converged") break w = ws[:2] b = ws[2] # return a function that takes some point as input # and outputs the probability p(y = 1 | x) return lambda x: sigma(w.dot(x) + b)Lastly we create the model and test it on the test set:

model = do_grad() correct = 0 for (x,y) in zip(x_test,y_test): # use round to convert to label y_hat = round(model(x)) if y_hat == y: correct += 1 print(str(correct) + "/" + str(x_test.shape[0]))We can mess around with the l_rate parameter along the termination step in order to optimize. I have the following result of a run:

0:0.2750872037085482 100:0.05642907919044801 has converged 970/1000We need the two calculus results: $$ \frac{d}{dx} ln(x) = \frac{1}{x} $$ That is the derivative of the $ln$ function. Furthermore the sigmoid function has this very straight forward derivative: $$ \frac{d \sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)) $$ We first find the derivative of the loss function with respect to a single weight, $w^j$, as: $$ - \left [ \frac{\partial}{\partial w^j} y\ log(\sigma(w \cdot x + b)) + \frac{\partial}{\partial w^j}(1 - y)log(1 - \sigma(w \cdot x + b)) \right ] $$ Here we have used that the derivative distributes. We use the chain rule in this expression to obtiain $$ - \frac{y}{\sigma(w \cdot x + b)} \sigma(w \cdot x + b)' - \frac{1 - y}{1 - \sigma(w \cdot x + b)} (1 - \sigma(w \cdot x + b))' $$ where ' denotes derivative. Rearranging terms we get $$ - \left [ \frac{y}{\sigma(w \cdot x + b)} - \frac{1 - y}{1 - \sigma(w \cdot x + b)} \right ] \sigma(w \cdot x + b)' $$ The expression within the [ ] brackets can be reduced to $$ \frac{y - \sigma(w \cdot x + b)}{\sigma (w \cdot x + b)(1 - \sigma(w \cdot c + b))} $$ By using the derivative of the sigmoid function along the chain rule we get $$ \sigma(w \cdot x + b)' = \sigma(w \cdot x + b)(1 - \sigma(w \cdot x + b)) x_j $$ Multiplying these two we have cancellation, and we get $$ - [y - \sigma(w \cdot x + b)] x_j $$ or equivalently $$ [\sigma(w \cdot x + b) - y] x_j $$

We can do yet a derivation of the loss function for labels in the set $\{1,-1\}$. Of course we could just remap the labels to fit the above derivations. But for the sake of whatever. The setup of everything else is the same as above. We have that $$ p(y = 1 | x) = \hat{y} $$ and that $$ p(y = -1 | x) = 1 - \hat{y} $$ Here we can exploit a quite cool fact about the sigmoid function, namely that $$ \sigma(-s) = 1 - \sigma(s) $$ So we can just multiply by $y$ to obtain $$ p(y | x) = \sigma(y (w \cdot x + b)) $$ We do the log loss transformation to obtain $$ L_{CE}(w,b) = - log\ \sigma(y (w \cdot x + b)) $$ We can find the gradient with respect to $w,b$ as thus. Denote the derivative by '. Denote $y(w \cdot x + b)$ by $\alpha$. By applying the chain rule and the derivative of $log$ we have $$ L_{CE}(w,b)' = - \frac{1}{\sigma(\alpha)} \cdot \sigma(\alpha)' $$ By the derivative of $\sigma$ and the chain rule we have that $$ \sigma(y (w \cdot x + b))' = \sigma(\alpha)(1 - \sigma(\alpha)) \cdot y x $$ We substitute in order to obtain $$ L_{CE}(w,b)' = - yx \cdot (1 - \sigma(\alpha)) $$ Or equivalently $$ L_{CE}(w,b)' = - yx \cdot \sigma(- \alpha ) $$ We can substitute one last time to obtain $$ L_{CE}(w,b)' = - yx \cdot \sigma(- y (w \cdot x + b) ) $$

CommentsGuest Name:Comment: