2. Deriving the model

3. Multinomial Logistic Regression

Logistic regression is the task of discriminatively classifying a set of inputs. That is, given a label set of size two, say consisting of bikes and cars, the logistic model tries to learn what separates these two classes; wheel count might be a useful feature here. The model only knows about cars and bikes through the features that distinguish them. Formally, a discriminative model, given a class $c$ and a document $d$, attempts to compute $p(c|d)$ directly.

Logistic regression is also the basis of neural networks. A single-layer network with one neuron is exactly logistic regression, where the output is the probability of assigning the "positive" one of two labels to the input. Adding more labels to the model yields what is called multinomial logistic regression.

Logistic regression is a supervised learning model for classification. In general terms, the model works with the following setup:

- Let $\mathcal{X}$ be the sample space of observations. We refer to a single observation by subscripts, that is $x_i \in \mathcal{X}$. Each $x_i$ is a representation of some value in the actual input domain. We refer to the $j$'th element of $x_i$ by superscript, that is $x_{i}^{j}$ for the $j$'th feature of $x_i$.
- Let $\hat{y}$ be the estimated value, given as a probability. Denote the label space by $\{+,-\}$. Then $\hat{y}$ is the probability of $+$ given some input $x$. This probability is computed by applying the sigmoid function to the output of a linear regression model; we call this the activation step. For a set of weights $w$ and a bias $b$, we have $$ \hat{y} = \sigma(w \cdot x_i + b) $$ Since the label space/set of classes is $\{+,-\}$, we have that $$ p(y = + | x) = \hat{y} $$ and that $$ p(y = - | x) = 1 - \hat{y} $$
- Let $y$ be the true observed label.
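The setup above can be sketched in a few lines of Python. The weights, bias, and observation here are hypothetical values chosen only for illustration:

```python
import math

def sigmoid(z):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    # activation step: z = w . x + b, then apply the sigmoid
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)

# hypothetical weights, bias, and a single observation x_i
w = [0.5, -1.2]
b = 0.1
x_i = [2.0, 1.0]

y_hat = predict(w, x_i, b)   # p(y = + | x)
print(y_hat, 1 - y_hat)      # p(+|x) and p(-|x) sum to 1
```

Note that the two printed probabilities always sum to 1, so the output is a valid distribution over the two labels.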

During training we need a loss function to measure performance; for that, the next chapter introduces the cross-entropy loss.

Using vector notation, logistic regression is trained to optimize the weight matrix $w$ and the bias term $b$ in the equation $$ z = w \cdot x + b $$ This setup is the same as for linear regression. As an example, let the weight matrix consist of the 4 values $$ w = [2,1,-4,-2] $$ We can observe that feature 0 has a positive impact, in fact the largest one. Features 2 and 3 have a negative impact, with feature 2 having the most negative impact.

Note that the weight matrix $w$ has width equal to the number of features and height equal to 1. This matrix acts as what is called a linear transformation: by matrix multiplication it maps a vector of one dimension to a vector of another dimension. In the example above it maps a vector of dimension 4 to a vector of dimension 1, that is, a scalar. Hence for logistic regression the complexity of the model depends on the number of features. In the multinomial approach (described at the bottom of this page and in the last chapter) we have more than 2 labels, and hence the model complexity depends on two parameters: the number of features and the number of labels.
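A quick sketch with NumPy makes the shapes concrete, using the example weights from above; the $1 \times 4$ matrix transforms a 4-dimensional vector into a 1-dimensional one:

```python
import numpy as np

# a 1-by-4 weight matrix: one row per output, one column per feature
w = np.array([[2.0, 1.0, -4.0, -2.0]])
x = np.array([1.0, 1.0, 1.0, 1.0])   # an observation with 4 features

z = w @ x                  # linear transformation: R^4 -> R^1
print(w.shape, z.shape)    # (1, 4) (1,)
print(z)                   # [-3.]  since 2 + 1 - 4 - 2 = -3
```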

In order to turn the above setup into a classifier we activate it with the sigmoid function: $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$ We have that $\sigma(z) \in (0,1)$. This function has the following important behavior:

- If we have $z \lt 0$, then $0.0 \lt \sigma(z) \lt 0.5$
- We have that $\sigma(0) = 0.5$
- If we have $0 \lt z$, then $0.5 \lt \sigma(z) \lt 1.0$
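The three cases above can be checked directly; this is a minimal sketch, with the inputs $-3$ and $3$ chosen arbitrarily:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# negative input -> below 0.5, zero -> exactly 0.5, positive -> above 0.5
print(sigmoid(-3))   # < 0.5
print(sigmoid(0))    # 0.5
print(sigmoid(3))    # > 0.5
```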

So given the example $w$ above, with $b = 1$ and $x_1 = [1,1,1,1]$, we have $$ z_1 = w \cdot x_1 + b = 2 + 1 - 4 - 2 + b = -3 + b = -2 $$ and hence $0.0 \lt \sigma(z_1) \lt 0.5$. So $\sigma(z_1)$ assigns the highest probability to the negative label.
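This worked example can be reproduced in code, using exactly the weights, bias, and observation from the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [2, 1, -4, -2]
b = 1
x1 = [1, 1, 1, 1]

z1 = sum(wj * xj for wj, xj in zip(w, x1)) + b   # 2 + 1 - 4 - 2 + 1 = -2
y_hat = sigmoid(z1)
print(z1, y_hat)   # z1 = -2, so y_hat < 0.5: the negative label wins
```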

The above is an example of a binary logistic model. Binary here means that we have two possible outcomes, for example *car* or *bike*, 0 or 1, and so on. In the later chapters we will look at how to create models using PyTorch that are not binary; this approach is called multinomial logistic regression. The approach is the same as for the binary model, except that we use softmax for activation instead of sigmoid. In this way the output is still a probability distribution.
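To preview the multinomial case, here is a minimal sketch of softmax; the three class scores are hypothetical, chosen only to show that the output is a distribution over all classes:

```python
import math

def softmax(z):
    # subtract the max for numerical stability, then exponentiate and normalize
    m = max(z)
    exps = [math.exp(zk - m) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical scores for three classes, e.g. car, bike, bus
z = [2.0, 1.0, 0.1]
probs = softmax(z)
print(probs, sum(probs))   # non-negative values summing to 1
```

As with the sigmoid, larger scores map to larger probabilities, but now across any number of classes rather than just two.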
