Linear regression using python

statistics/python :: 09-12-2017

OK. So I have started an introductionary course in machine learning. Machine learning is, somewhat, making models using regression. And the first example involves linear regression. I have long pondered on effective ways to do this kind of things on a computer. Finally I have found one. So lets delve. Regression is fitting a functions values as close to a set of observed values as possible. Hence taking the mean of the distance of all function/model values to the appropriate observed value. Let say we have a bunch of observed values: $$obsXs = [1.2,2.0,3.3,4.0,4.9]$$ $$obsYs = [2.5,3.5,4.0,4.5,5.0]$$

Linear regression easily can be done using maple, simply by typing the commands:

with(Statistics)
obsXs := Vector([1.2, 2.0, 3.3, 4.0, 4.9])
obsYs := Vector([2.5, 3.5, 4.0, 4.5, 5.0]) Fit(a*x+b, obsXs, obsYs, x, summarize = embed)

And we get the result $$0.633138751683879x+1.94993264481365$$ with the rather good R-squarred $0.965109$

But what is interresting is of course what is going on under the hood. So lets have a look. First define the squarred error as $$sqrErr = (y_n - f(x_n))^2$$ That is the squarred distance from a value from our model, f, and to the observed value. Second lets define a prototype for our model, that is a basic linear function $$f(x) = w_1 x + w_0$$ where $w_1,w_0$ are the unknowns we are interested in finding optimally: we want the mean squarred error to be as small as possible: $$argmin_{w_0,w_1} 1/n \sum_{n = 1} sqrErr(w_0,w_1)$$

These kind of problems are solved by differentiating and solving the result equal 0. First rewrite: $$1/n \sum (y_n - f(x_n))^2 = 1/n \sum (y_n^2 - f(x_n))^2 = 1/n \sum y_n^2 + f(x_n)^2 - 2y_n f(x_n)$$

Substitute: $$1/n \sum y_n^2 + f(x_n)^2 - 2y_n f(x_n) =\\ 1/n \sum y_n^2 + (w_1 x_n)^2 + w_0^2 + 2 w_1 w_0 x_n - 2y_n (w_1 x_n + w_0)$$

Now differentiate: $$\frac{\partial sqrErr}{\partial w_1} = 1/n \sum 2 w_1 x_n^2 + 2w_0 x_n - 2y_n x_n$$ And $$\frac{\partial sqrErr}{\partial w_0} = 1/n \sum 2w_0 + 2w_1 x_n - 2y_n$$

Now solve first the last expression equal 0: $$0 = 1/n \sum 2w_0 + 2w_1 x_n - 2y_n \Rightarrow w_0 = - 1/n \sum w_1 x_n + y_n$$ This can be seen as: $$\hat{w_0} = \bar{y} - w_1 \bar{x}$$ Where the bar denotes the mean value, and the hat denotes argmin. The first derivative equal 0: $$0 = 1/n \sum 2w_1 x_n^2 + 2w_0 x_n - 2y_n x_n \Rightarrow \\ 0 = 1/n \sum w_1 x_n^2 + w_o x_n - y_n x_n \Rightarrow \\ 0 = 1/n \sum (w_1 x_n^2 + \bar{y} x_n - w_1 \bar{x} x_n) - \bar{yx} \Rightarrow \\ 0 = w_1 (\bar{x^2} - (\bar{x})^2) + \bar{y} \bar{x} - \bar{yx} \Rightarrow \\ \hat{w_1} = \frac{\bar{yx} - \bar{y} \bar{x}}{ \bar{x^2} - \bar{x}^2 }$$

What we need is to check is that the second derivative is bigger than 0. But it is. Now we have a formula for doing linear regression. Lets apply with pyhton to the two vectors from above. First we define some aux functions:

def initA(f,l): retval = [] for i in range(0,l): retval.append(f(i)) return retval def foldA(f,a,acc): l = len(a) for i in range(0,l): acc = f(a[i],acc) return acc def zipA(f,a,b,acc): l = len(a) for i in range(0,l): acc = f(a[i],b[i],acc) return acc

Then the two vectors

obsXs = [1.2,2.0,3.3,4.0,4.9] obsYs = [2.5,3.5,4.0,4.5,5.0]

So we have to find the mean of the following values $$x, x^2, y, yx$$

Lets go:

obsXs = [1.2,2.0,3.3,4.0,4.9] obsYs = [2.5,3.5,4.0,4.5,5.0] len_xs = len(obsXs) len_ys = len(obsYs) mean_x = foldA(lambda x,acc: x / len_xs + acc,obsXs,0) mean_y = foldA(lambda y,acc: y / len_ys + acc,obsYs,0) mean_x2 = foldA(lambda x,acc: x**2 / len_xs + acc,obsXs,0) mean_xy = zipA(lambda x,y,acc: (x * y) / len_xs + acc,obsXs,obsYs,0) w1 = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2) w0 = mean_y - w1 * mean_x print "f_model1(x) = " + str(w1) + "x + " + str(w0)

Although this alone is pretty cool, lets find that squarred r thing. According to this WikiPage we can see that we have to calculate the following numbers:

def f_model1(x): return w1 * x + w0 ss_tot = foldA(lambda y,acc: (y - mean_y) ** 2 + acc,obsYs,0) ss_res = zipA(lambda x,y,acc: (y - f_model1(x)) ** 2 + acc,obsXs,obsYs,0) print "r^2 = " + str(1 - ss_res / ss_tot)

Don't mind the ss-thing. This is still pretty cool!