## Tuesday, March 20, 2018

### linear regression by least squares

Plot the training data and the hyperplane defined by the linear regression output ($$Y=X\beta$$), extending each input vector by one dimension with an extra $$1$$ to account for the constant term. Draw the errors (residuals), which connect each training point $$(x_i,y_i)$$ to the point the model would predict, $$(x_i,x_i\beta)$$ (when X and Y are scalar, these are vertical line segments on the graph). We want to minimize the sum of the squares of these errors. To do that, take the derivative of the sum of squares with respect to $$\beta$$, set it to zero, and solve. There are no endpoints to check, since $$\beta$$ ranges over all of Euclidean space, and because the sum of squares is a convex quadratic in $$\beta$$, any solution is a global minimum. Non-negativity alone only gives a lower bound; what guarantees a minimizer is the quadratic form, and the minimizer is unique when the columns of $$X$$ are linearly independent.
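
Spelling out the derivative step, as a sketch: write the sum of squared errors as $$\mathrm{RSS}(\beta)$$ (a shorthand introduced here, not used elsewhere in these notes), with $$X$$ the matrix whose rows are the augmented $$x_i$$ and $$Y$$ the vector of the $$y_i$$.

$$
\mathrm{RSS}(\beta) = \sum_i (y_i - x_i\beta)^2 = (Y - X\beta)^\top (Y - X\beta)
$$

$$
\frac{\partial\,\mathrm{RSS}}{\partial \beta} = -2\,X^\top (Y - X\beta) = 0
\quad\Longrightarrow\quad
X^\top X\,\beta = X^\top Y
$$

Assuming $$X^\top X$$ is invertible (equivalently, the columns of $$X$$ are linearly independent), this gives the single solution

$$
\hat\beta = (X^\top X)^{-1} X^\top Y
$$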

Note: each input vector $$x_i$$ is augmented with a 1 inserted at the first index, so that the offset coefficient, $$\beta_0$$, can be included in the matrix notation. This lets us treat the problem as linear instead of affine, and the only cost is increasing the dimension by 1.
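
A minimal sketch of the augmentation and the fit, assuming NumPy is available. The data points are made up for illustration, and `np.linalg.lstsq` is just one convenient way to get the least-squares $$\beta$$, not necessarily how you would do it by hand.

```python
import numpy as np

# Hypothetical scalar training data (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 8.0])

# Augment each input with a 1 at the first index, so that
# beta[0] plays the role of the offset and beta[1] the slope.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X beta ~ y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Errors: the vertical gaps between each y_i and the prediction x_i beta.
residuals = y - X @ beta

print("beta:", beta)            # approximately [0.7, 2.2] for this data
print("residuals:", residuals)
```

At the minimum, `X.T @ residuals` is zero up to rounding, which is the "derivative equals zero" condition written in matrix form.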

### classification using a linear model

To use a linear model for classification with, say, two categories, encode one class as 1 and the other as 0 (for n>2 classes, I assume you start using the standard basis vectors $$e_1,\ldots,e_n$$ to denote each class, but then a hyperplane only splits Euclidean space into two parts...). Use that encoding as the Y for each input X in your training data.

What we want is a hyperplane in the same space as X that separates the Y=0 and Y=1 points, which means a hyperplane of dimension dim(X)-1 (here dim(X) counts the original input coordinates, not the extra 1). When you do linear regression on this training data, you get a hyperplane in ambient dimension dim(X)+dim(Y)=dim(X)+1, so this hyperplane has dimension dim(X), which is not what we want. To find the hyperplane in the ambient space of X that tries to separate the data, take the set $$\{X : X\beta=0.5\}$$. This is the slice of the regression hyperplane at height Y=0.5, projected down onto the space of X. It has dimension dim(X)-1 and is still flat (an affine hyperplane, since it need not pass through the origin).

Why did we choose $$X\beta=0.5$$ in particular? What's special about 0.5? Well, it's halfway between Y=0 and Y=1. It's also the same cutoff we use for our predictions: if $$X\beta > 0.5$$, then that point is classified as Y=1; otherwise, Y=0.
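
Here is a small sketch of the 0.5 cutoff in action, again assuming NumPy and using made-up one-dimensional data. With a scalar input, the separating "hyperplane" has dimension dim(X)-1 = 0, so it is just a single point on the x-axis. The names `classify` and `boundary` are mine, for illustration only.

```python
import numpy as np

# Hypothetical 1-D inputs with class labels coded as 0 and 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Fit ordinary least squares on the 0/1 labels (inputs augmented with a 1).
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def classify(x_new):
    """Predict class 1 where x beta > 0.5, else class 0."""
    x_new = np.asarray(x_new, dtype=float)
    X_new = np.column_stack([np.ones(len(x_new)), x_new])
    return (X_new @ beta > 0.5).astype(int)

# Decision boundary: solve beta[0] + beta[1] * x = 0.5 for x.
boundary = (0.5 - beta[0]) / beta[1]

print("boundary:", boundary)
print("predictions:", classify([2.4, 2.6]))
```

For this symmetric data the boundary lands midway between the two groups, and inputs on either side of it get classified as 0 or 1 accordingly.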