## the issue: overfitting

A neural network is a model, which raises the question: when is the model susceptible to overfitting?

(Overfitting is when a model fits its training set so well that it learns the idiosyncrasies of that particular data, including its errors and noise. As a result the model fares poorly on data outside the training set; i.e., it doesn't generalize well.)
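To make that concrete, here is a small sketch in plain Python. Everything in it is made up for illustration: the data (true relationship y = x, training labels off by +1 or -1), the two models, and the helper names `fit_line`, `fit_interpolant`, and `mse`. It compares a 2-parameter straight line against a polynomial that passes through every training point.

```python
def fit_line(xs, ys):
    """Least-squares straight line: a 2-parameter model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point (one parameter per point)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the truth is y = x, but each training label is off by +1 or -1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 0.0, 3.0, 2.0, 5.0]
# Noise-free points that neither model saw during fitting.
test_x = [0.5, 1.5, 2.5, 3.5]
test_y = [0.5, 1.5, 2.5, 3.5]

line = fit_line(train_x, train_y)
interp = fit_interpolant(train_x, train_y)

for name, model in [("line", line), ("interpolant", interp)]:
    print(name,
          "train MSE:", mse(model, train_x, train_y),
          "test MSE:", mse(model, test_x, test_y))
```

On this toy data the interpolant gets a training error of zero but a test error of roughly 1.39, while the line has a training error of roughly 0.96 and a test error of roughly 0.04. The model that memorized the noisy labels is the one that generalizes worse.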

"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." - John von Neumann (as recounted by Enrico Fermi), quipping that a model with enough free parameters can be made to fit almost anything, so a good fit on its own proves little.

So are four parameters too many for a neural net? I hope not, because that would be a very restrictive upper bound on the size of useful neural nets.

## the solutions

Obviously researchers have determined that neural nets with more than even five parameters can still be useful. So, why is that, or what tricks do people use to prevent overfitting?

### choosing an appropriate network topology

If you make a neural network too big for the application at hand, it might overfit your training set. Finding the appropriate size and topology of a neural net without trial and error sounds hard, though. But at least neural nets with far more than five parameters have been useful, so von Neumann's bound for overfitting doesn't seem to apply.
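To get a feel for how quickly the parameter count grows with topology, here is a small sketch that counts the weights and biases of a fully connected feed-forward net. The function name `count_parameters` and the two example topologies are my own assumptions, picked only for illustration.

```python
def count_parameters(layer_sizes):
    """Count (weights, biases) in a fully connected feed-forward net.

    layer_sizes lists the number of neurons per layer, input layer first.
    """
    # Every neuron in one layer connects to every neuron in the next.
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias per neuron in every layer except the input layer.
    biases = sum(layer_sizes[1:])
    return weights, biases


# Hypothetical topologies, chosen only for illustration.
print(count_parameters([2, 3, 1]))      # (9, 4)
print(count_parameters([784, 30, 10]))  # (23820, 40)
```

Even the tiny 2-3-1 net already has 13 parameters, well past the elephant's five.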

### regularization

This involves adding penalty terms to the cost function. For Lp regularization (generally with p=1 or p=2), add a term $$\lambda|w_i|^p$$ to the cost function for each weight parameter $$w_i$$, where $$\lambda$$ is the regularization strength. L2 regularization penalizes large weights disproportionately, so it discourages any one weight from becoming too large, because apparently having "peaky" weights is bad. L1 regularization keeps the total of the absolute values of the weights from becoming too big, which tends to drive some of the weights to zero or almost zero. That's kind of cool: L1 regularization essentially sparsifies the neural net. Apparently L2 regularization usually works better in practice, though.
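Here is a minimal sketch of the penalty term, assuming the common convention of adding $$\lambda|w_i|^p$$ per weight. The names `lp_penalty` and `regularized_cost`, the example weight vectors, and the value of `lam` are all hypothetical.

```python
def lp_penalty(weights, lam, p):
    """Regularization term: lam * sum(|w|**p) over the weights."""
    return lam * sum(abs(w) ** p for w in weights)


def regularized_cost(data_cost, weights, lam, p=2):
    """Unregularized cost plus the Lp penalty on the weights."""
    return data_cost + lp_penalty(weights, lam, p)


# Two made-up weight vectors whose absolute values have the same total.
spread = [0.5, 0.5, 0.5, 0.5]
peaky = [2.0, 0.0, 0.0, 0.0]
lam = 0.1  # regularization strength, an arbitrary choice

for p in (1, 2):
    print("p =", p,
          "spread:", lp_penalty(spread, lam, p),
          "peaky:", lp_penalty(peaky, lam, p))
```

With p=1 both vectors are charged the same penalty, while with p=2 the peaky vector is charged four times as much as the spread-out one. That is the sense in which L2 discourages "peaky" weights and L1 is indifferent to them.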

Notice that we didn't regularize all the parameters; we left out the bias parameters. Some justifications I've seen are that

• there are so few bias parameters compared to weight parameters that it doesn't matter
• bias parameters don't multiply the inputs, so they don't amplify changes in the inputs the way weights do, and maybe don't matter as much?
• tests show that adding bias regularization doesn't hurt performance but it doesn't help either, so we should just leave bias regularization out of our neural nets.
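In code, leaving the biases out just means the penalty's gradient is added to the weight update and not to the bias update. This is a sketch under my own assumptions: plain gradient descent, an L2 penalty of the form lam * w**2 (whose gradient is 2 * lam * w), and a hypothetical function name `gradient_step`.

```python
def gradient_step(weights, biases, grad_w, grad_b, lr, lam):
    """One gradient-descent step with an L2 penalty on the weights only.

    grad_w and grad_b are the gradients of the unregularized cost.
    """
    # Weights: data gradient plus the gradient of lam * w**2.
    new_weights = [w - lr * (gw + 2 * lam * w)
                   for w, gw in zip(weights, grad_w)]
    # Biases: data gradient only, no regularization term.
    new_biases = [b - lr * gb for b, gb in zip(biases, grad_b)]
    return new_weights, new_biases


# With zero data gradients, only the penalty acts: the weights shrink
# toward zero and the bias is left untouched.
w, b = gradient_step(weights=[1.0, -2.0], biases=[0.5],
                     grad_w=[0.0, 0.0], grad_b=[0.0],
                     lr=0.1, lam=0.5)
print(w, b)
```

Each step multiplies every weight by (1 - 2 * lr * lam), which is why L2 regularization is also known as weight decay.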

### early stopping

Implementing early stopping in your machine learning algorithm means you have some rule that you check at regular intervals while training. The rule tells you whether it's ok for the model to keep training or whether you need to stop training now, even though you could still run more training iterations.

An example of an early stopping rule is holdout validation. Essentially, you break your training set up into a smaller training set and a validation set that the model never trains on. At regular intervals, you evaluate the model on the validation set and record how much "generalization error" the model has at that point (typically measured with the same cost function used for training, or with a task metric such as classification error). At some point, the pattern of the recorded validation errors will indicate that overtraining has begun (the simplest check is that once the error increases instead of decreasing, you've hit overtraining). Take the last version of the model before you hit overfitting as your final model.
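The holdout rule can be sketched as a generic loop. The function name, the `train_step` and `validation_error` callbacks, and the `patience` parameter are my own assumptions, and the "model" below is a stand-in with a scripted validation curve so the example runs on its own.

```python
import copy


def train_with_early_stopping(model, train_step, validation_error,
                              max_checks, patience=1):
    """Train until the validation error stops improving.

    train_step(model) trains in place for one interval;
    validation_error(model) returns the error on the held-out set.
    Returns the best model seen and its validation error.
    """
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    checks_without_improvement = 0
    for _ in range(max_checks):
        train_step(model)
        error = validation_error(model)
        if error < best_error:
            # New best: keep a snapshot to roll back to later.
            best_error = error
            best_model = copy.deepcopy(model)
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # validation error stopped improving
    return best_model, best_error


# Stand-in model: it only counts how many training intervals it has seen.
model = {"steps": 0}


def train_step(m):
    m["steps"] += 1


def validation_error(m):
    # Scripted curve: improves until step 4, then gets worse.
    return (m["steps"] - 4) ** 2 + 1


best, err = train_with_early_stopping(model, train_step, validation_error,
                                      max_checks=20)
print(best["steps"], err, model["steps"])  # 4 1 5
```

With `patience=1` this is the simplest check described above: stop the first time the validation error fails to improve, and return the snapshot from just before. A larger `patience` tolerates a few noisy checks before giving up.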