Thursday, March 22, 2018

machine learning pipelines and testing vs validation

What's a typical machine learning pipeline? Here's my guess:

• Some entires may be incomplete; either toss those or find a way to complete them.
• there could be other preprocessing, like tossing outliers or data that is obviously from sensor malfunction. I don't like the "tossing outliers" idea so much ... you have to be careful with "cleaning" your data to make your analysis works out ...
• now feature extraction. Some features may be redundant, the data set might be too big, some of the data might be identified as unnecessary, or maybe the data doesn't obviously describe features we want to study so we want to put it in a human-understandable form.
• if the machine learning algorithm is supervised, label the data with "ground truth."
• split up the data as needed into training and validation sets.
• choose appropriate machine learning algorithm and appropriate initial conditions/parameters.
• throw training data into machine learning algorithm and validate with validation data according to whatever process you've chosen. May also validate via other methods (like other machine learning algorithms?).
• Hopefully it works! If not, revise process: cleaning; feature extraction; machine learning technique; or, worst of all, hyperparameter tuning (the learning algorithm was supposed to do this all on its own, wasn't it?). Or, maybe you learn that you don't have enough data for your learned model to stabilize, so you need more data. Iterate on process as needed until you have a validated model.
• Now the pipeline is set up. To make predictions about new data, clean and preprocess, feature extract, and put into machine learning algorithm to find the predictions. Interpret as needed.

So I wrote this all at once, following my intuition from a data science perspective, and now I'm looking back. What's missing?
• There's no mention of testing the code to see if it's working properly, only validating the results. Some unit/integration tests would be nice. I supposed validation might be more important than making sure the algorithm is implemented correctly, but still... you don't want to have unreproducible results.
• There's no discussion on how to find an "appropriate" machine learning method or initial conditions/parameters.
• Do we need to weight certain data differently than other data? Do we need regularization? Are these just parameters that we assume are part of the algorithm?
• There's no planning stage. You probably need to plan what algorithms you're going to use before you do anything, including cleaning (maybe your algorithm is robust to outliers so you don't have to "clean" those out! "Data cleaning" gives me the heebie jeebies.).
• Furthermore, there's no mention of the application and the desired outcomes: the specific application will have an effect on cleaning, feature selection/extraction, and algorithm choice.
• There's no step about feature selection. You have to watch out for collinearity and whether your features are actually predicting anything useful.
A lot of the above is probably wrapped up in the "iterating" step or can be summed up by "need to better explain planning, testing, and choosing parameters (including algorithms and input features)." Those seem slightly harder than the rest of the pipeline to me. The planning and choosing parameters because you don't necessarily know beforehand what setup is going to work best, and the testing because, well, testing is sometimes hard. I've literally never come across anyone talking about how they test their machine learning code (only how they validate). There are probably data sets out there with known results for certain (basic) algorithms because people make tutorials about learning machine learning.

Maybe, for say, a neural network, there are some simple inputs (step functions on one input at a time?) where you can trace the effect they have on the whole network to make sure the network edges are set up properly. You might set all weights to 1 to make verification easier. This is my default for testing linear systems so it isn't necessarily all we need to test a nonlinear system, but it's a start. Could you use just step function inputs with all weights set to 1 (and biases set to... 0? something else) to verify that your network topology is as you expect? Hmmm, this actually feels like a kind of interesting side of machine learning algorithms: not just how you write them, but how you test them. Are the algorithms easy to test? What makes an implementation easier to test? Does anyone need such a test for network topology, since there are libraries to set up neural nets? Perhaps if someone got into pruning neural nets (using L1 regularization, maybe?) and they wanted to see what the output of their algorithm was...

NB: Why do people call it "multicollinearity?" Why does "collinearity" have  two "L"s? Shouldn't it just be "co" and "linearity" put together into "colinearity," where the "multi" and the second "L" are redundant? What am I missing?

Wednesday, March 21, 2018

neural nets: brute force?

I don't think neural nets are brute force nor black boxes, but that's what I've heard some people say. So that's what I thought for a little while...

Neural nets seem like brute force because you generally need huge training sets to get good results out of them (no comment on what "huge" and "good results" mean).

Now, to make a comparison between brute force and neural nets, I do want to first "define" what brute force is. To me, brute force for a neural network would be: Take all possible inputs and all possible configurations of the neural net parameters. Find the error associated to each neural net parameter set by computing the error for each input and then combining those errors into a total error for that parameter set (how do we measure and combine errors? I don't know, but I hope it's continuously). Do this for all parameter sets and choose a parameter set with a minimal error.

Caveat: does a minimum exist? An infimum does, but whether that's attainable is up to analysis. Neural nets are continuous maps from input to output, and for fixed input the map from parameters to error should be continuous, and we should probably define the parameters on a closed domain (but that depends on the applications I guess). Is that domain compact? That's really up to the application... So "brute force" doesn't always even give us a parameter set unless we have a compact domain for the parameters. Ew... well unimportant. The point is that neural nets are not brute force, even if they can have issues with stabilizing at a set of parameters. That's something to look into: When do optimal parameters for a neural net exist?

One argument I've heard against the "brute force" accusation (strong word choice much?) is using the example of image processing: How big is the configuration space for an image, and do we think a neural net is using all that information?

configuration space of a 720p rgb image

If you look at just one pixel, you have to specify the red, green, and blue components, each of which can take integer values between 0 and 255 inclusive. That give 256^3=16,777,216 different colors that each pixel can represent.

If you have 1280 by 720 pixel images then you get 921,600 pixels total. So, for a 720p rgb image, we're looking at 16,777,216^921,600 configurations. Which is a lot. For context, apparently there are 10^82 atoms in the known, observable universe [1]. So, do we really think neural nets are working through every conceivable image and updating accordingly?

Well, no one is training neural nets on 16,777,216^921,600-image training sets. But usually the images of interest do not span that whole configuration space; they probably cluster in a couple of areas. Perhaps if you restrict to just those areas, neural nets are basically brute-forcing it.

So maybe this argument is a bust? I can't really finish it satisfactorily at this point in time.

looking at the algorithm of neural nets

An argument that makes better sense to me for why neural nets are not brute force: neural nets are actually tracing a path through their configuration space and looking for a local minimum of a cost function. That in no way implies they are trying every conceivable combination of parameter values. They're only looking at (a discrete sampling of) a 1-dimensional piece of a definitely more than 1-dimensional parameter space! Neural nets don't always even reach the optimal configuration because they get stuck on local minima of the cost function. Brute force would have at least gotten the global minimum (again, if it exists...)!

in conclusion...

Neural nets are not brute force. They may seem like it because you start a neural net with "no" (ie, comparatively little) information about the system it's predicting and then throw a whole lot of training data at it to make it work. And some neural nets do work really, really well. But even though they're "slow" to train, they're not as slow as brute force. And even though neural nets can work really, really well, there are rarely (never? only sometimes?) guarantees that they produce (or even approach) optimal parameters, whereas brute force should be perfect in that regard.

And "brute force" isn't even a good analogy for neural nets! It's more like the neural nets make fewer assumptions, and so they need more training than models that make more assumptions (and thus start off kind of "pre-trained"). They're still a kind of heuristic, without all the glorious guarantees of optimal parameters and horrendously long computation times that come attached to brute force.

functional programming: from confusion to warm and fuzzy feelings

When I first heard of functional programming, it was explained at "stateless." From the object-oriented perspective I was raised with, this seemed crazy. How could you make something stateless? How do you organize everything? Well, you just pass information from function to function. But still, I didn't get it. What do you mean functions always do the same thing, no matter what you pass in? You can pass in all sorts of different information as input? What?

Then, Dmitri Vagner said that functional programming is just a categorical way of thinking about programming, and it all made sense. Of course functions don't update state! They're just functions, and they have an input and an output but they don't change the input; they just give a new object (object?) that is the output. A program just has a bunch of compositions of functions to get done what you need to get done, and you can say things about the compositions of functions without talking about specific inputs since you can say things about all the inputs at once (it's a function. It inherently knows about its domain and codomain. functions are lots of times the sole focus of study because functions hold all of the information, while objects can't even tell how they relate to one another!). Of course functional programing is (maybe? sometimes?) more efficient, since you can use your functor for anything in the category you're working in (and some functors work for many categories!). Of course you don't need state for that. You set up a useful system and to do a specific example, you just plug in the input and get some output.

And the immutable data of functional programming is not so scary; it's just different. Instead of updating an object (and possibly losing track of what you have and haven't done to it) you just make another one. And immutable data is just about the opposite of a global (mutable) variable, which is the bane of debugging and one of my biggest pet peeves. Macaulay2, I will never forgive you for having definition by equals sign give a global variable. It is sacrilege! But functional programming, you're newly added to my todo list. You're kind of near the end because you're not necessary at the moment, but you're on there.

Tuesday, March 20, 2018

linear regression by least squares

plot training data and the hyperplane defined by the linear regression output ($$Y=X\beta$$) (extend the $$X$$ vector by one dimension to include an extra $$1$$ to correspond to the constant term). Draw the errors, which connect points in the training data $$(x_i,y_i)$$ to the points that would be predicted by the model $$(x_i,x_i\beta)$$ (when X and Y are scalar, these are vertical lines on the graph). We want to minimize the sum of the squares of these errors, which you can do by taking the derivative of the sum of squares with respect to $$\beta$$, setting that to zero, and looking at the solutions (and endpoints) to find the smallest one (which does exist because all values are non-negative).

Note: $$x_i$$ input vectors are augmented with a 1 inserted into the first index of the vector. This is so the offset coefficient, $$\beta_0$$, can be included in the matrix notation. Then we can consider this as a linear problem instead of an affine one, and the only sacrifice seems to be just increasing the dimension by 1.

classification using a linear model

For using a linear model for classification, say of two categories, assign one class as 1 and the other as 0 (for n>2 classes, I assume you start using the standard basis vectors $$e_1,\ldots,e_n$$ to denote each class, but then a hyperplane only splits Euclidean space into two parts...). Use that as the Y for the input X in your training data.

What we want is a hyperplane in the same space as X that separates the Y=0 and Y=1 points (which is a hyperplane of dimension dim(X)-1). When you do linear regression on this training data, you get a hyperplane in ambient dimension dim(X)+dim(Y)=dim(X)+1 (thus this hyperplane is dimension dim(X), which is not what we want). To find the hyperplane in the ambient space of X that tries to separate the data, find the hyperplane $$\{X : X\beta=0.5\}$$. This takes our linear regression hyperplane down one dimension to dim(X)-1, and is still linear.

Why did we choose $$X\beta=0.5$$ in particular? What's special about 0.5? Well it's halfway between Y=0 and Y=1. And, it's the same cutoff we use for our predictions: if $$X\beta > 0.5$$, then that point is classified as Y=1; else, Y=0.

the issue: overfitting

A neural network is a model, which begs the question: when is the model susceptible to overfitting?

(Overfitting is when a model fits a training set extremely well, to the point that it learns the idiosyncrasies of the data chosen, including the specific errors. As a result the model fares poorly on data outside the training set; ie, it doesn't generalize well.)

"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." - John von Neumann, exclaiming that using four or more parameters in a model almost always leads to overfitting.

So is four parameters too many parameters for a neural net? I hope not, because that would make a very constrictive upper bound on the size of useful neural nets.

the solutions

Obviously researchers have determined that neural nets with more than even five parameters can still be useful. So, why is that, or what tricks do people use to prevent overfitting?

choosing an appropriate network topology

If you make a neural network too big for the applications at hand your neural network might overfit your training set. Finding the appropriate size and topology of a neural net without trial and error sounds hard, though. But at least neural nets with more than five neurons have been useful, so von Neumann's bound for overfitting doesn't seem to apply.

regularization

This involves adding terms to the cost function. For Lp regularization (generally with p=1 or p=2) add terms $$\lambda||w_i||_p$$ to the cost function for each of the weight parameters $$w_i$$ with regularization strength $$\lambda$$. L2 regularization prevents any one weight from becoming too large, because apparently having "peaky" weights is bad. L1 regularization just keeps the total of the absolute values of the weights from becoming too big, which tends to drive some of the weights to almost zero. That's kind of cool, that L1 regularization essentially sparsifies the neural net. Apparently L2 regularization works better, though.

Notice that we didn't regularize all the parameters; we left out the bias parameters. Some justifications I've seen are that

• there are so few bias parameters compared to weight parameters that it doesn't matter
• bias parameters aren't multiplicative on the inputs so they don't matter as much?
• tests show that adding bias regularization doesn't hurt performance but it doesn't help either, so we should just leave bias regularization out of our neural nets.

early stopping

Implementing early stopping in your machine learning algorithm means you have some rule that you check at regular intervals while training. The rule tells you whether it's ok for the model to keep training or whether you need to stop training now, despite still having more training data to use.

An example of an early stopping rule is holdout-validation. Essentially, you break you training set up into smaller training and validation sets. At regular intervals, you test some of the validation sets and record how much "generalization error" (under what metric I have no idea) the model has at that point. At some point, the pattern of the recorded validation set errors will indicate that overtraining has begun (the simplest check is that once the error increases instead of decreasing, you've hit overtraining). Take the last version of the model before you hit overfitting as your final model.

I've heard that some people add noise to training data in order to prevent overfitting. I don't remember what this technique is called or exactly what it entails. Maybe I'll find it soon.

Saturday, March 17, 2018

resources

Some resources that have been recommended to me or I've found for machine learning:

Visualizers

A Neural Network Playground visualizes a neural net and lets you fiddle with parameters and topology. By Daniel Smilkov and Shan Carter.

Blogs

Data Science and Robots a blog with mostly video content about basic machine learning topics by Brandon Rohrer.

Textbooks

Neural Networks and Deep Learning a free online textbook by Michael Nielsen.

Elements of Statistical Learning 2nd ed (ESL) is a reference on machine learning topics and assumes statistical/mathematical background. By Hastie, Tibshirani and Friedman. Can be found free online.

Introduction to Statistical Learning: With Applications in R (ISL) is a reference on machine learning topics that is apparently simpler than ESL. By James, Witten, Hastie and Tibshirani. Can be found free online.

Handbook: Statistical foundations of machine learning by Gianluca Bontempi is supposed to have many fewer prerequisites than the ISL or ESL. Can be found free online.

neural nets: black boxes?

Almost 10 years ago I heard neural networks were essentially black boxes: once trained, their parameters to meant nothing to humans. They were weird algebraic (not even algebraic?) combinations of (possible human-meaningful) features, but there was no way to know which features were involved and what combinations of the features contributed to each parameter.

Definition: deep neural network. A neural network with more than one hidden layer.

Now with all the deep learning research, I see that there's a meaningful structure to neural nets. Each hidden layer is like its own (non-deep) neural net, and spits out a vector of features. Those features are the input to the next layer, and that second layer combines them to get more complex features. So you end up with this hierarchy, where the first layer makes the simplest features and they combine and build up over the layers to make more and more complex features, until the output of the neural net is a feature vector representing the most intricately detailed features of all.

I saw an example somewhere (can't find it now...) of a neural net with 3 hidden layers where the researchers somehow got the first layer to represent lines and curves of different shapes, the second hidden layer assembled these curves into the facial features we're all familiar with, and the final hidden layer produced full faces, each of which was significantly different from the next. That's a great example of how the layers of a neural net build upon one another.

Definition(?): deep learning. A machine learning paradigm where there are many layers or levels of complexity, where each level builds upon the one preceding it to add a level of abstraction. Deep neural networks are an example of deep learning.

stuff i still don't know

I don't know what kind of tuning these facial-recognition researchers did to get such human-meaninful features out of their neural net. I don't know if that's a common thing, or if this was a nice example that doesn't generalize. I would guess that it's the latter, and that a lot of the parameters that neural networks decide on are essentially meaningless to us outside of the context of the rest of the neural net.

I also don't know what added complexity a neural network can handle if it's "wide" (as opposed to deep).  For comparison, if adding more layers (and thus turning from a neural network to a deep neural network) adds a hierarchy to the feature extraction in neural nets, what does adding width do? That is, what happens when you add more nodes to a layer?

Thursday, March 15, 2018

my introduction to neural networks

The first encounter: My advisor said neural networks were slow and couldn't accomplish complicated behavior way back in 2010. No comment on whether that was true then or is true now.

From my second discussion about neural nets I "learned" that neural nets are essentially brute force.

Darryl Wade was my third neural nets discussion partner and the first person to describe neural nets to me as a rich topic, the first to explain them as more than a trendy black box. Thanks, Darryl.

what's a neural network?

Say you have some data and some features you want to extract from that data. If you put together a training set of data (input data + their corresponding features/ground truth) you can (try to) set up a neural network to do your feature extraction for any data you collect in the future.

The set up is this: the neural network is made of nodes organized into "layers." You have some input nodes, which map into the first layer of nodes, which map into the next layer of nodes ... which map into the output nodes. Nodes in the same layer do not map into each other.

A deep neural network just has more than one layer of internal/hidden/latent (not input or output) nodes.

Each node takes in input, does something non-linear to it, and passes the result on to the next node. The parameters of the neural net are what define the non-linear functions in each node. When you put training data through the neural net you compare the output (which represents the features) to the ground truth and use the difference of the two to tweak the parameters of the neural net.

Ok, but really, how do you tweak those parameters? As it trains, the neural network traces a path through the parameter space using hill climbing. The hill climbing is optimizing for a cost function (distance between output and ground truth). Apparently this has a fancy name ... backpropagation. Why do we need a fancy neural nets name for hill climbing? This idea will need more exploration.

questions that arise from this "definition"

For a system that I originally thought was grab and go (just set up your cost function and flip the switch!), there sure are a lot of choices to be make before you start training. How does choosing all of the following affect the predictive value of final parameters and the time needed to reach parameters that are "good enough?"
• initial conditions
• ordering of training data
• step size for hill climbing (once you've picked the direction you're going to move in the parameter space, how far do you go?)
• nonlinear functions in each node
• metric for comparing neural net output to training data ground truth
• And what might interest me the most (as of now): topology of the neural net
And there are probably more ...

ideas for future posts and more questions
• Neural nets are not brute force.
• How does backpropagation work and what problem is it fixing?
• What are other ways to use neural nets?
• What do the layers do? hierarchical feature extraction?
• What happens when the topology of the neural net has directed cycles?
• Can you start with an overkill neural net (lots of layers and lots of nodes) and then prune it?
• Are there ever substitutions/refactorings of the nodes that give you the same output with related response to training? Maybe you could condense a neural network or refactor it into one that's better understood.

first steps

This is a place to keep notes on interesting ideas I come across and questions those ideas elicit. The motivation for starting a blog now is the existential "crisis" I'm wandering into as I try to find a new place (field?) to work post-PhD. Now, at the start, it looks like the focus of the blog will be the interplay of geometry, topology, and machine learning, possibly with some specific examples in perception. But the topics are mostly dependent on where curiosity lures me.