Thursday, March 22, 2018

machine learning pipelines and testing vs validation

What's a typical machine learning pipeline? Here's my guess:

  • Start with raw data. 
  • Some entries may be incomplete; either toss those or find a way to complete them. 
  • there could be other preprocessing, like tossing outliers or data that is obviously from sensor malfunction. I don't like the "tossing outliers" idea so much ... you have to be careful about "cleaning" your data just to make your analysis work out ... 
  • now feature extraction. Some features may be redundant, the data set might be too big, some of the data might be identified as unnecessary, or maybe the data doesn't obviously describe features we want to study so we want to put it in a human-understandable form. 
  • if the machine learning algorithm is supervised, label the data with "ground truth."
  • split up the data as needed into training and validation sets. 
  • choose appropriate machine learning algorithm and appropriate initial conditions/parameters. 
  • throw training data into machine learning algorithm and validate with validation data according to whatever process you've chosen. May also validate via other methods (like other machine learning algorithms?).
  • Hopefully it works! If not, revise process: cleaning; feature extraction; machine learning technique; or, worst of all, hyperparameter tuning (the learning algorithm was supposed to do this all on its own, wasn't it?). Or, maybe you learn that you don't have enough data for your learned model to stabilize, so you need more data. Iterate on process as needed until you have a validated model. 
  • Now the pipeline is set up. To make predictions about new data, clean and preprocess, feature extract, and put into machine learning algorithm to find the predictions. Interpret as needed. 
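
Here's a minimal sketch of the steps above in plain Python, just to make the shape of the pipeline concrete. Everything in it is an assumption made for illustration: the toy data (rows of reading, sensor name, label, generated from y = 3x + 2), the function names, and the choice of a least-squares line as the "machine learning algorithm." A real pipeline would swap in its own cleaning rules, features, and model.

```python
import random

def clean(rows):
    """Drop incomplete entries (any field missing)."""
    return [row for row in rows if all(value is not None for value in row)]

def extract_feature(row):
    """Hypothetical feature extraction: keep the reading, ignore the sensor name."""
    return row[0]

def split(pairs, validation_fraction, seed):
    """Shuffle reproducibly, then split into training and validation sets."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, pairs):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in pairs) / len(pairs)

def predict(model, new_row):
    """New data goes through the same feature extraction as the training data."""
    slope, intercept = model
    return slope * extract_feature(new_row) + intercept

# Toy raw data: (reading, sensor name, ground-truth label), plus two incomplete rows.
raw = [(float(x), "sensor-a", 3.0 * x + 2.0) for x in range(20)]
raw += [(None, "sensor-a", 1.0), (5.0, "sensor-b", None)]

cleaned = clean(raw)
labeled = [(extract_feature(row), row[2]) for row in cleaned]
train, validation = split(labeled, validation_fraction=0.25, seed=0)
model = fit_line(train)
validation_error = mean_squared_error(model, validation)
prediction = predict(model, (10.0, "sensor-a"))
```

Because the toy data is exactly linear, the validation error should come out at essentially zero; with real data, this is the number you would watch while iterating on the earlier steps.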

So I wrote this all at once, following my intuition from a data science perspective, and now I'm looking back. What's missing? 
  • There's no mention of testing the code to see if it's working properly, only validating the results. Some unit/integration tests would be nice. I suppose validation might be more important than making sure the algorithm is implemented correctly, but still... you don't want to have unreproducible results. 
  • There's no discussion of how to find an "appropriate" machine learning method or initial conditions/parameters. 
  • Do we need to weight certain data differently than other data? Do we need regularization? Are these just parameters that we assume are part of the algorithm? 
  • There's no planning stage. You probably need to plan what algorithms you're going to use before you do anything, including cleaning (maybe your algorithm is robust to outliers so you don't have to "clean" those out! "Data cleaning" gives me the heebie jeebies.). 
  • Furthermore, there's no mention of the application and the desired outcomes: the specific application will have an effect on cleaning, feature selection/extraction, and algorithm choice. 
  • There's no step about feature selection. You have to watch out for collinearity and whether your features are actually predicting anything useful. 
A lot of the above is probably wrapped up in the "iterating" step or can be summed up by "need to better explain planning, testing, and choosing parameters (including algorithms and input features)." Those seem slightly harder than the rest of the pipeline to me. The planning and choosing parameters because you don't necessarily know beforehand what setup is going to work best, and the testing because, well, testing is sometimes hard. I've literally never come across anyone talking about how they test their machine learning code (only how they validate). There are probably data sets out there with known results for certain (basic) algorithms because people make tutorials about learning machine learning. 
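
As a rough illustration of what "testing" (as opposed to validating) could look like, here's a sketch that checks a hand-written 1-nearest-neighbor classifier against a tiny data set whose answers are known in advance. The classifier, the data, and the test names are all made up for this example; the point is only that the tests check the implementation, not how well a model fits real data.

```python
import math
import random

def nearest_neighbor_predict(train_points, train_labels, query):
    """Return the label of the training point closest to `query` (1-NN)."""
    distances = [math.dist(point, query) for point in train_points]
    return train_labels[distances.index(min(distances))]

# Hand-built data: two well-separated clusters, so the right answers are obvious.
POINTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
LABELS = ["low", "low", "low", "high", "high", "high"]

def test_memorizes_training_points():
    # A 1-NN classifier must return the stored label for every training point.
    for point, label in zip(POINTS, LABELS):
        assert nearest_neighbor_predict(POINTS, LABELS, point) == label

def test_known_answers():
    # Queries placed inside each cluster have unambiguous labels.
    assert nearest_neighbor_predict(POINTS, LABELS, (0.4, 0.3)) == "low"
    assert nearest_neighbor_predict(POINTS, LABELS, (9.5, 10.2)) == "high"

def test_training_order_does_not_matter():
    # Reproducibility check: shuffling the training data must not change predictions.
    paired = list(zip(POINTS, LABELS))
    random.Random(0).shuffle(paired)
    points, labels = zip(*paired)
    for query in [(0.4, 0.3), (9.5, 10.2), (3.0, 4.0)]:
        expected = nearest_neighbor_predict(POINTS, LABELS, query)
        assert nearest_neighbor_predict(points, labels, query) == expected

passed = []
for test in (test_memorizes_training_points, test_known_answers,
             test_training_order_does_not_matter):
    test()
    passed.append(test.__name__)
```

The same pattern (known input, known output, plus an invariance or two) should carry over to other basic algorithms, though picking the right invariants for a given algorithm is its own problem.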

Maybe for, say, a neural network, there are some simple inputs (step functions on one input at a time?) where you can trace the effect they have on the whole network to make sure the network edges are set up properly. You might set all weights to 1 to make verification easier. This is my default for testing linear systems, so it isn't necessarily all we need to test a nonlinear system, but it's a start. Could you use just step function inputs with all weights set to 1 (and biases set to... 0? something else?) to verify that your network topology is as you expect? Hmmm, this actually feels like a kind of interesting side of machine learning algorithms: not just how you write them, but how you test them. Are the algorithms easy to test? What makes an implementation easier to test? Does anyone even need such a test for network topology, given that there are libraries to set up neural nets? Perhaps if someone got into pruning neural nets (using L1 regularization, maybe?) and they wanted to see what the output of their algorithm was... 
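
Here's one way that idea might be sketched, under some loud assumptions: the network is a plain feedforward net stored as lists of weights, a weight of exactly 0 means "no edge" (for example, a pruned one), the activation is ReLU, and the "step" input is just a constant 1 on one input at a time. With every remaining weight set to 1 and every bias set to 0, each output should equal the number of paths from the active input to that output, which is something you can count by hand. All function names here are hypothetical.

```python
def forward(layers, inputs):
    """Run a feedforward pass with ReLU activations.

    Each layer is (weights, biases); weights[j][i] is the weight on the edge
    from unit i of the previous layer to unit j of this layer.
    """
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            max(0.0, sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

def unit_weight_copy(layers):
    """Keep the topology: nonzero weights become 1, zero weights stay 0, biases become 0."""
    return [
        ([[1.0 if w != 0 else 0.0 for w in row] for row in weights],
         [0.0] * len(biases))
        for weights, biases in layers
    ]

def path_counts(layers, n_inputs):
    """For each input switched on alone, return the outputs of the unit-weight network.

    Under the assumptions above, entry [i][j] is the number of paths
    from input i to output j.
    """
    unit_layers = unit_weight_copy(layers)
    counts = []
    for i in range(n_inputs):
        one_hot = [1.0 if k == i else 0.0 for k in range(n_inputs)]
        counts.append(forward(unit_layers, one_hot))
    return counts

# Toy network: 3 inputs -> 2 hidden units -> 1 output.
# Hidden unit 0 sees inputs 0 and 1; hidden unit 1 sees inputs 1 and 2.
network = [
    ([[0.5, -1.2, 0.0], [0.0, 0.7, 2.0]], [0.1, -0.4]),
    ([[1.5, -0.3]], [0.2]),
]
original_counts = path_counts(network, 3)

# Same network after "pruning" the edge from hidden unit 1 to the output.
pruned = [network[0], ([[1.5, 0.0]], [0.2])]
pruned_counts = path_counts(pruned, 3)
```

In the toy network, input 1 reaches the output through both hidden units, so its count is 2 while inputs 0 and 2 get 1; after pruning, input 2 drops to 0, which flags that it has been disconnected from the output. This only checks which edges exist, not whether the trained weights are right, so it's a start rather than a full test of a nonlinear system.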

NB: Why do people call it "multicollinearity?" Why does "collinearity" have two "L"s? Shouldn't it just be "co" and "linearity" put together into "colinearity," where the "multi" and the second "L" are redundant? What am I missing?
