Friday, April 13, 2018

analyzing data from single-cell RNA sequencing

There's an interesting problem floating around biology circles: biologists sometimes want to study, say, the RNA in just one cell. As in, you don't take a bunch of cells and average the results; you just have one cell to take samples from.

Why would you want to use single-cell sequencing? Well, if you worry that you have different types of cells, or cells that have different purposes, then to distinguish between them and avoid averaging over all the different types of cells you'd want to look at one cell at a time.

What's difficult about analyzing single-cell sequencing data?

Averaging is a smoothing operation, so when you're looking at averages you can use some well-behaved models, like linear models. But without averaging, without smoothing, without the law of large numbers... you're left with sparse, noisy data. What model are you going to fit to that?

data cleaning

I saw a talk the other day by Anna Gilbert about data cleaning, and it was actually interesting!

I had never thought of data cleaning as interesting (or acceptable) because it always seemed like people were throwing out inconvenient data points willy-nilly. Like people would arbitrarily decide what an outlier was and just throw those data out. Horrifying.

Anna talked about the issue of getting noisy pairwise distance data. The thing about distance data is that it's supposed to come from a metric, but the noise can mess up the triangle inequality sometimes. Having a metric is usually useful downstream for other computations and can give better guarantees or results. So, in metric repair, the idea is to adjust some of the distances so that all the triangle inequalities are satisfied.

Anna's idea is to do sparse metric repair, which means changing as few distances as possible while making the distances satisfy the triangle inequality.
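To make the triangle-inequality problem concrete, here's a minimal sketch in Python (using NumPy). It is not Anna's sparse repair algorithm. It just finds violated triangles and then applies a simpler, well-known fix, decrease-only repair, which replaces each distance with the shortest-path distance. The function names and the toy matrix are made up for illustration, and this simple repair is not guaranteed to change the fewest possible distances.

```python
import itertools
import numpy as np

def triangle_violations(D, tol=1e-12):
    """List (i, j, k) with i < j where D[i, j] > D[i, k] + D[k, j]."""
    n = D.shape[0]
    bad = []
    for i, j, k in itertools.permutations(range(n), 3):
        if i < j and D[i, j] > D[i, k] + D[k, j] + tol:
            bad.append((i, j, k))
    return bad

def decrease_only_repair(D):
    """Replace every distance by the shortest-path distance (Floyd-Warshall).

    Distances are only ever lowered. This is NOT sparse metric repair;
    it is a simpler stand-in used here for illustration.
    """
    R = D.astype(float)  # astype returns a copy, so D is left untouched
    for k in range(R.shape[0]):
        R = np.minimum(R, R[:, [k]] + R[[k], :])
    return R

# Toy example: the distance between points 0 and 2 is a wild outlier.
D = np.array([[0.0, 1.0, 10.0],
              [1.0, 0.0, 1.0],
              [10.0, 1.0, 0.0]])
violations = triangle_violations(D)
R = decrease_only_repair(D)
# Count how many unordered pairs had their distance changed.
changed_pairs = int(np.sum(np.triu(R != D, k=1)))
```

In this toy case only the one outlier distance gets changed, which is the kind of behavior sparse repair is after, but that is a property of the example rather than of the method.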

She didn't have an application. Someone pointed out that usually, all sensors have a little bit of noise so requiring sparsity isn't really necessary. My thoughts are that yes, sensors are noisy, but they usually give values close to ground truth. But sometimes sensors malfunction and give readings that are way off of ground truth. If you were trying to fix those extreme outliers, you would try to change as few distances as possible, but the ones you did change you might allow to be altered a significant amount. Possible applications are distributed sensing, robot swarms, and DNA sequencing for phylogenetic tree reconstruction.

Thursday, April 5, 2018

generative adversarial networks (GANs) and a weird ML code-phrase

what is a GAN supposed to do? 

You input some data and the point is to output data that's similar to the input, but synthetic. You try to infer the distribution underlying your sample, and then you spit out other points from that distribution. So if you have a sample of cute doggy faces, you'd expect to be able to produce new, synthetic pictures of lots of cute doggy faces.

to set up a GAN you need:

GANs use unsupervised learning, so you don't need a labeled data set!

You do need adversarial neural nets, the actor and the critic. The actor tries to mimic the true data, and the critic tries to tell the difference between the real sample points and the actor's synthetic data. Each learns in turn: first, the actor learns how to outsmart the critic (so the critic cannot differentiate between real and synthetic data), and then the critic learns how to catch the actor (i.e., it learns to tell the difference between real and synthetic data), and then the actor learns some more so that it can once again outsmart the critic, and on and on.
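The take-turns loop can be sketched with a deliberately tiny toy: a one-dimensional "actor" G(z) = a*z + b and a logistic "critic" D(x) = sigmoid(w*x + c), with gradients written out by hand. Everything here is an assumption for illustration (the real data being N(3, 1), the parameter names, the learning rate, the non-saturating actor loss), and it is not the method from any particular talk or paper. Both players simply start from random parameters, and there is no promise that the loop converges.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_toy_gan(steps=200, batch=64, lr=0.05, seed=0):
    """Alternate actor and critic gradient steps on a 1-D toy problem."""
    rng = np.random.default_rng(seed)
    # Random initial parameters for both players (no pretraining).
    a, b = rng.normal(size=2)  # actor:  G(z) = a*z + b
    w, c = rng.normal(size=2)  # critic: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        # Actor's turn: minimize -log D(G(z)), i.e. try to fool the critic.
        z = rng.normal(size=batch)
        d_fake = sigmoid(w * (a * z + b) + c)
        a -= lr * np.mean((d_fake - 1.0) * w * z)
        b -= lr * np.mean((d_fake - 1.0) * w)
        # Critic's turn: minimize -log D(real) - log(1 - D(fake)).
        x_real = rng.normal(loc=3.0, scale=1.0, size=batch)
        x_fake = a * rng.normal(size=batch) + b
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w -= lr * (np.mean((d_real - 1.0) * x_real) + np.mean(d_fake * x_fake))
        c -= lr * (np.mean(d_real - 1.0) + np.mean(d_fake))
    return a, b, w, c

params = train_toy_gan()
```

Swapping the order of the two turns inside the loop is a one-line change, which makes it easy to experiment with who goes first.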

The one thing I don't quite understand is where the start of this all is. Once the actor and the critic are mostly on track it seems like it won't be hard for them to continue, but each neural net needs the other to measure its success. So which do you train first: the chicken or the egg? And how do you do that training? Or can you really just throw in some random parameter values and expect the system to converge to what you want?

thoughts on GANs 

This whole setup just begs for some convergence theorems, doesn't it? And apparently GANs are really finicky to train... which implies that people aren't using good* convergence theorems... which could imply that good convergence theorems don't exist, but could also just imply that good convergence theorems do exist but people aren't using them... Oh, or it could imply that convergence is fine but what you end up with is just not the result you wanted. For example, maybe you need to choose a better set of features as inputs to the neural nets.

*What's a "good" (set of) convergence theorem(s)? Well, a theorem should actually work out in practice (shouldn't stability-like theorems always work in practice? that's the whole point!). That means training should finish in the prescribed time, which should be finite and reasonable. That also means the theorem should apply to real applications. A GAN maybe doesn't have to converge for all possible distributions-from-which-input-is-sampled, but if the theorem only applies to a significant** chunk of distributions then we should be able to check whether any given distribution is in that chunk. And then, of course, to be a "good" theorem, we need to know which initial conditions for the parameters lead to convergence.

**either significant in size or significant in terms of applications.

questions about norms that aren't normal to me

The topic of convergence theorems for GANs is pretty interesting to me, but at the same time I was learning about GANs I also learned another interesting tidbit:

I heard about all this at a talk about GANs from Larry Carin (who perhaps I should cite for all this information? Larry Carin, Triangle Machine Learning Day at Duke, academic keynote talk, April 3, 2018). An audience member from industry was immediately interested in Larry's new method of setting up GANs and wanted to know if papers or code were posted anywhere, and Larry just said "it's under review" and nothing else. Well, he said "it's under review" twice when the person from the audience pushed him on it.

So, does he not put preprints on arXiv? If he doesn't, why not? And furthermore, why didn't he explain why the paper isn't publicly available? Is he worried about being scooped? (it's already submitted!) Is he worried about copyright? (Ew. Journals and conferences who don't let you offer free preprints are the worst, but usually an author will still email a copy to a person if they ask.) Is he worried the reviews are going to come back indicating major errors? (then why is he talking about the project?) Doesn't machine learning research move really fast, so shouldn't he want it out? Oooooh, maybe he has a student who's working on a problem based off this work and he doesn't want his student to get scooped. So he's giving the student as much of a head-start as he can.

To this last guess, I have:
My 1st reaction: "That's sweet of him."
2nd: "Wait, no, *tries to think of a way this impacts disadvantaged students* ... hmmmm"
3rd: "I guess it's bad for the field?"
4th: "It must be hard for students in this field. Getting scooped is not fun, especially not for a dissertation. Maybe this is an acceptable protection for someone entering the field."
5th: "What if he's protecting someone who isn't just entering the field? He could be doing it so some already-established academic can get a leg up. Is that acceptable?"

Instead of falling down this rabbit hole, I'll conclude: Why is he talking about this work now if he has written a manuscript but won't make it available? Well, I guess maybe it is available in the sense that if I email him he might send me a copy. But still, why not post it on arXiv? And whatever the reason is, why won't he explicitly say it's not publicly available yet? There are a lot of norms in the machine learning community that I don't know about. Apparently one of them is that "it's under review" is code for something -- something slightly uncomfortable and thus not to be talked about in mixed company -- and I do not know what that something is.

what i do right now

There are a lot of different ways to describe what most people do. It's about the context you choose. I'll try a few different contexts here as I try to describe what I do.

feature extraction

Features are measurable properties of data. You might have a picture, and a feature might be how many humans are in the picture. You might have a song, and a feature might be the key, or the tempo, or a chord progression. A "good" set of features gives important information (whatever "important" means); doesn't have overlap (more than one feature representing the same information); and can differentiate between the data being studied.

The important thing to notice about features is that data is not always in a form that makes a given feature easily accessible; data doesn't necessarily directly describe features of interest, which might only be indirectly hinted at in complicated ways. This is unfortunate because statistics generally deals with explicit data, not the hidden nuggets of information that we may really want to work with.

For example, a picture is a bunch of color data. It explicitly states how much red, blue, and green is at each pixel. What does that tell a human vs a computer about how many dogs are in a picture? Humans are good at extracting the number of dogs from pictures, and we can correctly identify the number for lots of pictures. If you run a basic statistics algorithm on image data, though, it's going to do all of its statistics on colors, not numbers of dogs. A computer needs a feature extractor to first find the number of dogs in each picture in a data set and explicitly state that number in the data (this is going to be some sort of computer vision algorithm). Then that information can be sent to statistical or machine learning techniques for more analysis.

my work in terms of feature extraction

I work on a feature extractor. It extracts qualitative (and some quantitative) information from point clouds (it can also be used on other types of data sets, but that won't be covered here). The problem with this feature extractor is that it outputs information in the form of a module, which is a mathematical object. As far as I know, there are no statistical or machine learning techniques designed for module data, so we need to translate that module data into acceptable input for statistics/ML techniques. That's what I'm doing.

my work in terms of mathematics

From a mathematical point of view I'm looking for invariants of poset modules. Invariants are features that don't change when you make certain alterations to your module. The alterations -- which we call isomorphisms -- don't change the inherent information of the module, but they do change how the module is presented. It's an idea similar to reducing fractions. \(\frac{2}{4}\) and \(\frac{1}{2}\) are two fractions that we've presented or written down differently, but they actually represent the same quantity. Here the isomorphism is reduction of fractions, and the invariant is the actual quantity or value of the number.
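The fraction analogy is small enough to spell out in Python. This is only an illustration of "same invariant, different presentation" using the standard library's Fraction type; the helper name `value` is made up, and nothing here touches actual poset modules.

```python
from fractions import Fraction

# Two presentations: plain (numerator, denominator) pairs.
presentation_a = (2, 4)
presentation_b = (1, 2)

def value(presentation):
    """The invariant: the rational number the pair represents."""
    numerator, denominator = presentation
    return Fraction(numerator, denominator)

same_presentation = presentation_a == presentation_b
same_invariant = value(presentation_a) == value(presentation_b)
```

The pairs differ as written-down data, but the invariant agrees, so anything computed from `value` cannot tell the two presentations apart.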

Can we design a learning algorithm that learns which features are important? In principle I would say that neural nets already do this, but hand-picked feature extraction and selection seems to be a big part of training a learning algorithm to work correctly. ML and neural nets don't automate data analysis. You still have to choose your features and pick a model, to some extent -- it's just the parameters of that model that get learned.

Wednesday, April 4, 2018

a question to consider when switching fields

As I (try to) decide my career trajectory post-phd, I'm considering something many people in STEM never do: how poorly will I be treated in the field I choose to enter?

I'm female. The fields I have experience in are electrical engineering, robotics, math research, and math teaching. Ordered by how well I was treated, worst to best, that list is: robotics, electrical engineering, math research, math teaching.

I moved more or less directly from robotics to math. At first I thought the math community was a wonderful oasis of respect and welcome. It's not true, of course, but it felt so good to move from explicit, aggressive misogyny to explicit statements about commitment to gender equity (thanks, Bill Pardon, for making me feel like I belonged and was valued).

I'm exploring the idea of adjusting my research towards computer science. Machine learning and AI in general are well-funded and interesting to me, but I worry about the overlap between those areas and robotics and what that overlap suggests about AI research culture. Encouragingly, I think I had the most trouble with mechanical engineers or other researchers interested in designing and constructing the physical robots, so perhaps the problem was concentrated in a niche I won't frequently interact with.

I wish I didn't have to worry about this. I'm honestly not even giving much thought to it because my timing for deciding on this transition is suboptimal and somewhat overwhelming as it is, without cultural considerations. I'll just see how it goes and consider whether I can stand to return to that kind of world. Or I'll see if I have any other interesting options that don't involve subjecting myself to quite so much sexism.

Update: Went to Triangle Machine Learning Day at Duke. Pro: lots of women. Con: got mansplained to, as well as looked up and down. Boo.

Update II: Recalled some of my less happy experiences. After hearing about some of those experiences, advisor says he doesn't want me to go into robotics. Just point blank, "I don't want you to go into robotics," not "you should do as you feel best," not even "it's up to you but it doesn't sound like a good environment." And postdoc I spoke to last week did not have good things to say about treatment of women in ML community. I don't think I can go back to that. I don't want to even imagine trying to enter a new academic field, a new political network, a new community, if that's what I'll find.

Biology. Biology has way more women and that is correlated with better treatment of women, isn't it?

Sunday, April 1, 2018

controllability (and observability) (and identifiability?) for neural nets

I used the word "controllability" while writing another post and it brings an interesting thought to mind: controllability is a technical term. <blank> controllability is the idea that from any initial condition, you can reach any point in the <blank> space in finite time. <blank> could be filled in by "state," "output," and maybe other spaces associated with your system. The way that you reach the desired point in the <blank> space is by changing only the inputs to the system that you have control over. 

Controllability is a row rank type of condition in LTI (and LTD?) systems, so there should be a dual to it that's column rank... internet says it's observability. Probably that you can determine the input or state based on just output? Yeah that sounds right, both in terms of the name "observability" and the technical aspects of duality to controllability. 
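For LTI systems x' = Ax + Bu, y = Cx, the rank condition is the Kalman rank test, and the duality shows up directly in code. Here's a minimal sketch with NumPy; the function names and the double-integrator example are chosen for illustration. In the standard LTI setting, observability is about recovering the state from the output.

```python
import numpy as np

def controllability_matrix(A, B):
    """Stack [B, AB, A^2 B, ..., A^(n-1) B] side by side."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

def is_controllable(A, B):
    """Kalman rank test: controllable iff the matrix has full row rank n."""
    return np.linalg.matrix_rank(controllability_matrix(A, B)) == A.shape[0]

def is_observable(A, C):
    """Duality: (A, C) is observable iff (A^T, C^T) is controllable."""
    return is_controllable(A.T, C.T)

# Double integrator: state = (position, velocity), input = acceleration.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
C_position = np.array([[1.0, 0.0]])  # measure position only
C_velocity = np.array([[0.0, 1.0]])  # measure velocity only

controllable = is_controllable(A, B)
observable_from_position = is_observable(A, C_position)
observable_from_velocity = is_observable(A, C_velocity)
```

Measuring only velocity loses the initial position, so that last case fails the rank test, while measuring position lets you recover velocity by watching how position changes.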

What does controllability have to do with machine learning? Well, it might be nice to know, for example, what the possible outputs are for a neural network (output controllability). That might depend on neural net topology or initial conditions in an interesting way and might inform how we choose those topologies and initial conditions. State controllability might be useful for tuning or pretraining in deep learning. Perhaps state controllability could guarantee that we could tune/pretrain the neural net in a modular fashion, tuning/pretraining one layer at a time? 

Similarly, could observability be useful? If a system couldn't tell the difference between inputs that you want distinguished, then that would discount the associated neural net topology and initial conditions as plausible for the application. So that's input observability. State observability could have some uses for interpretability. Though.. just because you know what the state is doesn't mean that it's interpretable by a human, which is the real point of interpretability.  

Can we use controllability (observability) as a way to analyze neural net topology and initial conditions, as a preliminary check to see if they could possibly be appropriate for the application? Or to find the smallest/simplest possible neural nets that could theoretically solve our problems?

Another nice property of a system is identifiability. (Global) identifiability is the idea that, given enough data (any desired inputs and the corresponding outputs from the system), you can determine all of the parameters of the system. There are different relaxations, like local identifiability, which means that, given enough data, you can determine a finite number of possibilities for the system parameters.

I've come across identifiability through algebraic statistics, which is a field that... isn't really used in practice yet, as far as I know (says someone from topological data analysis... lol. Though I think persistent homology is used by some statisticians and the work on persistence landscapes to make (finite-dimensional) feature vectors out of persistence is a result in the right direction, and currently there are a bunch of people working towards getting persistent homology fed into machine learning algorithms). I've gotten the impression that identifiability is a theoretical result that is too technically difficult to use in practice. But identifiability seems like it's necessary for state observability, so it should in practice be easier than observability, right? Well, generalizing to observability might make the important facets of the problem more clear so it could seem easier to humans, but ... it's not a good sign. 

So, I guess, in principle, we could ask: Can we use identifiability as a way to analyze neural net topology and initial conditions, as a preliminary check to see if they could possibly be appropriate for the application?

For identifiability, my hopes are not high, nor are my expectations. Which might imply that controllability and observability aren't nice to work with, either. I think I've mostly seen them in linear systems, but... a lot of machine learning is linear algebra, right?

Came across another term: accessibility. It's weaker than controllability. But maybe I'll look into it another time...

Thursday, March 22, 2018

machine learning pipelines and testing vs validation

What's a typical machine learning pipeline? Here's my guess:

  • Start with raw data. 
  • Some entries may be incomplete; either toss those or find a way to complete them. 
  • there could be other preprocessing, like tossing outliers or data that is obviously from sensor malfunction. I don't like the "tossing outliers" idea so much ... you have to be careful with "cleaning" your data to make your analysis work out ... 
  • now feature extraction. Some features may be redundant, the data set might be too big, some of the data might be identified as unnecessary, or maybe the data doesn't obviously describe features we want to study so we want to put it in a human-understandable form. 
  • if the machine learning algorithm is supervised, label the data with "ground truth."
  • split up the data as needed into training and validation sets. 
  • choose appropriate machine learning algorithm and appropriate initial conditions/parameters. 
  • throw training data into machine learning algorithm and validate with validation data according to whatever process you've chosen. May also validate via other methods (like other machine learning algorithms?).
  • Hopefully it works! If not, revise process: cleaning; feature extraction; machine learning technique; or, worst of all, hyperparameter tuning (the learning algorithm was supposed to do this all on its own, wasn't it?). Or, maybe you learn that you don't have enough data for your learned model to stabilize, so you need more data. Iterate on process as needed until you have a validated model. 
  • Now the pipeline is set up. To make predictions about new data, clean and preprocess, feature extract, and put into machine learning algorithm to find the predictions. Interpret as needed. 
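The steps above can be sketched end to end in a few lines. This is a toy under loud assumptions: synthetic data, a plain least-squares linear model standing in for "the machine learning algorithm," standardization standing in for feature extraction, and a single train/validation split. The function name `run_pipeline` and all parameters are made up for illustration.

```python
import numpy as np

def run_pipeline(X, y, val_fraction=0.25, seed=0):
    """Clean, split, fit, validate; return a predictor and validation MSE."""
    # Cleaning: toss incomplete entries (rows containing NaN).
    keep = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    X, y = X[keep], y[keep]
    # Split into training and validation sets.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val_idx, train_idx = order[:n_val], order[n_val:]
    # "Feature extraction": standardize using training statistics only.
    mu = X[train_idx].mean(axis=0)
    sigma = X[train_idx].std(axis=0)
    sigma[sigma == 0] = 1.0

    def features(raw):
        Z = (raw - mu) / sigma
        return np.hstack([np.ones((len(Z), 1)), Z])  # add intercept column

    # Fit: ordinary least squares on the training set.
    coef, *_ = np.linalg.lstsq(features(X[train_idx]), y[train_idx], rcond=None)

    def predict(raw):
        return features(raw) @ coef

    # Validate on held-out data.
    val_mse = float(np.mean((predict(X[val_idx]) - y[val_idx]) ** 2))
    return predict, val_mse

# Synthetic "raw data" with a known relationship and a few incomplete rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)
X[::25, 0] = np.nan
predict, val_mse = run_pipeline(X, y)
new_prediction = predict(np.array([[1.0, 0.0]]))
```

Computing the standardization statistics from the training rows only keeps information from the validation set out of the fit, which is one of those small planning decisions the list doesn't mention.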

So I wrote this all at once, following my intuition from a data science perspective, and now I'm looking back. What's missing? 
  • There's no mention of testing the code to see if it's working properly, only validating the results. Some unit/integration tests would be nice. I suppose validation might be more important than making sure the algorithm is implemented correctly, but still... you don't want to have unreproducible results. 
  • There's no discussion on how to find an "appropriate" machine learning method or initial conditions/parameters. 
  • Do we need to weight certain data differently than other data? Do we need regularization? Are these just parameters that we assume are part of the algorithm? 
  • There's no planning stage. You probably need to plan what algorithms you're going to use before you do anything, including cleaning (maybe your algorithm is robust to outliers so you don't have to "clean" those out! "Data cleaning" gives me the heebie jeebies.). 
  • Furthermore, there's no mention of the application and the desired outcomes: the specific application will have an effect on cleaning, feature selection/extraction, and algorithm choice. 
  • There's no step about feature selection. You have to watch out for collinearity and whether your features are actually predicting anything useful. 
A lot of the above is probably wrapped up in the "iterating" step or can be summed up by "need to better explain planning, testing, and choosing parameters (including algorithms and input features)." Those seem slightly harder than the rest of the pipeline to me. The planning and choosing parameters because you don't necessarily know beforehand what setup is going to work best, and the testing because, well, testing is sometimes hard. I've literally never come across anyone talking about how they test their machine learning code (only how they validate). There are probably data sets out there with known results for certain (basic) algorithms because people make tutorials about learning machine learning. 

Maybe for, say, a neural network, there are some simple inputs (step functions on one input at a time?) where you can trace the effect they have on the whole network to make sure the network edges are set up properly. You might set all weights to 1 to make verification easier. This is my default for testing linear systems so it isn't necessarily all we need to test a nonlinear system, but it's a start. Could you use just step function inputs with all weights set to 1 (and biases set to... 0? something else?) to verify that your network topology is as you expect? Hmmm, this actually feels like a kind of interesting side of machine learning algorithms: not just how you write them, but how you test them. Are the algorithms easy to test? What makes an implementation easier to test? Does anyone need such a test for network topology, since there are libraries to set up neural nets? Perhaps if someone got into pruning neural nets (using L1 regularization, maybe?) and they wanted to see what the output of their algorithm was... 
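Here's a minimal sketch of that idea with NumPy. It assumes a plain feedforward net with ReLU activations whose topology is given as 0/1 connectivity masks (1 means the edge exists); the names `forward` and `probe_topology` and the tiny example network are made up. With every existing weight set to 1 and every bias set to 0, a one-hot input makes each output equal the number of paths from that input to that output, which you can check by hand against the topology you intended.

```python
import numpy as np

def forward(x, weights, biases):
    """Plain feedforward pass with ReLU after every layer."""
    for W, b in zip(weights, biases):
        x = np.maximum(W @ x + b, 0.0)
    return x

def probe_topology(masks):
    """Feed one-hot inputs through a net with unit weights and zero biases.

    masks[l][i, j] == 1 means unit j of layer l feeds unit i of layer l+1.
    Row i of the result holds the outputs for one-hot input i, i.e. the
    number of input-to-output paths.
    """
    weights = [m.astype(float) for m in masks]
    biases = [np.zeros(m.shape[0]) for m in masks]
    n_inputs = masks[0].shape[1]
    return np.array([forward(np.eye(n_inputs)[i], weights, biases)
                     for i in range(n_inputs)])

# 2 inputs -> 3 hidden units -> 1 output, with some edges missing.
masks = [
    np.array([[1, 0],
              [1, 1],
              [0, 1]]),   # hidden unit 0 sees input 0 only, and so on
    np.array([[1, 1, 0]]),  # output sees hidden units 0 and 1 only
]
path_counts = probe_topology(masks)
```

By hand: input 0 reaches the output through hidden units 0 and 1 (two paths), and input 1 only through hidden unit 1 (one path), so a different count would mean an edge is wired wrong.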

NB: Why do people call it "multicollinearity?" Why does "collinearity" have two "L"s? Shouldn't it just be "co" and "linearity" put together into "colinearity," where the "multi" and the second "L" are redundant? What am I missing?