Ambles in Math: notes and thoughts on what I'm learning (ashleigh, http://www.blogger.com/profile/03982922819686581455)<br /><br /><h2>analyzing data from single-cell RNA sequencing (2018-04-13)</h2>There's an interesting problem floating around biology circles: biologists sometimes want to study, say, the RNA in just one cell. As in, you don't have a bunch of cells whose results you average; you have just one cell to take samples from.<br /><br />Why would you want to use single-cell sequencing? Well, if you suspect that you have different types of cells, or cells with different purposes, then to distinguish between them and avoid averaging over all the different types of cells you'd want to look at one cell at a time.<br /><br /><br /><h2>What's difficult about analyzing single-cell sequencing data?</h2>Averaging is a smoothing operation, so when you're looking at averages you can use some well-behaved models, like linear models. But without averaging, without smoothing, without the law of large numbers... you're left with sparse, noisy data. What model are you going to fit to that?<br /><br /><br /><h2>data cleaning (2018-04-13)</h2>I saw a talk the other day by Anna Gilbert, and it was about data cleaning, of all things, and it was interesting!<br /><br />I had never thought of data cleaning as interesting (or acceptable) because it always seemed like people were throwing out inconvenient data points willy-nilly. Like people would arbitrarily decide what an outlier was and just throw those data out.
Horrifying.<br /><br />Anna talked about the issue of getting noisy pairwise distance data. The thing about distance data is that it's supposed to come from a metric, but the noise can sometimes mess up the triangle inequality. Having a metric is usually useful downstream for other computations and can give better guarantees or results. So, in metric repair, the idea is to adjust some of the distances so that all the triangle inequalities are satisfied.<br /><br />Anna's idea is to do sparse metric repair, which means changing as few distances as possible while making the distances satisfy the triangle inequality.<br /><br />She didn't have an application. Someone pointed out that usually, all sensors have a little bit of noise, so requiring sparsity isn't really necessary. My thoughts are that yes, sensors are noisy, but they usually give values close to ground truth. But sometimes sensors malfunction and give readings that are way off of ground truth. If you were trying to fix those extreme outliers, you would try to change as few distances as possible, but the ones you did change you might allow to be altered a significant amount. Possible applications are distributed sensing, robot swarms, and DNA sequencing for phylogenetic tree reconstruction.<br /><br /><h2>generative adversarial networks (GANs) and a weird ML code-phrase (2018-04-05)</h2><h2>what is a GAN supposed to do? </h2>You input some data and the point is to output data that's similar to the input, but synthetic. You try to infer the distribution underlying your sample, and then you spit out other points from that distribution.
So if you have a sample of cute doggy faces, you'd expect to be able to produce new, synthetic pictures of lots of cute doggy faces.<br /><br /><h2>to set up a GAN you need:</h2>GANs use unsupervised learning, so you don't need a labeled data set!<br /><br />You do need two adversarial neural nets, the actor and the critic. The actor tries to mimic the true data, and the critic tries to tell the difference between the real sample points and the actor's synthetic data. Each learns in turn: First, the actor learns how to outsmart the critic (so the critic cannot differentiate between real and synthetic data), and then the critic learns how to catch the actor (i.e., it learns to tell the difference between real and synthetic data), and then the actor learns some more so that it can once again outsmart the critic, and on and on.<br /><br />The one thing I don't quite understand is where the start of this all is. Once the actor and the critic are mostly on track it seems like it won't be hard for them to continue, but each neural net needs the other to measure its success. So which do you train first: the chicken or the egg? And how do you do that training? Or can you really just throw in some random parameter values and expect the system to converge to what you want?<br /><br /><h2>thoughts on GANs </h2>This whole setup just begs for some convergence theorems, doesn't it? And apparently GANs are really finicky to train... which implies that people aren't using good* convergence theorems... which could imply that good convergence theorems don't exist, but could also just imply that good convergence theorems do exist but people aren't <i>using</i> them... Oh, or it could imply that convergence is fine but what you end up with is just not the result you wanted. For example, maybe you need to choose a better set of features as inputs to the neural nets.<br /><br />*What's a "good" (set of) convergence theorem(s)?
Well, a theorem should actually work out in practice (shouldn't stability-like theorems always work in practice? that's the whole point!). That means training should finish in the prescribed time, which should be finite and reasonable. That also means the theorem should apply to real applications. A GAN maybe doesn't have to converge for all possible distributions-from-which-input-is-sampled, but if it doesn't apply to a significant** chunk of distributions then we should be able to check whether any given distribution is in that chunk. And then, of course, to be a "good" theorem, we need to know which initial conditions for the parameters lead to convergence.<br /><br />**either significant in size or significant in terms of applications.<br /><br /><h2>questions about norms that aren't normal to me</h2><br /> The topic of convergence theorems for GANs is pretty interesting to me, but at the same time as I was learning about GANs I also learned another interesting tidbit:<br /><br />I heard about all this at a talk about GANs from Larry Carin (who perhaps I should cite for all this information? Larry Carin, Triangle Machine Learning Day at Duke, academic keynote talk, April 3, 2018). An audience member from industry was immediately interested in Larry's new method of setting up GANs and wanted to know if papers or code were posted anywhere, and Larry just said "it's under review" and nothing else. Well, he said "it's under review" twice when the person from the audience pushed him on it.<br /><br />So, does he not put preprints on arXiv? If he doesn't, why not? And furthermore, why didn't he explain why the paper isn't publicly available? Is he worried about being scooped? (it's already submitted!) Is he worried about copyright? (Ew. Journals and conferences that don't let you offer free preprints are the worst, but usually an author will still email a copy to a person if they ask.) Is he worried the reviews are going to come back indicating major errors?
(then why is he talking about the project?) Doesn't machine learning research move really fast, so shouldn't he want it out? Oooooh, maybe he has a student who's working on a problem based on this work and he doesn't want his student to get scooped. So he's giving the student as much of a head-start as he can.<br /><br />To this last guess, I have:<br />My 1st reaction: "That's sweet of him."<br />2nd: "Wait, no, *tries to think of a way this impacts disadvantaged students* ... hmmmm"<br />3rd: "I guess it's bad for the field?"<br />4th: "It must be hard for students in this field. Getting scooped is not fun, especially not for a dissertation. Maybe this is an acceptable protection for someone entering the field."<br />5th: "What if he's protecting someone who isn't just entering the field? He could be doing it so some already-established academic can get a leg up. Is that acceptable?"<br /><br />Instead of falling down this rabbit hole, I'll conclude: Why is he talking about this work now if he has written a manuscript but won't make it available? Well, I guess maybe it is available in the sense that if I email him he might send me a copy. But still, why not post it on arXiv? And whatever the reason is, why won't he explicitly say it's not publicly available yet? There are a lot of norms in the machine learning community that I don't know about. Apparently one of them is that "it's under review" is code for something -- something slightly uncomfortable and thus not to be talked about in mixed company -- and I do not know what that something is.<br /><br /><div><br /></div><h2>what i do right now (2018-04-05)</h2>There are a lot of different ways to describe what most people do. It's about the context you choose.
I'll try a few different contexts here as I try to describe what I do.<br /><br /><h2>feature extraction</h2>features are measurable properties of data. You might have a picture, and a feature might be how many humans are in the picture. You might have a song, and a feature might be the key, or the tempo, or a chord progression. A "good" set of features gives important information (whatever "important" means); doesn't have overlap, or more than one feature representing the same information; and can differentiate between the data being studied.<br /><br />The important thing to notice about features is that data is not always in a form that makes a given feature easily accessible; data doesn't necessarily directly describe features of interest; they might be indirectly hinted at in complicated ways. This is unfortunate because statistics generally deals with explicit data, not the hidden nuggets of information that we may really want to work with.<br /><br />For example, a picture is a bunch of color data. It explicitly states how much red, blue, and green is at each pixel. What does that tell a human vs a computer about how many dogs are in a picture? Humans are good at extracting the number of dogs from pictures, and we can correctly identify the number for lots of pictures. If you run a basic statistics algorithm on image data, though, it's going to do all of its statistics on colors, not numbers of dogs. A computer needs a feature extractor to first find the number of dogs in each picture in a data set and explicitly state that number in the data (this is going to be some sort of computer vision algorithm). Then that information can be sent to statistical or machine learning techniques for more analysis.<br /><br /><br /><h2>my work in terms of feature extraction</h2>I work on a feature extractor. It extracts qualitative (and some quantitative) information from point clouds (it can also be used on other types of data sets, but that won't be covered here).
The problem with this feature extractor is that it outputs information in the form of a module, which is a mathematical object. As far as I know, there are no statistical or machine learning techniques designed for module data, so we need to translate that module data into acceptable input for statistics/ML techniques. That's what I'm doing.<br /><br /><br /><h2>my work in terms of mathematics</h2>From a mathematical point of view I'm looking for invariants of poset modules. Invariants are features that don't change when you make certain alterations to your module. The alterations -- which we call isomorphisms -- don't change the inherent information of the module, but they do change how the module is presented. It's an idea similar to reducing fractions. \(\frac{2}{4}\) and \(\frac{1}{2}\) are two fractions that we've presented or written down differently, but they actually represent the same quantity. Here the isomorphism is reduction of fractions, and the invariant is the actual quantity or value of the number.<br /><br /><br /><br /><i>Can we design a learning algorithm that learns which features are important? </i>In principle I would say that neural nets already do this, but hand-picked feature extraction and selection seems to be a big part of training a learning algorithm to work correctly. ML and neural nets don't automate data analysis. You still have to choose your features and pick a model, to some extent -- it's just the parameters of that model that get learned.<br /><br /><h2>a question to consider when switching fields (2018-04-04)</h2>As I (try to) decide my career trajectory post-PhD, I'm considering something many people in STEM never do: how poorly will I be treated in the field I choose to enter?<br /><br />I'm female.
The fields I have experience in are electrical engineering, robotics, math research, and math teaching. Ordered by how well I was treated, worst to best, that list is: robotics, electrical engineering, math research, math teaching.<br /><br />I moved more or less directly from robotics to math. At first I thought the math community was a wonderful oasis of respect and welcome. It's not true, of course, but it felt so good to move from explicit, aggressive misogyny to explicit statements about commitment to gender equity (thanks, Bill Pardon, for making me feel like I belonged and was valued).<br /><br />I'm exploring the idea of adjusting my research towards computer science. Machine learning and AI in general are well-funded and interesting to me, but I worry about the overlap between those areas and robotics and what that overlap suggests about AI research culture. Encouragingly, I think I had the most trouble with mechanical engineers or other researchers interested in designing and constructing the physical robots, so perhaps the problem was concentrated in a niche I won't frequently interact with.<br /><br />I wish I didn't have to worry about this. I'm honestly not even giving much thought to it because my timing for deciding on this transition is suboptimal and somewhat overwhelming as it is, without cultural considerations. I'll just see how it goes and consider whether I can stand to return to that kind of world. Or I'll see if I have any other interesting options that don't involve subjecting myself to quite so much sexism.<br /><br /><br /><br />Update: Went to Triangle Machine Learning Day at Duke. Pro: lots of women. Con: got mansplained to, as well as looked up and down. Boo.<br /><br />Update II: Recalled some of my less happy experiences. After hearing about some of those experiences, my advisor says he doesn't want me to go into robotics.
Just point blank, "I don't want you to go into robotics," not "you should do as you feel best," not even "it's up to you but it doesn't sound like a good environment." And a postdoc I spoke to last week did not have good things to say about the treatment of women in the ML community. I don't think I can go back to that. I don't want to even imagine trying to enter a new academic field, a new political network, a new community, if that's what I'll find.<br /><br />Biology. Biology has way more women and that is correlated with better treatment of women, isn't it?<br /><br /><h2>controllability (and observability) (and identifiability?) for neural nets (2018-04-01)</h2><div>I used the word "controllability" while writing another post and it brings an interesting thought to mind: controllability is a technical term. <blank> controllability is the idea that from any initial condition, you can reach any point in the <blank> space in finite time. <blank> could be filled in by "state," "output," and maybe other spaces associated with your system. The way that you reach the desired point in the <blank> space is by changing only the inputs to the system that you have control over. </div><div><br /></div><div>Controllability is a full-row-rank type of condition in LTI (and LTD?) systems, so there should be a dual to it that's column rank... internet says it's observability. Probably that you can determine the input or state based on just output? Yeah that sounds right, both in terms of the name "observability" and the technical aspects of duality to controllability. </div><div><br /></div><div>What does controllability have to do with machine learning? Well, it might be nice to know, for example, what the possible outputs are for a neural network (output controllability).
That might depend on neural net topology or initial conditions in an interesting way and might inform how we choose those topologies and initial conditions. State controllability might be useful for tuning or pretraining in deep learning. Perhaps state controllability could guarantee that we could tune/pretrain the neural net in a modular fashion, tuning/pretraining one layer at a time? </div><div><i><br /></i></div><div>Similarly, could observability be useful? If a system couldn't tell the difference between inputs that you want distinguished, then that would discount the associated neural net topology and initial conditions as plausible for the application. So that's input observability. State observability could have some uses for interpretability. Though... just because you know what the state is doesn't mean that it's interpretable by a human, which is the real point of interpretability. </div><div><i><br /></i></div><div><i>Can we use controllability (observability) as a way to analyze neural net topology and initial conditions, as a preliminary check to see if they could possibly be appropriate for the application? Or to find the smallest/simplest possible neural nets that could theoretically solve our problems?</i></div><div><i><br /></i></div><div>Another nice property of a system is identifiability. (Global) identifiability is the idea that, given enough data (any desired inputs and the corresponding outputs from the system), you can determine all of the parameters of the system. There are different relaxations, like local identifiability, which means that, given enough data, you can determine a finite number of possibilities for the system parameters. </div><div><br /></div><div>I've come across identifiability through algebraic statistics, which is a field that... isn't really used in practice yet, as far as I know (says someone from topological data analysis... lol.
Though I think persistent homology is used by some statisticians and the work on persistence landscapes to make (finite-dimensional) feature vectors out of persistence is a result in the right direction, and currently there are a bunch of people working towards getting persistent homology fed into machine learning algorithms). I've gotten the impression that identifiability is a theoretical result that is too technically difficult to use in practice. But identifiability seems like it's necessary for state observability, so it should in practice be easier than observability, right? Well, generalizing to observability might make the important facets of the problem more clear so it could seem easier to humans, but ... it's not a good sign. </div><div><br /></div><div>So, I guess, in principle, we could ask: <i>Can we use identifiability as a way to analyze neural net topology and initial conditions, as a preliminary check to see if they could possibly be appropriate for the application?</i></div><div><i><br /></i></div><div>For identifiability, my hopes are not high, nor are my expectations. Which might imply that controllability and observability aren't nice to work with, either. I think I've mostly seen them in linear systems, but... a lot of machine learning is linear algebra, right?<br /><br />Came across another term: accessibility. It's weaker than controllability. But maybe I'll look into it another time...</div><br /><br /><h2>machine learning pipelines and testing vs validation (2018-03-22)</h2>What's a typical machine learning pipeline? Here's my guess:<br /><br /><ul><li>Start with raw data. </li><li>Some entries may be incomplete; either toss those or find a way to complete them.
</li><li>there could be other preprocessing, like tossing outliers or data that is obviously from sensor malfunction. I don't like the "tossing outliers" idea so much ... you have to be careful with "cleaning" your data to make your analysis work out ... </li><li>now feature extraction. Some features may be redundant, the data set might be too big, some of the data might be identified as unnecessary, or maybe the data doesn't obviously describe features we want to study so we want to put it in a human-understandable form. </li><li>if the machine learning algorithm is supervised, label the data with "ground truth."</li><li>split up the data as needed into training and validation sets. </li><li>choose an appropriate machine learning algorithm and appropriate initial conditions/parameters. </li><li>throw training data into the machine learning algorithm and validate with validation data according to whatever process you've chosen. May also validate via other methods (like other machine learning algorithms?).</li><li>Hopefully it works! If not, revise the process: cleaning; feature extraction; machine learning technique; or, worst of all, hyperparameter tuning (the learning algorithm was supposed to do this all on its own, wasn't it?). Or, maybe you learn that you don't have enough data for your learned model to stabilize, so you need more data. Iterate on the process as needed until you have a validated model. </li><li>Now the pipeline is set up. To make predictions about new data, clean and preprocess, feature extract, and put into the machine learning algorithm to find the predictions. Interpret as needed. </li></ul><div><br /></div><div>So I wrote this all at once, following my intuition from a data science perspective, and now I'm looking back. What's missing? </div><div><ul><li>There's no mention of testing the code to see if it's working properly, only validating the results. Some unit/integration tests would be nice.
I suppose validation might be more important than making sure the algorithm is implemented correctly, but still... you don't want to have unreproducible results. </li><li>There's no discussion of how to find an "appropriate" machine learning method or initial conditions/parameters. </li><li>Do we need to weight certain data differently than other data? Do we need regularization? Are these just parameters that we assume are part of the algorithm? </li><li>There's no planning stage. You probably need to plan what algorithms you're going to use before you do anything, including cleaning (maybe your algorithm is robust to outliers so you don't have to "clean" those out! "Data cleaning" gives me the heebie jeebies.). </li><li>Furthermore, there's no mention of the application and the desired outcomes: the specific application will have an effect on cleaning, feature selection/extraction, and algorithm choice. </li><li>There's no step about feature selection. You have to watch out for collinearity and whether your features are actually predicting anything useful. </li></ul><div>A lot of the above is probably wrapped up in the "iterating" step or can be summed up by "need to better explain planning, testing, and choosing parameters (including algorithms and input features)." Those seem slightly harder than the rest of the pipeline to me. The planning and choosing parameters because you don't necessarily know beforehand what setup is going to work best, and the testing because, well, testing is sometimes hard. I've literally never come across anyone talking about how they test their machine learning code (only how they validate). There are probably data sets out there with known results for certain (basic) algorithms because people make tutorials about learning machine learning. </div><div><br /></div><div>Maybe, for say, a neural network, there are some simple inputs (step functions on one input at a time?)
where you can trace the effect they have on the whole network to make sure the network edges are set up properly. You might set all weights to 1 to make verification easier. This is my default for testing linear systems so it isn't necessarily all we need to test a nonlinear system, but it's a start. <i>Could you use just step function inputs with all weights set to 1 (and biases set to... 0? something else) to verify that your network topology is as you expect?</i> Hmmm, this actually feels like a kind of interesting side of machine learning algorithms: not just how you write them, but how you test them. Are the algorithms easy to test? What makes an implementation easier to test? Does anyone need such a test for network topology, since there are libraries to set up neural nets? Perhaps if someone got into pruning neural nets (using L1 regularization, maybe?) and they wanted to see what the output of their algorithm was... </div><div><br /></div><div><br /></div><div><i>NB</i>: Why do people call it "multicollinearity"? Why does "collinearity" have two "L"s? Shouldn't it just be "co" and "linearity" put together into "colinearity," where the "multi" and the second "L" are redundant? What am I missing?</div></div><br /><br /><h2>neural nets: brute force? (2018-03-21)</h2>I don't think neural nets are brute force or black boxes, but that's what I've heard some people say. So that's what I thought for a little while...<br /><b><br /></b>Neural nets seem like brute force because you generally need huge training sets to get good results out of them (no comment on what "huge" and "good results" mean).<br /><br />Now, to make a comparison between brute force and neural nets, I do want to first "define" what brute force is.
To me, brute force for a neural network would be: Take all possible inputs and all possible configurations of the neural net parameters. Find the error associated to each neural net parameter set by computing the error for each input and then combining those errors into a total error for that parameter set (how do we measure and combine errors? I don't know, but I hope it's done continuously). Do this for all parameter sets and choose a parameter set with a minimal error.<br /><br />Caveat: does a minimum exist? An infimum does, but whether that's attainable is up to analysis. Neural nets are continuous maps from input to output, and for fixed input the map from parameters to error should be continuous, and we should probably define the parameters on a closed domain (but that depends on the applications I guess). Is that domain compact? That's really up to the application... So "brute force" doesn't always even give us a parameter set unless we have a compact domain for the parameters. Ew... well, unimportant. The point is that neural nets are not brute force, even if they can have issues with stabilizing at a set of parameters. That's something to look into: <i>When do optimal parameters for a neural net exist?</i><br /><br />One argument I've heard against the "brute force" accusation (strong word choice much?) is using the example of image processing: How big is the configuration space for an image, and do we think a neural net is using all that information?<br /><br /><h3><b>configuration space of a 720p rgb image</b></h3>If you look at just one pixel, you have to specify the red, green, and blue components, each of which can take integer values between 0 and 255 inclusive. That gives 256^3=16,777,216 different colors that each pixel can represent.<br /><br />If you have 1280 by 720 pixel images then you get 921,600 pixels total. So, for a 720p rgb image, we're looking at 16,777,216^921,600 configurations. Which is a lot.
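Just how large "a lot" is can be checked with a few lines of plain Python (a quick sketch; the only assumptions are the pixel dimensions and 8-bit color depth above):

```python
import math

# One pixel: 8 bits each of red, green, and blue.
colors_per_pixel = 256 ** 3      # 16,777,216 colors per pixel
pixels = 1280 * 720              # 921,600 pixels in a 720p image

# colors_per_pixel ** pixels is far too big to write out directly,
# so count its decimal digits via the base-10 logarithm instead.
digits = pixels * math.log10(colors_per_pixel)

print(colors_per_pixel)          # 16777216
print(round(digits))             # roughly 6.66 million digits in the count
```

So the number of 720p configurations has on the order of 6.7 million decimal digits.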
For context, apparently there are 10^82 atoms in the known, observable universe <a href="https://www.universetoday.com/36302/atoms-in-the-universe/" target="_blank">[1]</a>. So, do we really think neural nets are working through every conceivable image and updating accordingly?<br /><br />Well, no one is training neural nets on 16,777,216^921,600-image training sets. But usually the images of interest do not span that whole configuration space; they probably cluster in a couple of areas. Perhaps if you restrict to just those areas, neural nets are basically brute-forcing it.<br /><br />So maybe this argument is a bust? I can't really finish it satisfactorily at this point in time.<br /><br /><h3>looking at the algorithm of neural nets</h3><div>An argument that makes better sense to me for why neural nets are not brute force: neural nets are actually tracing a path through their configuration space and looking for a local minimum of a cost function. That in no way implies they are trying every conceivable combination of parameter values. They're only looking at (a discrete sampling of) a 1-dimensional piece of a definitely more than 1-dimensional parameter space! Neural nets don't always even reach the optimal configuration because they get stuck on local minima of the cost function. Brute force would have at least gotten the global minimum (again, if it exists...)!</div><div><br /></div><h3>in conclusion...</h3><div>Neural nets are not brute force. They may seem like it because you start a neural net with "no" (ie, comparatively little) information about the system it's predicting and then throw a whole lot of training data at it to make it work. And some neural nets do work really, really well. But even though they're "slow" to train, they're not as slow as brute force. And even though neural nets can work really, really well, there are rarely (never? only sometimes?) 
guarantees that they produce (or even approach) optimal parameters, whereas brute force should be perfect in that regard. </div><div><br /></div><div>And "brute force" isn't even a good analogy for neural nets! It's more like the neural nets make fewer assumptions, and so they need more training than models that make more assumptions (and thus start off kind of "pre-trained"). They're still a kind of heuristic, without all the glorious guarantees of optimal parameters and horrendously long computation times that come attached to brute force. </div><br /><br /><h2>functional programming: from confusion to warm and fuzzy feelings (2018-03-21)</h2>When I first heard of functional programming, it was explained as "stateless." From the object-oriented perspective I was raised with, this seemed crazy. How could you make something stateless? How do you organize everything? Well, you just pass information from function to function. But still, I didn't get it. What do you mean functions always do the same thing, no matter what you pass in? You can pass in all sorts of different information as input? What?<br /><br />Then, Dmitri Vagner said that functional programming is just a categorical way of thinking about programming, and it all made sense. Of course functions don't update state! They're just functions, and they have an input and an output but they don't change the input; they just give a new object (object?) that is the output. A program just has a bunch of compositions of functions to get done what you need to get done, and you can say things about the compositions of functions without talking about specific inputs since you can say things about <i>all</i> the inputs at once (it's a function. It inherently knows about its domain and codomain.
functions are often the sole focus of study because functions hold all of the information, while objects can't even tell how they relate to one another!). Of course functional programming is (maybe? sometimes?) more efficient, since you can use your functor for anything in the category you're working in (and some functors work for many categories!). Of course you don't need state for that. You set up a useful system, and to do a specific example you just plug in the input and get some output.<br /><br />And the immutable data of functional programming is not so scary; it's just different. Instead of updating an object (and possibly losing track of what you have and haven't done to it) you just make another one. And immutable data is just about the opposite of a global (mutable) variable, which is the bane of debugging and one of my biggest pet peeves. Macaulay2, I will never forgive you for having definition by equals sign give a global variable. It is sacrilege! But functional programming, you're newly added to my to-do list. You're kind of near the end because you're not necessary at the moment, but you're on there.ashleighhttp://www.blogger.com/profile/03982922819686581455noreply@blogger.com0tag:blogger.com,1999:blog-5743731107463078725.post-4846224458594639102018-03-20T12:21:00.001-04:002018-04-06T08:23:31.006-04:00linear methods of machine learning: basic regression and classification by least squares<h3>linear regression by least squares</h3>Plot training data and the hyperplane defined by the linear regression output (\(Y=X\beta\)) (extend the \(X\) vector by one dimension to include an extra \(1\) to correspond to the constant term). Draw the errors, which connect points in the training data \((x_i,y_i)\) to the points that would be predicted by the model \((x_i,x_i\beta)\) (when X and Y are scalar, these are vertical lines on the graph).
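As a concrete sketch of this setup (NumPy is my choice here, not something from the post; the toy data is made up), fitting \(\beta\) and computing those errors might look like:

```python
import numpy as np

# Toy training data: 5 scalar inputs with a roughly linear response.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Augment each input with a leading 1 so the constant term beta_0
# fits into the same matrix equation Y = X beta.
X = np.column_stack([np.ones_like(x), x])

# Least-squares fit; lstsq returns the beta minimizing ||y - X beta||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The "errors" drawn in the text: differences between observed y_i and
# predicted x_i beta. Least squares minimizes the sum of their squares.
residuals = y - X @ beta
print(beta, np.sum(residuals**2))
```

Here `beta[0]` is the intercept \(\beta_0\) and `beta[1]` the slope, and the residuals are exactly the vertical lines in the scalar picture.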
We want to minimize the sum of the squares of these errors, which you can do by taking the derivative of the sum of squares with respect to \(\beta\), setting that to zero, and looking at the solutions (and endpoints) to find the smallest one (which does exist because all values are non-negative). Setting the derivative to zero gives the normal equations \(X^TX\beta = X^TY\), so when \(X^TX\) is invertible the minimizer is \(\hat\beta = (X^TX)^{-1}X^TY\).<br /><br />Note: \(x_i\) input vectors are augmented with a 1 inserted into the first index of the vector. This is so the offset coefficient, \(\beta_0\), can be included in the matrix notation. Then we can consider this as a linear problem instead of an affine one, and the only sacrifice seems to be just increasing the dimension by 1.<br /><br /><h3>classification using a linear model</h3>To use a linear model for classification, say with two categories, assign one class as 1 and the other as 0 (for n>2 classes, I assume you start using the standard basis vectors \(e_1,\ldots,e_n\) to denote each class, but then a hyperplane only splits Euclidean space into two parts...). Use that as the Y for the input X in your training data.<br /><br />What we want is a hyperplane in the same space as X that separates the Y=0 and Y=1 points (which is a hyperplane of dimension dim(X)-1). When you do linear regression on this training data, you get a hyperplane in ambient dimension dim(X)+dim(Y)=dim(X)+1 (thus this hyperplane is dimension dim(X), which is not what we want). To find the hyperplane in the ambient space of X that tries to separate the data, find the hyperplane \(\{X : X\beta=0.5\}\). This takes our linear regression hyperplane down one dimension to dim(X)-1, and is still linear.<br /><br />Why did we choose \(X\beta=0.5\) in particular? What's special about 0.5? Well, it's halfway between Y=0 and Y=1.
And, it's the same cutoff we use for our predictions: if \(X\beta > 0.5\), then that point is classified as Y=1; else, Y=0.ashleighhttp://www.blogger.com/profile/03982922819686581455noreply@blogger.com0tag:blogger.com,1999:blog-5743731107463078725.post-14097875118286034732018-03-18T09:53:00.000-04:002018-03-18T09:53:20.724-04:00overfitting in neural nets<h2>the issue: overfitting</h2>A neural network is a model, which raises the question: when is the model susceptible to overfitting?<br /><div><br /></div><div>(Overfitting is when a model fits a training set extremely well, to the point that it learns the idiosyncrasies of the data chosen, including the specific errors. As a result the model fares poorly on data outside the training set; ie, it doesn't generalize well.)</div><div><br /></div><div>"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." - John von Neumann, suggesting that using four or more parameters in a model almost always leads to overfitting. </div><div><br /></div><div>So are four parameters too many for a neural net? I hope not, because that would make a very restrictive upper bound on the size of useful neural nets. </div><div><br /></div><h2>the solutions</h2><div>Obviously researchers have determined that neural nets with more than even five parameters can still be useful. So, why is that, or what tricks do people use to prevent overfitting?</div><div><br /></div><h3>choosing an appropriate network topology</h3>If you make a neural network too big for the application at hand, it might overfit your training set. Finding the appropriate size and topology of a neural net without trial and error sounds hard, though. But at least neural nets with more than five parameters have been useful, so von Neumann's bound for overfitting doesn't seem to apply.<br /><br /><h3><b>regularization</b></h3><div>This involves adding terms to the cost function.
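Written out (using \(C_0\) for the unregularized cost and \(\Omega\) for the penalty, which are my labels, not notation from this post), the regularized cost has the shape

```latex
C(w) \;=\; C_0(w) \;+\; \lambda\,\Omega(w),
\qquad
\Omega(w) \;=\; \sum_i |w_i| \ \ (\text{L1})
\quad\text{or}\quad
\Omega(w) \;=\; \sum_i w_i^2 \ \ (\text{L2}).
```

Larger \(\lambda\) trades training fit for smaller weights.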
For Lp regularization (generally with p=1 or p=2), add a penalty \(\lambda\sum_i |w_i|^p\) over the weight parameters \(w_i\) to the cost function, where \(\lambda\) is the regularization strength. L2 regularization prevents any one weight from becoming too large, because apparently having "peaky" weights is bad. L1 regularization just keeps the total of the absolute values of the weights from becoming too big, which tends to drive some of the weights to almost zero. That's kind of cool, that L1 regularization essentially sparsifies the neural net. Apparently L2 regularization works better, though.<br /><br />Notice that we didn't regularize all the parameters; we left out the bias parameters. Some justifications I've seen are that<br /><br /><ul><li>there are so few bias parameters compared to weight parameters that it doesn't matter</li><li>bias parameters aren't multiplicative on the inputs so they don't matter as much?</li><li>tests show that adding bias regularization doesn't hurt performance but it doesn't help either, so we should just leave bias regularization out of our neural nets. </li></ul></div><h3><b>early stopping</b></h3><div>Implementing early stopping in your machine learning algorithm means you have some rule that you check at regular intervals while training. The rule tells you whether it's ok for the model to keep training or whether you need to stop training now, despite still having more training data to use.<br /><br />An example of an early stopping rule is holdout-validation. Essentially, you break your training set up into a smaller training set and a validation set. At regular intervals, you test the model on the validation set and record how much "generalization error" (under what metric I have no idea) the model has at that point. At some point, the pattern of the recorded validation set errors will indicate that overtraining has begun (the simplest check is that once the error increases instead of decreasing, you've hit overtraining).
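A minimal sketch of that holdout check (the `train_step` and `val_error` functions here are stand-ins for whatever training procedure and error metric you actually use, and `patience` is my knob for "how many increases before we give up"):

```python
def early_stopping_fit(train_step, val_error, max_steps, patience=3):
    """Train until validation error stops improving.

    train_step(t) performs one unit of training and returns the model
    after step t; val_error(model) measures generalization error on a
    held-out validation set. Both are placeholders for a real setup.
    """
    best_model, best_err, bad_steps = None, float("inf"), 0
    for t in range(max_steps):
        model = train_step(t)
        err = val_error(model)
        if err < best_err:
            # Still improving: remember this model and reset the counter.
            best_model, best_err, bad_steps = model, err, 0
        else:
            # Validation error went up: possible start of overtraining.
            bad_steps += 1
            if bad_steps >= patience:
                break
    return best_model, best_err
```

Because the loop keeps the best model seen so far, stopping late is harmless: you still walk away with the model from just before the validation error started climbing.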
Take the last version of the model before you hit overfitting as your final model.<br /><br /></div><h3><b>adding noise</b></h3><div>I've heard that some people add noise to training data in order to prevent overfitting. I don't remember what this technique is called or exactly what it entails. Maybe I'll find it soon. </div>ashleighhttp://www.blogger.com/profile/03982922819686581455noreply@blogger.com0tag:blogger.com,1999:blog-5743731107463078725.post-17337338962643466722018-03-17T15:56:00.001-04:002018-03-21T08:53:29.902-04:00resourcesSome resources that have been recommended to me or I've found for machine learning:<br /><br /><h2>Visualizers</h2><a href="http://playground.tensorflow.org/" target="_blank">A Neural Network Playground</a> visualizes a neural net and lets you fiddle with parameters and topology. By Daniel Smilkov and Shan Carter.<br /><br /><h2>Blogs</h2><a href="https://brohrer.github.io/blog.html" target="_blank">Data Science and Robots</a> is a blog with mostly video content about basic machine learning topics by Brandon Rohrer.<br /><div><br /></div><h2>Textbooks</h2><a href="http://neuralnetworksanddeeplearning.com/" target="_blank">Neural Networks and Deep Learning</a> is a free online textbook by Michael Nielsen.<br /><br /><a href="https://web.stanford.edu/~hastie/Papers/ESLII.pdf" target="_blank">Elements of Statistical Learning 2nd ed</a> (ESL) is a reference on machine learning topics and assumes statistical/mathematical background. By Hastie, Tibshirani and Friedman. Can be found free online.<br /><br /><a href="http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf" target="_blank">Introduction to Statistical Learning: With Applications in R</a> (ISL) is a reference on machine learning topics that is apparently simpler than ESL. By James, Witten, Hastie and Tibshirani.
Can be found free online.<br /><br />Handbook: Statistical foundations of machine learning by Gianluca Bontempi is supposed to have many fewer prerequisites than the ISL or ESL. Can be found free online.ashleighhttp://www.blogger.com/profile/03982922819686581455noreply@blogger.com0tag:blogger.com,1999:blog-5743731107463078725.post-91111302785697565462018-03-17T15:50:00.001-04:002018-04-06T08:28:39.987-04:00neural nets: black boxes?Almost 10 years ago I heard neural networks were essentially black boxes: once trained, their parameters meant nothing to humans. They were weird algebraic (not even algebraic?) combinations of (possibly human-meaningful) features, but there was no way to know which features were involved and what combinations of the features contributed to each parameter.<br /><br />Definition: deep neural network. A neural network with more than one hidden layer.<br /><br />Now with all the deep learning research, I see that there's a meaningful structure to neural nets. Each hidden layer is like its own (non-deep) neural net, and spits out a vector of features. Those features are the input to the next layer, and that second layer combines them to get more complex features. So you end up with this hierarchy, where the first layer makes the simplest features and they combine and build up over the layers to make more and more complex features, until the output of the neural net is a feature vector representing the most intricately detailed features of all.<br /><br />I saw an example somewhere (can't find it now...) of a neural net with 3 hidden layers where the researchers somehow got the first layer to represent lines and curves of different shapes, the second hidden layer assembled these curves into the facial features we're all familiar with, and the final hidden layer produced full faces, each of which was significantly different from the next.
That's a great example of how the layers of a neural net build upon one another.<br /><br />Definition(?): deep learning. A machine learning paradigm where there are many layers or levels of complexity, where each level builds upon the one preceding it to add a level of abstraction. Deep neural networks are an example of deep learning.<br /><br /><b>stuff i still don't know</b><br /><br />I don't know what kind of tuning these facial-recognition researchers did to get such human-meaningful features out of their neural net. I don't know if that's a common thing, or if this was a nice example that doesn't generalize. I would guess that it's the latter, and that a lot of the parameters that neural networks decide on are essentially meaningless to us outside of the context of the rest of the neural net.<br /><br />I also don't know <i>what added complexity a neural network can handle if it's "wide" (as opposed to deep).</i> For comparison, if adding more layers (and thus turning from a neural network to a deep neural network) adds a hierarchy to the feature extraction in neural nets, what does adding width do? That is, what happens when you add more nodes to a layer?ashleighhttp://www.blogger.com/profile/03982922819686581455noreply@blogger.com0tag:blogger.com,1999:blog-5743731107463078725.post-80372365447629876172018-03-15T18:56:00.001-04:002018-03-22T09:13:33.555-04:00my introduction to neural networksThe first encounter: My advisor said neural networks were slow and couldn't accomplish complicated behavior way back in 2010. No comment on whether that was true then or is true now.<br /><br />From my second discussion about neural nets I "learned" that neural nets are essentially brute force. <br /><br />Darryl Wade was my third neural nets discussion partner and the first person to describe neural nets to me as a rich topic, the first to explain them as more than a trendy black box.
Thanks, Darryl.<br /><br /><b>what's a neural network?</b> <br /><b><br /></b>Say you have some data and some features you want to extract from that data. If you put together a training set of data (input data + their corresponding features/ground truth) you can (try to) set up a neural network to do your feature extraction for any data you collect in the future.<br /><br />The setup is this: the neural network is made of nodes organized into "layers." You have some input nodes, which map into the first layer of nodes, which map into the next layer of nodes ... which map into the output nodes. Nodes in the same layer do not map into each other.<br /><br />A deep neural network just has more than one layer of internal/hidden/latent (not input or output) nodes.<br /><br />Each node takes in input, does something non-linear to it, and passes the result on to the next node. The parameters of the neural net are what define the non-linear functions in each node. When you put training data through the neural net you compare the output (which represents the features) to the ground truth and use the difference of the two to tweak the parameters of the neural net.<br /><br />Ok, but really, how do you tweak those parameters? As it trains, the neural network traces a path through the parameter space using hill climbing. The hill climbing is optimizing for a cost function (distance between output and ground truth). Apparently this has a fancy name ... backpropagation. Why do we need a fancy neural nets name for hill climbing? This idea will need more exploration.<br /><br /><br /><b>questions that arise from this "definition" </b><br /><br />For a system that I originally thought was grab and go (just set up your cost function and flip the switch!), there sure are a lot of choices to be made before you start training.
How does choosing all of the following affect the predictive value of final parameters and the time needed to reach parameters that are "good enough?"<br /><ul><li>initial conditions</li><li>ordering of training data</li><li>step size for hill climbing (once you've picked the direction you're going to move in the parameter space, how far do you go?)</li><li>nonlinear functions in each node</li><li>metric for comparing neural net output to training data ground truth</li><li>And what might interest me the most (as of now): topology of the neural net</li></ul><div>And there are probably more ... </div><div><br /></div><div><b>ideas for future posts and more questions </b></div><div><ul><li>Neural nets are not brute force. </li><li>How does backpropagation work and what problem is it fixing? </li><li>What are other ways to use neural nets? </li><li>What do the layers do? hierarchical feature extraction?</li><li>What happens when the topology of the neural net has directed cycles?</li><li>Can you start with an overkill neural net (lots of layers and lots of nodes) and then prune it?</li><li>Are there ever substitutions/refactorings of the nodes that give you the same output with related response to training? Maybe you could condense a neural network or refactor it into one that's better understood. </li></ul></div>ashleighhttp://www.blogger.com/profile/03982922819686581455noreply@blogger.com0tag:blogger.com,1999:blog-5743731107463078725.post-13240848935604000392018-03-15T16:44:00.000-04:002018-04-06T08:32:05.464-04:00first stepsThis is a place to keep notes on interesting ideas I come across and questions those ideas elicit. The motivation for starting a blog now is the existential "crisis" I'm wandering into as I try to find a new place (field?) to work post-PhD. Now, at the start, it looks like the focus of the blog will be the interplay of geometry, topology, and machine learning, possibly with some specific examples in perception. 
But the topics are mostly dependent on where curiosity lures me. ashleighhttp://www.blogger.com/profile/03982922819686581455noreply@blogger.com0