Friday, April 13, 2018

data cleaning

I saw a talk the other day by Anna Gilbert about data cleaning, and it was surprisingly interesting!

I had never thought of data cleaning as interesting (or acceptable) because it always seemed like people were throwing out inconvenient data points willy-nilly. Like people would arbitrarily decide what an outlier was and just throw those data out. Horrifying.

Anna talked about the problem of working with noisy pairwise distance data. Distance data is supposed to come from a metric, but noise can break the triangle inequality: you can end up with triples of points where d(x, z) > d(x, y) + d(y, z). Having an honest metric is usually useful downstream for other computations and can give better guarantees or results. So, in metric repair, the idea is to adjust some of the distances so that all the triangle inequalities are satisfied.
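
To make that concrete, here's a minimal sketch (my own illustration in Python with NumPy, not anything from the talk) of what checking a noisy distance matrix for triangle inequality violations might look like:

import numpy as np

def triangle_violations(D, tol=1e-9):
    """Return (i, j, k) triples where D[i, j] > D[i, k] + D[k, j] + tol.

    D is assumed to be a symmetric matrix of pairwise "distances" with
    zeros on the diagonal; noise may have broken the triangle inequality.
    """
    n = D.shape[0]
    bad = []
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                if D[i, j] > D[i, k] + D[k, j] + tol:
                    bad.append((i, j, k))
    return bad

# Toy example: three points on a line, with one distance corrupted upward.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])
print(triangle_violations(D))  # [(0, 2, 1)]: 5.0 > 1.0 + 1.0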

Anna's idea is to do sparse metric repair, which means changing as few distances as possible while making the distances satisfy the triangle inequality.
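
For contrast, if you're only allowed to decrease distances, there's a clean classical fix: replace each entry by the shortest-path distance through the other points. That can only lower entries and it removes every violation, but it makes no attempt to change as few entries as possible. A rough sketch (again my own illustration, not the algorithm from the talk):

import numpy as np

def decrease_only_repair(D):
    """Floyd-Warshall shortest-path closure of a pairwise distance matrix.

    Replacing each D[i, j] by the length of the shortest i-j path through
    the other points yields distances that satisfy every triangle
    inequality, and every entry can only decrease. This is a baseline
    repair; it does NOT try to minimize how many entries change, which is
    what the sparse version asks for.
    """
    D = D.copy().astype(float)
    n = D.shape[0]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i, k] + D[k, j] < D[i, j]:
                    D[i, j] = D[i, k] + D[k, j]
    return D

D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])
print(decrease_only_repair(D))
# The corrupted 5.0 drops to 2.0; the other entries are untouched.

On this toy matrix only the corrupted entry changes, which is the sparse behavior you'd hope for, but in general the shortest-path closure can rewrite lots of entries. Sparse metric repair insists on touching as few as possible.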

She didn't have a specific application in mind. Someone in the audience pointed out that since all sensors have a little bit of noise, requiring sparsity isn't really necessary. My thought is that yes, sensors are noisy, but they usually give values close to the ground truth. Sometimes, though, a sensor malfunctions and gives a reading that is way off. If you were trying to fix those extreme outliers, you would want to change as few distances as possible, while allowing the ones you did change to move by a large amount. Possible applications include distributed sensing, robot swarms, and DNA sequencing for phylogenetic tree reconstruction.
