Visualizing Convolution Kernels

I very rarely see any sort of inspection done on the convolutional kernels of a CNN. In part this is because the parameters themselves are far more difficult to interpret than the outputs of a network (even intermediate outputs, a.k.a. the network activations). This difficulty of interpretation is worst for kernels with a small spatial footprint, and unfortunately 3x3 kernels are the most performant and popular choice. Trying to understand the structure of a 3x3 convolution kernel by looking at all of the possible 3x3 spatial slices is somewhat like trying to guess what a full image looks like from being shown all of its 3x3 chunks in random order.
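To make that "3x3 chunks in random order" problem concrete, here is a minimal sketch (using an untrained PyTorch conv layer purely as a stand-in; in practice you would pull the layer out of a trained network) that plots every 3x3 spatial slice of a single kernel tensor:

```python
import matplotlib.pyplot as plt
import torch.nn as nn

# A stand-in convolution layer; in practice this would come from a trained network.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

# The weight tensor has shape (out_channels, in_channels, 3, 3):
# one 3x3 spatial slice per (output channel, input channel) pair.
weights = conv.weight.detach().numpy()
n_out, n_in = weights.shape[:2]

fig, axes = plt.subplots(n_out, n_in, figsize=(n_in, n_out))
for i in range(n_out):
    for j in range(n_in):
        axes[i, j].imshow(weights[i, j], cmap="gray")
        axes[i, j].axis("off")
plt.show()
```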

Despite the difficulties, I think good kernel visualizations are a worthwhile pursuit. Good visualization techniques can be powerful diagnostics, and the better we can visualize our models, the more powerful and robust we can make them. As a motivational carrot, here is a teaser plot of a visualization of a simple network, which we generate in this post.

[Teaser plot: kernel visualization of a simple network]

Read more…

Convolution Filters in Neural Networks are Actually Correlation Filters

The phrase "convolution" as used in the context of neural networks doesn't mean the same thing as when it is used in other contexts (for example numpy.convolve or scipy.signal.convolve). Instead of "convolution" the term should probably be "correlation" in order to line up with the terminology everyone else uses. The distinction isn't often important when dealing with neural networks, but every so often it comes back to bite me when I make assumptions about the behavior of neural network "convolutions" based on the mathematician's definition of a convolution.
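Here is a small numpy/scipy sketch of the distinction. The identity at the end is the whole story: a true convolution is a correlation with the kernel flipped along both spatial axes, and neural network libraries (PyTorch and TensorFlow, for example) implement the un-flipped version.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

image = np.random.randn(5, 5)
kernel = np.random.randn(3, 3)

# The mathematician's convolution flips the kernel before sliding it over the image...
conv_out = convolve2d(image, kernel, mode="valid")

# ...while neural network "convolution" slides the kernel as-is, which is cross-correlation.
corr_out = correlate2d(image, kernel, mode="valid")

print(np.allclose(conv_out, corr_out))  # False (in general)
print(np.allclose(conv_out, correlate2d(image, kernel[::-1, ::-1], mode="valid")))  # True
```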

Read more…

A simple similarity function for polynomial curves

A frequently used tool in my toolbox is polynomial regression: reducing smooth curves down to just a few best fit polynomial coefficients. These coefficients can then be used for additional analysis that would have been difficult to do with the original (often patchily sampled) curves.

An important question then becomes, "how do I compare two polynomial curves?" It is tempting to use the Euclidean distance between the coefficient vectors as a distance between polynomials. However, this (dis)similarity measure doesn't correspond very well to our intuitive understanding of the differences in shape between the corresponding curves.
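As a quick illustration of the mismatch (a toy comparison, not the similarity function developed in the post): the two pairs of polynomials below are equally far apart in coefficient space but differ by very different amounts as curves.

```python
import numpy as np

# Coefficient vectors, highest degree first (the convention numpy.polyval expects).
p1, p2 = np.array([0.0, 0.0, 1.0, 0.0]), np.array([0.1, 0.0, 1.0, 0.0])  # differ in the x^3 term
q1, q2 = np.array([0.0, 0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0, 0.1])  # differ in the constant term

def coeff_distance(a, b):
    # Euclidean distance between coefficient vectors.
    return np.linalg.norm(a - b)

def curve_distance(a, b, lo=-2.0, hi=2.0, n=1000):
    # RMS difference between the two curves sampled over [lo, hi].
    x = np.linspace(lo, hi, n)
    return np.sqrt(np.mean((np.polyval(a, x) - np.polyval(b, x)) ** 2))

# Both pairs are exactly 0.1 apart in coefficient space...
print(coeff_distance(p1, p2), coeff_distance(q1, q2))
# ...but the curves themselves differ by very different amounts (roughly 0.30 vs 0.10).
print(curve_distance(p1, p2), curve_distance(q1, q2))
```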

Read more…

Loopy Locks and Graphs

A friend of mine introduced me to a long-running series of puzzles called "The Riddler" put out by 538 once a week. The harder "Riddler Classic" this week was interesting enough that I decided to give it a try, and the code I wrote while messing around with the riddle seemed worth sharing.

Read more…

Parameter Diffusion

[Figure: parameter diffusion illustration]

I love using k-fold cross validation for my machine learning projects. But especially when I am dealing with neural network models that take hours or even days to train, doing a full k-folds style analysis becomes an uncomfortably heavy computational burden. Unfortunately, for models with such long training times I usually abandon training an ensemble of models and just train one model with a single train/validation split.

I really wanted a way to get at least some of the diagnostic benefits of having an ensemble of semi-independently trained models, the way you do in k-folds, but without needing to wait days or weeks for my neural nets to train. So I started experimenting with weakly coupled mixtures of models. Instead of feeding most of the data to K otherwise independent models as in k-folds, why not feed just a fraction 1/K of the data to each model and let the models communicate about their parameters with each other in a controlled way? I thought that by cleverly controlling what information is passed between which models, how often messages are passed, and how that information may be used, I could effectively isolate the information in some data folds from the parameters of some of the models. In this way I could hopefully save some computation time over a k-fold cross validation without sacrificing all of its benefits.
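As a very rough sketch of the flavor of the idea (my own toy illustration with simple linear models, not necessarily the coupling scheme explored below): K models each take gradient steps on their own 1/K shard of the data, and every so often each one is nudged toward the ensemble mean of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: K linear models, each trained on its own 1/K shard of the data.
K, n_features, n_per_shard = 5, 10, 200
true_w = rng.normal(size=n_features)
shards = []
for _ in range(K):
    X = rng.normal(size=(n_per_shard, n_features))
    y = X @ true_w + 0.1 * rng.normal(size=n_per_shard)
    shards.append((X, y))

weights = [rng.normal(size=n_features) for _ in range(K)]
lr, couple = 0.01, 0.1  # gradient step size and coupling strength

for step in range(500):
    # Each model takes an ordinary gradient step on its own shard only.
    for k, (X, y) in enumerate(shards):
        grad = 2 * X.T @ (X @ weights[k] - y) / len(y)
        weights[k] = weights[k] - lr * grad
    # Every 10 steps the models "diffuse": each is nudged toward the ensemble mean.
    if step % 10 == 0:
        mean_w = np.mean(weights, axis=0)
        weights = [(1 - couple) * w + couple * mean_w for w in weights]

print([np.linalg.norm(w - true_w) for w in weights])
```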

Read more…

Tips for Visualizing Correlation Matrices

When dealing with data with dozens or hundreds of features, one important tool is to look at the correlations between different features as a heat map. Although it is easy to generate a correlation heat map, not all such visualizations are created equal. Here are some rules of thumb to keep in mind:

  • Limit the range of the color map to the middle 99.x% of the values
  • Use symmetric magnitude bounds
  • Use a divergent color map
  • Make zero correlation correspond to a dull, dark color (e.g. dark grey) and high magnitude correlations to high luminance colors
  • Different orderings of the features can have a huge impact; pick the ordering wisely.

Using these guidelines together almost always improves the overall quality of the visualization of a correlation or covariance matrix.

We will apply these guidelines one by one to an example data set (see below), discussing the motivation for each guideline as we apply it.
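As a rough preview of what the first few rules look like in matplotlib (the data here is a random stand-in, and the hand-rolled dark-centered color map is just one way to satisfy the luminance guideline):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Stand-in data: a random correlation matrix over 40 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40)) @ rng.normal(size=(40, 40))
corr = np.corrcoef(X, rowvar=False)

# Symmetric color bounds, clipped to the middle ~99% of the off-diagonal values.
off_diag = corr[~np.eye(len(corr), dtype=bool)]
vmax = np.percentile(np.abs(off_diag), 99.5)

# A divergent color map that is dark in the middle (zero correlation) and bright at the extremes.
cmap = LinearSegmentedColormap.from_list("dark_diverging", ["#00bfff", "#303030", "#ff8c00"])

plt.imshow(corr, cmap=cmap, vmin=-vmax, vmax=vmax)
plt.colorbar(label="correlation")
plt.show()
```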

Read more…

PCA and probabilities

Principal Component Analysis (PCA) is frequently applied in machine learning as a sort of black box dimensionality reduction technique. But PCA can also be arrived at as the expression of a best fit probability distribution for our data. Treating PCA as a probability distribution opens up all sorts of fruitful avenues: we can draw new examples from the learned distribution and/or evaluate the likelihood of samples as we observe them in order to detect outliers.
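As a minimal sketch of the outlier detection angle, using scikit-learn's PCA, which exposes the per-sample log-likelihood of the underlying probabilistic model:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Low-rank stand-in data with a bit of added noise.
X = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(1000, 20))

pca = PCA(n_components=2).fit(X)

# score_samples returns the log-likelihood of each point under the fitted probabilistic PCA model,
log_lik = pca.score_samples(X)
# so unusually low-likelihood points are natural outlier candidates.
outliers = X[log_lik < np.percentile(log_lik, 1)]
print(outliers.shape)

# We can also draw new examples from the learned distribution
# (a simplified draw that ignores the small isotropic noise term).
z = rng.normal(size=(5, 2))
new_samples = pca.mean_ + (z * np.sqrt(pca.explained_variance_)) @ pca.components_
print(new_samples.shape)
```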

Read more…

SKM Embedding of MNIST

I recently thought up a machine learning algorithm called "Smooth Kernel Machines" (SKM). In this post I will try out SKM on the ever-ubiquitous MNIST dataset. The goal of this post is not so much to achieve state of the art performance on MNIST (though that would be nice) as it is to simply try out SKM on a familiar and well understood dataset.

tl;dr: I achieve a respectable 0.006 error rate using an SKM-type layer on top of a convolutional neural net feature extractor. An SKM output layer works a little better than a K-way softmax (at least for MNIST), trains faster, and comes with an accurate built-in measure of prediction confidence.

Read more…

Eigen-Techno

I recently found an analysis of techno music using principal component analysis (PCA).

https://www.math.uci.edu/~isik/posts/Eigentechno.html

The author, Umut Isik, extracted more than 90,000 one-bar clips of music from 10,000 different techno tracks and stretched them all to the same tempo of 128 bpm. Isik then fed those clips through PCA and analyzed the resulting principal vectors as well as the quality of the music approximation as the number of components grows. Since the principal vectors model sound, we can listen to them, which is rather fun. I won't cover all of the same material as Isik did in that post, and it is worth a read, so nip over and give it a look if you haven't already. Isik has very kindly bundled up the source data and made it available for others to use (there's a link at the end of the post linked above). Playing around with such a fun data set is too tempting to resist, so I'm going to do my own "eigentechno" analysis here.

Read more…

Smooth Kernel Machines

My two favorite types of machine learning models are the support vector machine (SVM) and the neural network. Neural networks are great for making sense of a huge number of training examples, each with hundreds or even hundreds of thousands of features. But neural networks don't tend to perform as well as other models on small data sets with only a few hundred data points or a very small number of features. Data sets with a few hundred to a few thousand training examples and a handful of features are the sweet spot for SVMs. But where neural networks are great at engineering their own features, SVMs require careful feature selection for good performance.

Because of their very different strengths and weaknesses it can sometimes be fruitful to combine an SVM and a neural network into a single model. The way I have seen this done previously is to use the neural network as a feature extractor for the SVM. The neural network is trained normally, and then as a post-processing step the final layer is removed and replaced with an SVM which is fed the activations of the second-to-last layer as features. The neural network's ability to compress a large amount of information down to a few relevant features and the SVM's ability to squeeze the most out of a small number of data points make a great combination.
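Here is a rough sketch of that recipe with a deliberately tiny PyTorch network and synthetic stand-in data (a real application would of course use a real dataset and architecture):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

# Synthetic stand-in data.
X = np.random.randn(500, 20).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 0).astype(np.int64)

model = nn.Sequential(
    nn.Linear(20, 32), nn.ReLU(),
    nn.Linear(32, 8), nn.ReLU(),  # second-to-last layer: the learned features
    nn.Linear(8, 2),              # final classification layer
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

xt, yt = torch.from_numpy(X), torch.from_numpy(y)
for _ in range(200):  # train the network normally
    opt.zero_grad()
    loss = loss_fn(model(xt), yt)
    loss.backward()
    opt.step()

# Post-processing step: drop the final layer and use the rest as a fixed feature extractor.
feature_extractor = nn.Sequential(*list(model.children())[:-1])
with torch.no_grad():
    feats = feature_extractor(xt).numpy()

# Train an SVM on the extracted features in place of the removed final layer.
svm = SVC(kernel="rbf", C=1.0)
svm.fit(feats, y)
print("train accuracy:", svm.score(feats, y))
```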

Although this method works, it isn't very satisfying. Often the features that work best as the last layer of a neural network are very different from the features that would work best for an SVM classifier. A fresh hyperparameter search for the SVM after every tweak to the neural network works, but it is painful and awkward. Not to mention that there is no guarantee the neural network features will be well adapted for use by the SVM, even with a hyperparameter search over kernel parameters. Wouldn't it be nice if, instead of learning an SVM as a post-processing step, we could incorporate it into the neural network architecture directly?

Read more…