Visualizing Convolution Kernels

I very rarely see any sort of inspection done on the convolutional kernels of a CNN. In part this is because the parameters themselves are far more difficult to interpret than the outputs of a network (even intermediate outputs, a.k.a. the network activations). This difficulty of interpretation is worst for kernels with a small spatial footprint, and unfortunately 3x3 kernels are the most performant and popular choice. Trying to understand the structure of a 3x3 convolution kernel by looking at all of the possible 3x3 spatial slices is somewhat like trying to guess what a full image looks like from being shown all of its 3x3 chunks in random order.
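To make that "3x3 chunks in random order" problem concrete, here is a minimal sketch (using an untrained PyTorch conv layer purely as a stand-in; in practice you would pull the layer out of a trained network) that plots every 3x3 spatial slice of a single kernel tensor:

```python
import matplotlib.pyplot as plt
import torch.nn as nn

# A stand-in convolution layer; in practice this would come from a trained network.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

# The weight tensor has shape (out_channels, in_channels, 3, 3):
# one 3x3 spatial slice per (output channel, input channel) pair.
weights = conv.weight.detach().numpy()
n_out, n_in = weights.shape[:2]

fig, axes = plt.subplots(n_out, n_in, figsize=(n_in, n_out))
for i in range(n_out):
    for j in range(n_in):
        axes[i, j].imshow(weights[i, j], cmap="gray")
        axes[i, j].axis("off")
plt.show()
```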

Despite the difficulties, I think good kernel visualizations are a worthwhile pursuit. Good visualization techniques can be powerful diagnostics, and the better we can visualize our models, the more powerful and robust we can make them. As a motivational carrot, here is a teaser plot of a visualization of a simple network, which we generate in this post.

[Teaser plot: kernel visualization of a simple network]

Read more…

Convolution Filters in Neural Networks are Actually Correlation Filters

The phrase "convolution" as used in the context of neural networks doesn't mean the same thing as when it is used in other contexts (for example numpy.convolve or scipy.signal.convolve). Instead of "convolution" the term should probably be "correlation" in order to line up with the terminology everyone else uses. The distinction isn't often important when dealing with neural networks, but every so often it comes back to bite me when I make assumptions about the behavior of neural network "convolutions" based on the mathematician's definition of a convolution.
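Here is a small numpy/scipy sketch of the distinction. The identity at the end is the whole story: a true convolution is a correlation with the kernel flipped along both spatial axes, and neural network libraries (PyTorch and TensorFlow, for example) implement the un-flipped version.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

image = np.random.randn(5, 5)
kernel = np.random.randn(3, 3)

# The mathematician's convolution flips the kernel before sliding it over the image...
conv_out = convolve2d(image, kernel, mode="valid")

# ...while neural network "convolution" slides the kernel as-is, which is cross-correlation.
corr_out = correlate2d(image, kernel, mode="valid")

print(np.allclose(conv_out, corr_out))  # False (in general)
print(np.allclose(conv_out, correlate2d(image, kernel[::-1, ::-1], mode="valid")))  # True
```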

Read more…

A simple similarity function for polynomial curves

A frequently used tool in my toolbox is polynomial regression: reducing smooth curves down to just a few best fit polynomial coefficients. These coefficients can then be used for additional analysis that would have been difficult to do with the original (often patchily sampled) curves.

An important question then becomes, "how do I compare two polynomial curves?" It is tempting to use the Euclidean distance between the coefficient vectors as a distance between polynomials. However, this (dis)similarity measure doesn't correspond very well to our intuitive understanding of the differences in shape between the corresponding curves.
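As a quick illustration of the mismatch (a toy comparison, not the similarity function developed in the post): the two pairs of polynomials below are equally far apart in coefficient space but differ by very different amounts as curves.

```python
import numpy as np

# Coefficient vectors, highest degree first (the convention numpy.polyval expects).
p1, p2 = np.array([0.0, 0.0, 1.0, 0.0]), np.array([0.1, 0.0, 1.0, 0.0])  # differ in the x^3 term
q1, q2 = np.array([0.0, 0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0, 0.1])  # differ in the constant term

def coeff_distance(a, b):
    # Euclidean distance between coefficient vectors.
    return np.linalg.norm(a - b)

def curve_distance(a, b, lo=-2.0, hi=2.0, n=1000):
    # RMS difference between the two curves sampled over [lo, hi].
    x = np.linspace(lo, hi, n)
    return np.sqrt(np.mean((np.polyval(a, x) - np.polyval(b, x)) ** 2))

# Both pairs are exactly 0.1 apart in coefficient space...
print(coeff_distance(p1, p2), coeff_distance(q1, q2))
# ...but the curves themselves differ by very different amounts (roughly 0.30 vs 0.10).
print(curve_distance(p1, p2), curve_distance(q1, q2))
```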

Read more…

Loopy Locks and Graphs

A friend of mine introduced me to a long-running series of puzzles called "The Riddler" put out by 538 once a week. The harder "Riddler Classic" this week was interesting enough that I decided to give it a try, and the code I wrote while messing around with the riddle seemed worth sharing.

Read more…

Parameter Diffusion

[Figure: parameter diffusion illustration]

I love using k-fold cross validation for my machine learning projects. But especially when I am dealing with neural network models that take hours or even days to train, doing a full k-folds style analysis becomes an uncomfortably heavy computational burden. Unfortunately, for models with such long training times I usually abandon training an ensemble of models and just train one model with a single train/validation split.

I really wanted a way to get at least some of the diagnostic benefits of having an ensemble of semi-independently trained models, the way you do in k-folds, but without needing to wait days or weeks for my neural nets to train. So I started experimenting with weakly coupled mixtures of models. Instead of feeding most of the data to K otherwise independent models as in k-folds, why not feed just a fraction 1/K of the data to each model and let the models communicate about their parameters with each other in a controlled way? I thought that by cleverly controlling what information is passed between which models, how often messages are passed, and how that information may be used, I could effectively isolate the information in some data folds from the parameters of some of the models. In this way I could hopefully save some computation time over a k-fold cross validation without sacrificing all of its benefits.
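As a very rough sketch of the flavor of the idea (my own toy illustration with simple linear models, not necessarily the coupling scheme explored below): K models each take gradient steps on their own 1/K shard of the data, and every so often each one is nudged toward the ensemble mean of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: K linear models, each trained on its own 1/K shard of the data.
K, n_features, n_per_shard = 5, 10, 200
true_w = rng.normal(size=n_features)
shards = []
for _ in range(K):
    X = rng.normal(size=(n_per_shard, n_features))
    y = X @ true_w + 0.1 * rng.normal(size=n_per_shard)
    shards.append((X, y))

weights = [rng.normal(size=n_features) for _ in range(K)]
lr, couple = 0.01, 0.1  # gradient step size and coupling strength

for step in range(500):
    # Each model takes an ordinary gradient step on its own shard only.
    for k, (X, y) in enumerate(shards):
        grad = 2 * X.T @ (X @ weights[k] - y) / len(y)
        weights[k] = weights[k] - lr * grad
    # Every 10 steps the models "diffuse": each is nudged toward the ensemble mean.
    if step % 10 == 0:
        mean_w = np.mean(weights, axis=0)
        weights = [(1 - couple) * w + couple * mean_w for w in weights]

print([np.linalg.norm(w - true_w) for w in weights])
```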

Read more…

Tips for Visualizing Correlation Matrices

When dealing with data with dozens or hundreds of features, one important tool is to look at the correlations between different features as a heat map. Although it is easy to generate a correlation heat map, not all such visualizations are created equal. Here are some rules of thumb to keep in mind:

  • Limit the range of the color map to the middle 99.x% of the values
  • Use symmetric magnitude bounds
  • Use a divergent color map
  • Make zero correlation correspond to a dull, dark color (e.g. dark grey) and high magnitude correlations to high luminance colors
  • Different orderings of the features can have a huge impact; pick the ordering wisely.

Using these guidelines together almost always improves the overall quality of the visualization of a correlation or covariance matrix.

We will apply these guidelines one by one to an example data set (see below), discussing the motivation for each guideline as we apply it.
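As a rough preview of what the first few rules look like in matplotlib (the data here is a random stand-in, and the hand-rolled dark-centered color map is just one way to satisfy the luminance guideline):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Stand-in data: a random correlation matrix over 40 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40)) @ rng.normal(size=(40, 40))
corr = np.corrcoef(X, rowvar=False)

# Symmetric color bounds, clipped to the middle ~99% of the off-diagonal values.
off_diag = corr[~np.eye(len(corr), dtype=bool)]
vmax = np.percentile(np.abs(off_diag), 99.5)

# A divergent color map that is dark in the middle (zero correlation) and bright at the extremes.
cmap = LinearSegmentedColormap.from_list("dark_diverging", ["#00bfff", "#303030", "#ff8c00"])

plt.imshow(corr, cmap=cmap, vmin=-vmax, vmax=vmax)
plt.colorbar(label="correlation")
plt.show()
```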

Read more…

PCA and probabilities

Principal Component Analysis (PCA) is frequently applied in machine learning as a sort of black box dimensionality reduction technique. But PCA can also be arrived at as the expression of a best fit probability distribution for our data. Treating PCA as a probability distribution opens up all sorts of fruitful avenues: we can draw new examples from the learned distribution and/or evaluate the likelihood of samples as we observe them in order to detect outliers.
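As a minimal sketch of the outlier detection angle, using scikit-learn's PCA, which exposes the per-sample log-likelihood of the underlying probabilistic model:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Low-rank stand-in data with a bit of added noise.
X = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(1000, 20))

pca = PCA(n_components=2).fit(X)

# score_samples returns the log-likelihood of each point under the fitted probabilistic PCA model,
log_lik = pca.score_samples(X)
# so unusually low-likelihood points are natural outlier candidates.
outliers = X[log_lik < np.percentile(log_lik, 1)]
print(outliers.shape)

# We can also draw new examples from the learned distribution
# (a simplified draw that ignores the small isotropic noise term).
z = rng.normal(size=(5, 2))
new_samples = pca.mean_ + (z * np.sqrt(pca.explained_variance_)) @ pca.components_
print(new_samples.shape)
```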

Read more…

SKM Embedding of MNIST

I recently thought up a machine learning algorithm called "Smooth Kernel Machines" (SKM). In this post I will try out SKM on the ever-ubiquitous MNIST dataset. The goal of this post is not so much to achieve state of the art performance on MNIST (though that would be nice) as it is to simply try out SKM on a familiar and well understood dataset.

tl;dr: I achieve a respectable 0.006 error rate using an SKM-type layer on top of a convolutional neural net feature extractor. An SKM output layer works a little better than a K-way softmax (at least for MNIST), trains faster, and comes with an accurate built-in measure of prediction confidence.

Read more…

Eigen-Techno

I recently found an analysis of techno music using principal component analysis (PCA).

https://www.math.uci.edu/~isik/posts/Eigentechno.html

The author, Umut Isik, extracted more than 90,000 one-bar clips of music from 10,000 different techno tracks and stretched them all to the same tempo of 128 bpm. Isik then fed those clips through PCA and analyzed the resulting principal vectors as well as the quality of the music approximation as the number of components grows. Since the principal vectors model sound, we can listen to them, which is rather fun. I won't cover all of the same material as Isik did in that post, and it is worth a read, so nip over and give it a look if you haven't already. Isik has very kindly bundled up the source data and made it available for others to use (there's a link at the end of the post linked above). Playing around with such a fun data set is too tempting to resist, so I'm going to do my own "eigentechno" analysis here.

Read more…

Smooth Kernel Machines

My two favorite types of machine learning models are the support vector machine (SVM) and the neural network. Neural networks are great for making sense of a huge number of training examples, each with hundreds or even hundreds of thousands of features. But neural networks don't tend to perform as well as other models on small data sets with only a few hundred data points or a very small number of features. Data sets with a few hundred to a few thousand training examples and a handful of features are the sweet spot for SVMs. But where neural networks are great at engineering their own features, SVMs require careful feature selection for good performance.

Because of their very different strengths and weaknesses it can sometimes be fruitful to combine an SVM and a neural network into a single model. The way I have seen this done previously is to use the neural network as a feature extractor for the SVM. The neural network is trained normally, and then as a post-processing step the final layer is removed and replaced with an SVM which is fed the activations of the second-to-last layer as features. The neural network's ability to compress a large amount of information down to a few relevant features and the SVM's ability to squeeze the most out of a small number of data points make a great combination.
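Here is a rough sketch of that recipe with a deliberately tiny PyTorch network and synthetic stand-in data (a real application would of course use a real dataset and architecture):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

# Synthetic stand-in data.
X = np.random.randn(500, 20).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 0).astype(np.int64)

model = nn.Sequential(
    nn.Linear(20, 32), nn.ReLU(),
    nn.Linear(32, 8), nn.ReLU(),  # second-to-last layer: the learned features
    nn.Linear(8, 2),              # final classification layer
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

xt, yt = torch.from_numpy(X), torch.from_numpy(y)
for _ in range(200):  # train the network normally
    opt.zero_grad()
    loss = loss_fn(model(xt), yt)
    loss.backward()
    opt.step()

# Post-processing step: drop the final layer and use the rest as a fixed feature extractor.
feature_extractor = nn.Sequential(*list(model.children())[:-1])
with torch.no_grad():
    feats = feature_extractor(xt).numpy()

# Train an SVM on the extracted features in place of the removed final layer.
svm = SVC(kernel="rbf", C=1.0)
svm.fit(feats, y)
print("train accuracy:", svm.score(feats, y))
```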

Although this method works, it isn't very satisfying. Often the features that work best as the last layer of a neural network are very different from the features that would work best for an SVM classifier. A fresh hyperparameter search for the SVM after every tweak to the neural network works, but it is painful and awkward. Not to mention that there is no guarantee the neural network features will be well adapted for use by the SVM, even with a hyperparameter search over kernel parameters. Wouldn't it be nice if, instead of learning an SVM as a post-processing step, we could incorporate it into the neural network architecture directly?

Read more…