Convolution Filters in Neural Networks are Actually Correlation Filters

The phrase "convolution" when used in the context of neural networks doesn't mean the same thing as when it is used in other contexts (for example numpy.convolve or scipy.signal.convolve). Instead of "convolution" the term should probably be "correlation" in order to line up with the terminology that every one else uses. The distinction isn't often important when dealing with neural networks but every so often it will come to bite me when I make assumptions about the behavior of neural network "convolutions" based on the mathematicians definition of a convolution.

What's the Difference?

In signal processing the cross correlation of two signals is calculated by shifting one signal relative to the other, multiplying the stationary signal and the shifted signal together elementwise, and taking the sum. Repeating this for every possible shift gives the full cross correlation between the two signals as a function of shift. The convolution of two signals is calculated in almost exactly the same way, except that you first reverse the direction of one of the two signals (e.g. reverse the time axis of a time series, or flip an image both left-right and up-down).
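
Written out for discrete 1-D signals $f$ and $g$ (using the same convention as np.correlate), the cross correlation at shift $n$ is

$$(f \star g)[n] = \sum_m f[m+n]\,g[m]$$

while the convolution reverses $g$ first,

$$(f * g)[n] = \sum_m f[n-m]\,g[m].$$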

TensorFlow doesn't do this flipping of the inputs, so you should expect its convolutions to work more like np.correlate than np.convolve.
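
A quick 1-D sanity check with numpy makes the relationship concrete: convolving with a kernel gives the same result as correlating with that kernel reversed.

import numpy as np

sig = np.array([1.0, 2.0, 3.0, 4.0])
ker = np.array([0.0, 1.0, 0.5])

#convolution reverses the kernel before sliding it over the signal,
#so convolving with ker matches correlating with ker reversed
print(np.allclose(
    np.convolve(sig, ker, mode="full"),
    np.correlate(sig, ker[::-1], mode="full"),
)) #True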

Let's fire up a TensorFlow session and verify that the tf.nn.conv2d operation works the way I've described.

In [7]:
import numpy as np
np.random.seed(789456)
import matplotlib.pyplot as plt

import tensorflow as tf
In [8]:
sess = tf.Session()

#generate a few random images with 2 input channels
n_examples = 2
rimg = np.random.normal(size=[n_examples, 5, 5, 2])
#and a random kernel with 2 input channels and 1 output channel
rk = np.random.normal(size=[3, 3, 2, 1])

#generate placeholders with unspecified (None) shapes so that we can run any shape of convolution we like
img_ph = tf.placeholder(shape=[None, None, None, None], dtype=tf.float32)
kernel_ph = tf.placeholder(shape=[None, None, None, None], dtype=tf.float32)
conv_op = tf.nn.conv2d(img_ph, kernel_ph, strides=[1, 1, 1, 1], padding="SAME")

#apply the tf.nn.conv2d op (really it is a correlation filter, see below)
cout = sess.run(
    conv_op, 
    feed_dict={
        img_ph:rimg, 
        kernel_ph:rk
    }
)

Now let's verify that the sum of the elementwise product of each image patch with the kernel equals the tf.nn.conv2d output at the corresponding location.

In [9]:
for example_idx in range(n_examples): #iterate over examples
    for center_i in range(1, rimg.shape[1]-1): #iterate over interior rows (skip the zero-padded border)
        for center_j in range(1, rimg.shape[2]-1): #iterate over interior columns
            #extract the 3x3 patch of the input image centered at i,j
            patch = rimg[example_idx, center_i-1:center_i+2, center_j-1:center_j+2]
            patch_product_sum = np.sum(patch*rk.squeeze())
            #compare to the tf.nn.conv2d output
            print(cout[example_idx, center_i, center_j] - patch_product_sum)
[  2.98023224e-07]
[ -4.76837158e-07]
[ -4.76837158e-07]
[  5.96046448e-08]
[  1.19209290e-07]
[  2.38418579e-07]
[ -7.15255737e-07]
[ -4.76837158e-07]
[ -3.57627869e-07]
[ -6.25848770e-07]
[  4.76837158e-07]
[  3.57627869e-07]
[  3.57627869e-07]
[  4.76837158e-07]
[ -2.68220901e-07]
[ 0.]
[ -3.57627869e-07]
[ -5.96046448e-07]

The $10^{-7}$ scale errors are due to finite machine precision and are to be expected. Note that these differences are quite a bit larger than the $10^{-15}$ scale differences you might be used to from something like numpy. That is because the convolution is being carried out on my GPU in 32 bit floats, rather than the 64 bit floats numpy defaults to on the CPU.
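
The error scale matches the machine epsilon of single precision floats:

#float32 machine epsilon matches the ~1e-7 error scale above
print(np.finfo(np.float32).eps) #~1.19e-07
print(np.finfo(np.float64).eps) #~2.22e-16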

But don't expect the TensorFlow conv2d, or the convolution operation of any other popular neural network package, to work like the convolve functions in numerical packages like numpy or scipy.

In [4]:
import scipy.signal

#use a scipy convolution function (manually summing over each input channel)
scipy_conv = scipy.signal.convolve2d(
    rimg[0, :, :, 0], 
    rk[:, :, 0, 0], 
    mode="same"
) + scipy.signal.convolve2d(
    rimg[0, :, :, 1], 
    rk[:, :, 1, 0], 
    mode="same"
)

np.mean(np.abs(cout[0].squeeze()-scipy_conv))
Out[4]:
2.9128387681701242

But if we flip the kernel along both spatial axes then the convolution turns into a correlation, and we get the same output as TensorFlow.

In [5]:
scipy_conv_krev = scipy.signal.convolve2d(
    rimg[0, :, :, 0], rk[::-1, ::-1, 0, 0], mode="same"
) + scipy.signal.convolve2d(
    rimg[0, :, :, 1], rk[::-1, ::-1, 1, 0], mode="same")

np.mean(np.abs(cout[0].squeeze()-scipy_conv_krev))
Out[5]:
3.1133651543502341e-07

We can also just leave the kernel as it is and use a correlation function instead.

In [10]:
scipy_corr = scipy.signal.correlate2d(
    rimg[0, :, :, 0], 
    rk[:, :, 0, 0], 
    mode="same"
) + scipy.signal.correlate2d(
    rimg[0, :, :, 1], 
    rk[:, :, 1, 0], 
    mode="same"
)

np.mean(np.abs(cout[0].squeeze()-scipy_corr))
Out[10]:
3.1133651560821819e-07

When Does It Matter?

Within the context of training neural nets the difference between a convolution and a cross correlation is nearly irrelevant. In almost all cases the kernel is learned, so the distinction between learning a correlation template and learning a flipped convolution kernel is moot. But sometimes you might want to use TensorFlow for more conventional signal processing tasks, like low pass filtering or wavelet expansion with designed filters. In such cases it is important to remember to manually flip your filters before putting them into the TensorFlow conv operations. Or you might want to do correlation filtering for object detection with a known template. If you are like me, you might flip the template before putting it into the conv operation, expecting it to get flipped again when applied. This mistake is particularly nasty because if the object you are looking for has any vertical or lateral symmetry, you may not even realize you are looking for the wrong thing.
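
As a minimal sketch of the first case (with a hypothetical hand-designed, deliberately asymmetric filter), flipping both spatial axes before handing the filter to tf.nn.conv2d makes it behave like a true convolution:

#hypothetical hand-designed filter, deliberately asymmetric
designed = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, -1.0],
    [0.0, 0.0, 0.0],
], dtype=np.float32)

#flip both spatial axes, since tf.nn.conv2d is really a correlation;
#the flipped filter then gets applied as a true convolution
flipped = designed[::-1, ::-1]

#reshape to conv2d's [height, width, in_channels, out_channels] layout
kernel = flipped.reshape([3, 3, 1, 1])

The resulting kernel can be fed through the conv_op defined above just like the random kernel was.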
