Deep Learning 101
Feel Free To Follow Along At Home¶
Download the notebook, fire up a Jupyter server, and play around. You can get my posts as .ipynb files just by replacing the .html with .ipynb in the url, e.g.
wget http://asymptoticlabs.com/blog/posts/deep-learning-101.ipynb
jupyter notebook
What Can You Do With Deep Learning?¶
- Learn simple, easy-to-use representations of data
- Predict the folding structure of proteins
- Predict the chemical properties of molecules
- Play superhuman Chess/Go/Shogi/...
- Turn random noise into novel images
- Recommend new content to users
- Rank the relevancy of search result candidates
- Automatically caption images
- Identify the people in a photo
- Automatically generate interactive fiction
- Predict y given x
Deep Learning in a Nutshell?¶
- Formulate a problem so that you have inputs x and target labels y
- The inputs x are the initial "activations" of the network
- For each layer in the network which is connected to the current activations
- Mix your inputs together into a large number of buckets, called "channels". The mixing is done by matrix multiplication.
- Forget the parts of the mixed up inputs less than some threshold in each channel
- Pass the results as an input to the next layer
- Turn the last set of outputs into probabilities (Squeeze them so they lie between 0 and 1 and sum to 1).
- Slightly alter the particular mix of inputs so that the predictions better match a set of target labels
- Repeat till you get good results or get bored
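Here is a self-contained toy numpy sketch of that recipe, just to make the "mix, threshold, repeat, squash" loop concrete. Everything in it is made up for illustration: random untrained weights, arbitrary layer sizes, and no training step.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # 4 samples, 3 input features: the initial "activations"
for w_shape in [(3, 16), (16, 16)]:         # two layers, each with 16 output "channels"
    w = rng.normal(size=w_shape)
    x = np.maximum(0.0, x @ w)              # mix via matrix multiplication, forget everything below 0
logits = x @ rng.normal(size=(16, 5))       # final mix down to 5 classes
logits -= logits.max(axis=1, keepdims=True) # standard numerical-stability shift
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # squeeze into probabilities
print(probs.sum(axis=1))                    # each row sums to 1
The "slightly alter the mix" step is the only part missing here; that is the gradient-descent training loop that Keras will handle for us below.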
First Some Imports¶
I use TensorFlow + Keras, but PyTorch has recently become quite mature and you may actually want to start with that. If I were starting over again I think I would prefer PyTorch, but I just haven't devoted the time needed to learn it.
import tensorflow as tf
import tensorflow.keras.layers as L #All the types of network modules
import numpy as np
import matplotlib.pyplot as plt
#TensorFlow is still changing quickly
tf.__version__
plt.rcParams.update(
{
"figure.figsize":(12, 6),
"font.size":15,
}
)
A (Very) Simple Problem to Solve¶
With Just a Hint of Real World Complexity¶
- Goal: Learn to map the integers to a class probability with a small amount of label noise
- x = 0 -> y=(1, 0, 0, ...)
- x = 1 -> y=(0, 1, 0, ...)
- ...
- In the real world this would be learning a class probability as a function of a single categorical variable
Data¶
def generate_random_digit_xy_pairs(
n_samples,
n_classes=None,
scale=1.0,
corruption_fraction=0,
):
x = (scale*np.random.exponential(size=(n_samples, 1))).astype(int)
if n_classes is None:
n_classes = len(np.unique(x))
x = np.clip(x, 0, n_classes-1) #clip to a fixed range
x = x.reshape((-1, 1)) #add a channel dimension
y = x
    #add a small amount of label noise
    if corruption_fraction > 0:
        y = np.where(
            np.random.random(y.shape) < corruption_fraction,
            np.random.randint(n_classes, size=y.shape), #noise labels stay in the valid range [0, n_classes)
            y
        )
    y = tf.keras.utils.to_categorical(y, num_classes=n_classes) #one-hot encode the (possibly corrupted) labels
return x, y
np.random.seed(1234)
n_classes = 10
scale = 1.5
x_train, y_train = generate_random_digit_xy_pairs(
n_samples=1000,
n_classes=n_classes,
scale=scale,
corruption_fraction=0.05
)
x_test, y_test = generate_random_digit_xy_pairs(
n_samples=1000,
n_classes=n_classes,
scale=scale,
corruption_fraction=0.05,
)
list(zip(x_train[:5], y_train[:5]))
plt.hist(x_train, bins=20)
plt.xlabel("X value")
plt.ylabel("Count ")
A Somewhat Naive Approach¶
def build_naive_network(n_classes):
input_activations = L.Input(shape=[1])
hidden_layer = L.Dense(
units=32, #number of output channels
activation="relu",
)
#calling a layer as a function applies it to the input activations
hidden_activations = hidden_layer(input_activations)
output_layer = L.Dense(
units=n_classes,
activation="softmax",
)
class_probabilities = output_layer(hidden_activations)
#make the model
model = tf.keras.models.Model(input_activations, class_probabilities)
return model
ReLU? Softmax? Whatsat?¶
A ReLU activation is in some sense just about the most gentle non-linearity you could imagine: it is linear almost everywhere but has a kink at 0. ReLU activations have come to be the de facto standard.
$$ ReLU(x) = max(0, x) $$
Softmax turns a vector of numbers, both positive and negative, into a vector that can be treated as probabilities, which is very helpful for classification problems.
$$ {\large softmax(x)_{i} = \frac{exp(x_{i})}{\sum_{j=1}^{N_C} exp(x_{j})} } $$
Every once in a while you may want to use "tanh" or "sigmoid" activations but most of the time these suffice.
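If that still feels abstract, here is a minimal numpy sketch of both (reusing the np import from above; the example inputs are arbitrary):
def relu(x):
    return np.maximum(0.0, x)               # linear for positive values, zero otherwise

def softmax(x):
    e = np.exp(x - np.max(x))               # subtracting the max is a standard numerical-stability trick
    return e / e.sum()

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # -> [0.  0.  0.  1.5]
print(softmax(np.array([2.0, 1.0, -1.0])))      # entries lie in (0, 1) and sum to 1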
model = build_naive_network(n_classes=n_classes)
model.summary()
tf.keras.utils.plot_model(
model,
show_shapes=True,
show_layer_names=True,
)
Picking an Optimizer and a Loss¶
optimizer = tf.keras.optimizers.Adam() # Modern standard optimizer
model.compile(
loss=tf.keras.losses.CategoricalCrossentropy(), # typical for classification problems
optimizer=optimizer,
metrics=["accuracy"], #extra numbers to track
)
Check That The GPU Has Been Found¶
E.g. if you are running in Docker you need to use either the --runtime nvidia flag or the nvidia-docker command instead of docker, otherwise the GPU will not be found. There may also be driver problems, etc.
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
Train The Model¶
fit_history = model.fit(
x_train,
y_train,
validation_data=(
x_test,
y_test,
),
batch_size=32,
epochs=20,
)
Make Sure You Check Your GPU Utilization!¶
nvidia-smi -l
The chances are very high that your GPU utilization is low, or even that you aren't using the GPU at all! I have frequently found that TensorFlow has quietly fallen back to CPU only, and sometimes it can be difficult to tell the difference between a tough-to-compute model and a model that is being churned through on only the CPU. I usually don't bother with getting my GPU utilization above around 75% and often settle for just 50% or so. Squeezing that last little bit out of the GPU can often be very hard, and simple changes to your data pipeline can easily tank all that hard work.
This model is incredibly lightweight and I can't get much above 10% utilization through batch size alone. Often you may need to optimize your data pipeline in some way to keep the GPU fed.
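If you do hit that wall, a tf.data input pipeline is the usual first thing to reach for. A minimal sketch (the batch size here is just illustrative, and on older TF 2.x versions you may need tf.data.experimental.AUTOTUNE instead of tf.data.AUTOTUNE):
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .cache()                                # this toy dataset easily fits in memory
    .shuffle(buffer_size=len(x_train))
    .batch(256)                             # fewer, larger batches per epoch
    .prefetch(tf.data.AUTOTUNE)             # overlap data preparation with training on the GPU
)
#model.fit(train_ds, validation_data=(x_test, y_test), epochs=20)
For this particular model the bottleneck is Python/launch overhead rather than data loading, but for image or text models a pipeline like this often makes the difference.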
The Importance of Broken Clocks¶
Even a broken clock is right twice a day, and your first baseline for a model should always be a constant model: one whose output does not change with respect to the inputs at all. These sorts of baselines are often the most useful possible diagnostics, because you know for certain that something is wrong if your model isn't at least beating this simple baseline. What would a good categorical cross entropy for this problem be? 1.2, 0.5, 0.005? It is very hard to say from problem to problem, but if you know the "broken clock" model would give you the same performance as your model currently does, then you know that either something is wrong or you just don't have real predictive power.
broken_clock_probs = np.mean(y_train, axis=0)
broken_clock_probs
from sklearn import metrics
metrics.log_loss(y_test, np.ones((y_test.shape))*broken_clock_probs)
If the probabilities are well calibrated this will be equal to the entropy of the labels!
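To see why, write $q_c$ for the empirical frequency of class $c$. The average log loss of any constant prediction $p$ over the labels is
$$ -\frac{1}{N}\sum_{i=1}^{N} \log p_{y_i} = -\sum_{c=1}^{N_C} q_c \log p_c $$
and when the constant prediction is the observed class frequencies themselves ($p = q$), this is exactly $-\sum_c q_c \log q_c$, the entropy of the labels.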
def calc_entropy(p):
return -1.0*np.sum(p*np.log(p))
calc_entropy(broken_clock_probs)
So the model beats a broken clock by a factor of 2. Should we consider that a win? Do we need more data? Should we try a new model architecture, or data augmentation somehow?
Tip: The "v" weights of the adam optimizer are an estimate of the running average squared gradient magnitude¶
[(w.name, w.shape, np.mean(w.numpy())) for w in optimizer.weights if "/v:" in w.name]
Even extremely large or extremely small values here aren't necessarily a problem, but they often can be. For example, I was recently training a model where these values were all 1e-10 or smaller in nearly every layer, which turned out to be related to some problems with the loss function I was using. The deeper the network, the smaller these values usually are, though things like batch normalization combat that somewhat.
What did it learn?¶
def make_activation_model(net):
activation_map = {layer.name:layer.output for layer in net.layers}
return tf.keras.Model(net.inputs, activation_map)
activation_model = make_activation_model(model)
feature_maps = activation_model.predict(np.arange(10)[:, np.newaxis])
hidden_features = feature_maps["dense"]
for channel in range(32):
plt.plot(hidden_features[:, channel], alpha=0.7, lw=3)
Gross! The network has learned features which are all mostly the same (common, and not necessarily bad for a neural network, but it does suggest we aren't getting much out of all those channels).
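One rough way to put a number on that redundancy, using the hidden_features array computed above (constant, "dead" channels are skipped since their correlation is undefined):
active = hidden_features[:, hidden_features.std(axis=0) > 0]  # drop dead/constant channels
channel_corr = np.corrcoef(active.T)                          # channel-by-channel correlation matrix
off_diagonal = channel_corr[np.triu_indices_from(channel_corr, k=1)]
print("mean absolute inter-channel correlation:", np.abs(off_diagonal).mean())
Values close to 1 mean the surviving channels are essentially scaled copies of one another.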
for name in feature_maps:
plt.title(name)
plt.imshow(feature_maps[name])
plt.ylabel("Digit Input")
plt.xlabel("Activation Channel")
plt.show()
About 20% Cooler¶
def build_network(
channels,
n_classes,
dropout=-1,
vocabulary_size=None,
embedding_dimension=None,
):
#I like to chain activations in blocks together by overwriting "x"
x = L.Input(shape=[1])
input_activations = {
"a_categorical_feature":x
}
    if embedding_dimension is not None:
# Important Note!
# input dimension would not normally be equal to n_classes!!!
x = L.Embedding(input_dim=vocabulary_size, output_dim=embedding_dimension)(x)
x = L.Reshape((embedding_dimension,))(x) #get rid of the sequence dimension
for layer_idx, n_ch in enumerate(channels):
x = L.Dense(n_ch, activation="relu")(x)
if dropout > 0:
x = L.Dropout(dropout)(x)
class_probabilities = L.Dense(n_classes, activation="softmax")(x)
#make the model
model = tf.keras.models.Model(input_activations, class_probabilities)
#compile the model
model.compile(
loss=tf.keras.losses.SparseCategoricalCrossentropy(), #can also pass from_logits
optimizer="adam",
metrics=["accuracy"]
)
return model
We Need to Go Deeper!¶
cooler_model = build_network(
[32, 32, 32],
n_classes=n_classes,
)
cooler_x_train = {"a_categorical_feature":x_train}
cooler_y_train = np.argmax(y_train, axis=1)
cooler_x_test = {"a_categorical_feature":x_test}
cooler_y_test = np.argmax(y_test, axis=1)
cooler_model.fit(
cooler_x_train,
cooler_y_train,
validation_data=(
cooler_x_test,
cooler_y_test,
),
batch_size=32,
epochs=20,
)
cooler_activation_model = make_activation_model(cooler_model)
cooler_feature_maps = cooler_activation_model.predict(np.arange(10)[:, np.newaxis])
for name in cooler_feature_maps:
plt.title(name)
plt.imshow(cooler_feature_maps[name])
plt.ylabel("Digit Input")
plt.xlabel("Activation Channel")
plt.show()
Categoricals --> Embeddings¶
But really we are making the problem too hard on ourselves. Encoding a categorical variable as a scalar variable just because it happens to be an integer is really bad form. The proper way to deal with that input is to put it through an embedding layer. This is more or less equivalent to one-hot encoding it and then putting the one-hot encoded vector through a single dense layer, but it is much, much more efficient.
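As a quick sanity check of that equivalence, here is a sketch (the output dimension of 4 is arbitrary, and it reuses n_classes and the layer alias L from above): an Embedding lookup returns exactly what you would get from one-hot encoding followed by a bias-free Dense layer that shares the same weight matrix.
emb = L.Embedding(input_dim=n_classes, output_dim=4)
ids = np.arange(n_classes)[:, np.newaxis]
looked_up = emb(ids).numpy().reshape(n_classes, 4)        # direct embedding lookup

one_hot = tf.keras.utils.to_categorical(ids, num_classes=n_classes)
dense = L.Dense(4, use_bias=False)
dense.build((None, n_classes))
dense.set_weights(emb.get_weights())                      # reuse the embedding matrix as the dense kernel
via_dense = dense(one_hot).numpy()                        # one-hot followed by a dense layer

print(np.allclose(looked_up, via_dense))                  # True, they are the same computation
The efficiency win is that the embedding never materializes the one-hot vectors or performs the full matrix multiply; it just indexes directly into the weight matrix.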
embedding_model = build_network(
[32,],
n_classes=n_classes,
vocabulary_size=n_classes,
embedding_dimension=32,
)
embedding_model.fit(
cooler_x_train,
cooler_y_train,
validation_data=(
cooler_x_test,
cooler_y_test,
),
batch_size=32,
epochs=20,
)
embedding_activation_model = make_activation_model(embedding_model)
embedding_feature_maps = embedding_activation_model.predict(np.arange(10)[:, np.newaxis])
for name in embedding_feature_maps:
fmap = embedding_feature_maps[name]
if len(fmap.shape) != 2:
continue
plt.title(name)
plt.imshow(fmap)
plt.ylabel("Digit Input")
plt.xlabel("Activation Channel")
plt.show()