# Introduction

Neural networks are very powerful tools for classifying data, but they are very hard to debug. They perform a large number of low-level operations, so they behave like black boxes: we provide inputs and get outputs without any understanding of how the network arrives at its results.

A few years ago, some researchers found ways to delve into the networks used for image classification. Instead of backpropagating to the weights, as during the learning phase of a neural network, they backpropagated to the images themselves: in the example below (edited from CS231n), where *x* are the inputs and *w* are the weights, at each learning step the gradient (red) is applied to *x* instead of *w*.
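To make the idea concrete, here is a minimal sketch on a toy model, assuming a single linear neuron instead of a real network. The names `score`, `w` and `x` are hypothetical and only serve the illustration: the weights stay frozen and the gradient step is applied to the input.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)  # frozen "trained" weights of a toy model (hypothetical)
x = rng.normal(size=5)  # the input, which is the only thing we modify


def score(x, w):
    # Toy model: a single linear neuron, score = w . x
    return float(w @ x)


before = score(x, w)
step = 0.1
for _ in range(10):
    grad_x = w             # d(score)/dx = w for a linear neuron
    x = x + step * grad_x  # gradient ascent on the input; w is never touched
after = score(x, w)

print(before, after)
```

In a real network the gradient with respect to the input is obtained by backpropagation, but the update rule is the same.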

In this article, we will use the method and code from Google, Simonyan, Yosinski and Chollet to try to visualize the classes and convolutional layers learnt by popular neural networks. The code provided in this article uses the Keras library.

# Naive Approach

The core idea of this visualisation is to feed a random image to the neural network. Then, specific outputs of chosen layers are maximized using backpropagation on the image. These outputs can be the last layer, representing the classes, or intermediate convolutional layers, representing features learnt by the network.

Using Keras, here is how to do this:

```python
import time

import numpy as np
import scipy.misc
from keras.models import Sequential
from keras.layers import Convolution2D, ZeroPadding2D, MaxPooling2D, Flatten, Dense, Dropout
from keras import backend as K

# VGG16 mean values (BGR order)
MEAN_VALUES = np.array([103.939, 116.779, 123.68]).reshape((3, 1, 1))

# Path to the model weights file
weights_path = 'vgg16_weights.h5'


# Util function to convert a tensor into a valid image
def deprocess(x):
    x = x + MEAN_VALUES  # Add VGG16 mean values (without modifying the input)
    x = x[::-1, :, :]  # Change from BGR to RGB
    x = x.transpose((1, 2, 0))  # Change from (Channel, Height, Width) to (Height, Width, Channel)
    x = np.clip(x, 0, 255).astype('uint8')  # Clip in [0;255] and convert to int
    return x


# Creates a VGG16 model and loads the weights if available
# (see https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3)
def VGG_16(w_path=None):
    model = Sequential()
    model.add(ZeroPadding2D((1, 1), input_shape=(3, 224, 224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1, 1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='linear'))  # avoid softmax (see Simonyan 2013)

    if w_path:
        model.load_weights(w_path)

    return model


# Creates the VGG model and loads the weights
model = VGG_16(weights_path)

# Specify input and output of the network
input_img = model.layers[0].input
layer_output = model.layers[-1].output

# List of the generated images after learning
kept_images = []
# Update coefficient
learning_rate = 500.

# 130 flamingo, 351 hartebeest, 736 pool table, 850 teddy bear
for class_index in [130, 351, 736, 850]:
    print('Processing class %d' % class_index)
    start_time = time.time()

    # The loss is the activation of the neuron for the chosen class
    loss = layer_output[0, class_index]

    # We compute the gradient of the input picture wrt this loss
    grads = K.gradients(loss, input_img)[0]

    # This function returns the loss and grads given the input picture;
    # the extra flag disables the learning phase (in our case dropout)
    iterate = K.function([input_img, K.learning_phase()], [loss, grads])

    np.random.seed(1337)  # for reproducibility
    # We start from a gray image with some random noise
    input_img_data = np.random.normal(0, 10, (1,) + model.input_shape[1:])  # (1,) for batch axis

    # We run gradient ascent for 1000 steps
    for i in range(1000):
        loss_value, grads_value = iterate([input_img_data, 0])  # 0 for test phase
        input_img_data += grads_value * learning_rate  # Apply gradient to image

        print('Current loss value:', loss_value)

    # Decode the resulting input image and add it to the list
    img = deprocess(input_img_data[0])
    kept_images.append((img, loss_value))
    end_time = time.time()
    print('Class %d processed in %ds' % (class_index, end_time - start_time))

# Compute the size of the grid
n = int(np.ceil(np.sqrt(len(kept_images))))

# Build a black picture with enough space for the kept_images
img_height = model.input_shape[2]
img_width = model.input_shape[3]
margin = 5
height = n * img_height + (n - 1) * margin
width = n * img_width + (n - 1) * margin
stitched_res = np.zeros((height, width, 3))

# Fill the picture with our saved images
for i in range(n):
    for j in range(n):
        if len(kept_images) <= i * n + j:
            break
        img, loss = kept_images[i * n + j]
        stitched_res[(img_height + margin) * i: (img_height + margin) * i + img_height,
                     (img_width + margin) * j: (img_width + margin) * j + img_width, :] = img

# Save the result to disk.
# Do not use scipy.misc.imsave because it will normalize the image pixel values between 0 and 255.
# Note: scipy.misc.toimage only exists in old SciPy releases (it was removed in SciPy 1.2).
scipy.misc.toimage(stitched_res, cmin=0, cmax=255).save('naive_results_%dx%d.png' % (n, n))
```

To run this code, you will need Keras, of course, and the VGG16 weights learnt for ILSVRC 2014. You can find them on VGG-16 pre-trained model for Keras · GitHub.

While the idea is simple, there are some tricky parts in the code. First, you must be careful about how the images were fed to the network during the learning phase. Usually, the mean value of each pixel or of each channel over the dataset is subtracted from each pixel of the input image. The order of the channels can be a source of errors too: it can be RGB or BGR depending on the image library used (RGB for PIL and BGR for OpenCV). Finally, if the last layer has a softmax activation, this activation should be removed. Indeed, maximizing a softmax for one class can be done in two ways: maximizing the class score before the softmax or minimizing the scores of all the other classes before the softmax. The latter often happens, resulting in very noisy images; see Simonyan 2013: Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.
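The softmax issue can be checked numerically. The sketch below uses hypothetical scores and plain NumPy rather than the Keras model. It computes the gradient of the softmax probability of one class with respect to the scores: the gradient is positive for the target class and negative for every other class, so gradient ascent on the softmax output also pushes the other scores down.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()


def grad_softmax_wrt_scores(z, c):
    # d softmax(z)[c] / d z[j] = p[c] * (1[j == c] - p[j])
    p = softmax(z)
    one_hot = np.zeros_like(p)
    one_hot[c] = 1.0
    return p[c] * (one_hot - p)


scores = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical class scores
g = grad_softmax_wrt_scores(scores, c=0)
print(g)  # positive for class 0, negative for the other classes
```

With a linear last layer, the gradient of the class score with respect to the scores is simply a one-hot vector, so only the chosen class is pushed up.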

For my tests, I used four classes (you can find the indices of all classes in the synset_words.txt file):

- Top left, class 130, flamingo
- Top right, class 351, hartebeest
- Bottom left, class 736, pool table, billiard table, snooker table
- Bottom right, class 850, teddy, teddy bear

Here are the results produced by the previous script, for learning rates of 250, 500, 750 and 1000:

The results are not great, to say the least, but with a bit of imagination and knowing the classes, we can distinguish some interesting details. In the lower left, we can imagine part of a pool table with one or two balls. In the lower right, we can imagine heads or limbs of teddy bears. So even if the results are not usable as they are, the algorithm is not producing garbage. With a bit of tweaking, I might be able to make cleaner and nicer images.

An interesting property of these images is that they all have a very high confidence rate (>99%) in their respective classes. This process is the basis for generating adversarial and fooling examples, that is, images that score very high for a single class but are unrecognizable to humans. See Deep Neural Networks Are Easily Fooled: High Confidence Predictions For Unrecognizable Images and Breaking Linear Classifiers on ImageNet for further details.

# Using Regularization to Generate More Natural Images

The images produced by the previous algorithm are not natural images: they contain very high frequencies and saturated colors. One way to avoid this behavior is to modify the loss so that the learning process favors more natural images over unnatural ones. Another method is to apply some modification to the image after each optimization step so that the algorithm tends toward nicer images. This second approach is described in Understanding Neural Networks Through Deep Visualization. It is more flexible and easier to use, as many image filters are already available. We will review some of the operations we can apply to the images and the effects they have.

## Clipping

The most obvious way to modify the image is to ensure it is a valid image: all pixel values must be between (0,0,0) and (255,255,255) for a 24-bit image. In the case of the VGG16 network, the mean is subtracted from the input, so at each step we must modify the input image tensor as follows:

```python
input_img_data = np.clip(input_img_data, 0. - MEAN_VALUES, 255. - MEAN_VALUES)
```

This regularization ensures that all pixels have a reasonable influence on the final output. Here is an example of the effects of this regularization, with a learning rate of 1000 and 1000 iterations:

The result is not clearly better than without clipping: it only slightly reduces the saturation and the high frequencies. As it mostly serves as a safeguard against images outside the valid range, we will keep this regularization for the other tests.
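As a self-contained check of the clipping step, the sketch below applies it to a random batch and verifies that, once the mean is added back, every pixel lies in the valid range. The helper name `clip_to_valid_range` and the random test batch are assumptions made for the illustration.

```python
import numpy as np

# VGG16 mean values, shaped (3, 1, 1) so they broadcast over (batch, 3, H, W)
MEAN_VALUES = np.array([103.939, 116.779, 123.68]).reshape((3, 1, 1))


def clip_to_valid_range(img_batch, mean=MEAN_VALUES):
    # Valid pixels are in [0, 255]; in mean-subtracted space that is [-mean, 255 - mean]
    return np.clip(img_batch, 0.0 - mean, 255.0 - mean)


rng = np.random.default_rng(0)
batch = rng.normal(0, 200, size=(1, 3, 8, 8))  # deliberately far outside the valid range
clipped = clip_to_valid_range(batch)
restored = clipped + MEAN_VALUES               # back to ordinary pixel values

print(restored.min(), restored.max())
```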

## Decay

While clipping avoids values outside the valid range of images, it does nothing to make the images look more natural. A simple regularization is to pull the image closer to the mean at each step. It avoids bright pixels with very high values in red, green or blue. The code for decay is:

```python
if l2decay > 0:
    input_img_data *= (1 - l2decay)
```

with l2decay the amount of decay. This value is usually very low, around 0.0001, but it really depends on the learning rate. For high learning rates, decay must be stronger to compensate for the large modifications made to the image. Here are example results with clipping and a decay of 0.0001 and 0.01, with a learning rate of 1000 and 1000 iterations:

As we can see, the higher the decay, the grayer the image for the same learning rate. The decay acts as a force that pulls the image toward the mean image, which is often mostly gray. Decay alone does not produce great results because it mostly reduces saturation and has little effect on high frequencies.
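To get a feeling for the strength of the two decay values, the sketch below applies decay alone for 1000 steps, with no gradient update, to a mean-subtracted pixel value of 100. The helper `apply_decay` is a hypothetical wrapper around the snippet above. In the real algorithm the gradient step counteracts the decay, so this only isolates the pull toward the mean.

```python
import numpy as np


def apply_decay(img, l2decay):
    # Shrink toward 0, which is the dataset mean once the mean has been subtracted
    if l2decay > 0:
        img = img * (1.0 - l2decay)
    return img


def decay_only(l2decay, n_steps=1000, start_value=100.0):
    img = np.full((1, 3, 4, 4), start_value)
    for _ in range(n_steps):
        img = apply_decay(img, l2decay)
    return float(img[0, 0, 0, 0])


weak = decay_only(0.0001)   # about 90: the image is barely affected
strong = decay_only(0.01)   # well under 1: the image collapses to the mean
print(weak, strong)
```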

## Blur

With the problem of unnaturally bright pixels partly addressed, it’s time to focus on the high frequencies produced in the images. The most obvious solution is to blur the image to make it smoother. As blurring is a bit slow and computationally intensive, it is usually applied only every few steps. Moreover, applying a small blur many times has much the same effect as applying a big blur once in a while. The code to blur the image is the following:

```python
# requires: from scipy.ndimage import gaussian_filter
if blurStd != 0 and i % blurEvery == 0:
    input_img_data = gaussian_filter(input_img_data, sigma=[0, 0, blurStd, blurStd])  # blur along H and W but not channels
```

with blurStd the standard deviation of the Gaussian kernel, blurEvery the frequency of the blurring and i the optimization step number. Usually, the standard deviation has values from 0.3 to 1 and the blur is applied every 4 to 8 updates. Of course, Gaussian filters with a high standard deviation should be applied less often than filters with a low standard deviation. Again, these values depend on the learning rate. Here are the results with clipping and a blur of standard deviation 0.5 every 8 updates, 1 every 8 updates, 0.5 every 4 updates and 1 every 4 updates, respectively (still 1000 iterations):

Using only blur, the images begin to be recognizable. The pool table can be seen without clues; the flamingo and hartebeest can be guessed, but it is still difficult. For the teddy bear, it is very difficult to find out what the image represents. From the example above, we can see that the blurring does indeed remove high frequencies but it also makes the colors very dim.
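The effect of the `sigma=[0, 0, blurStd, blurStd]` argument can be verified on random noise. In this sketch the measure `high_freq_energy` is a crude, hypothetical proxy for high-frequency content (the mean squared difference between horizontally adjacent pixels). The second check confirms that a zero sigma on the batch and channel axes leaves channels unmixed.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def high_freq_energy(x):
    # Mean squared difference between horizontally adjacent pixels
    return float(np.mean(np.diff(x, axis=3) ** 2))


rng = np.random.default_rng(0)
noise = rng.normal(0, 10, size=(1, 3, 32, 32))            # (batch, channel, H, W)
blurred = gaussian_filter(noise, sigma=[0, 0, 1.0, 1.0])  # blur along H and W only

# Only channel 0 is non-zero: after blurring, the other channels must stay at zero
one_channel = np.zeros_like(noise)
one_channel[0, 0] = noise[0, 0]
blurred_one = gaussian_filter(one_channel, sigma=[0, 0, 1.0, 1.0])

print(high_freq_energy(noise), high_freq_energy(blurred))
```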

## Median Filter

While blur gives some nice results, there is still a lot of room for improvement. So I tried other image noise reduction filters and found the median filter. It has the nice property of preserving edges, which are important for both humans and neural nets to recognize images. The code to apply a median filter to an image is the following:

```python
# requires: from scipy.ndimage import median_filter
if mFilterSize != 0 and i % mFilterEvery == 0:
    input_img_data = median_filter(input_img_data, size=(1, 1, mFilterSize, mFilterSize))
```

with mFilterSize the median filter size, mFilterEvery the frequency of the filtering and i the optimization step number. As with the blur, we don’t need to apply the filter at each step, and I found that median filters of size 3×3 or 5×5 applied every 4 to 12 updates can give good results. As with the blur, these values depend on the learning rate, and big median filters should be applied less often than small ones. Here are the results for clipping and a median filter of size 3 every 12 updates, size 3 every 8, size 5 every 12 and size 5 every 8, respectively (still 1000 iterations):

The median filter gives quite good results, keeping the shapes while removing high frequencies. It is still a bit difficult to determine the content of each image, but with a filter of size 5 applied every 12 updates it may be possible to guess the 4 classes. Overall, the median filter seems to be a good alternative to the blur filter.
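The edge-preserving behavior can be illustrated on a synthetic image, which is an assumption of this sketch and not one of the generated images: a sharp vertical step edge passes through a 3×3 median filter unchanged, while a Gaussian blur smears it.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

# Synthetic (batch, channel, H, W) image with a sharp vertical edge
edge = np.zeros((1, 1, 8, 8))
edge[..., 4:] = 100.0

median_out = median_filter(edge, size=(1, 1, 3, 3))
blur_out = gaussian_filter(edge, sigma=[0, 0, 1.0, 1.0])

print(np.array_equal(median_out, edge))  # the median filter keeps the step edge intact
print(blur_out[0, 0, 0])                 # the blur turns the step into a ramp
```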

## Others

There are many other regularizations used to produce better images, but I couldn’t test them all. For instance, you can see how Yosinski clips pixels with small norm or contribution. Many other image-enhancing filters could be used; look at GIMP and Photoshop for ideas.

## Picking The Best-Looking Images

All these regularizations aim at better-looking images. But better-looking does not mean optimal with regard to the loss. After each regularization, we can observe a drop in the loss value. This is not really a problem, as the final result looks better, but it raises the question of which image to present as the “best result”.

In this article, I chose the easiest solution: I keep the very last image generated after clipping but before all other regularizations. Indeed, blurring in particular, but other regularizations too, may remove important details. By ending with one or more “pure” gradient ascent steps, I ensure that the images contain fine details. The regularizations are there to keep the algorithm from drifting toward high-frequency images.

There are, of course, other solutions, like keeping the image with the highest loss, but a high loss does not necessarily mean a better-looking image. Some tests should be done to see whether it is really important to define a strategy to pick an image and, if so, which strategy works best.
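The strategy of keeping the last clipped image can be sketched as a small loop. This is an illustration under assumptions, not the exact script used for the results: `regularized_ascent` and `toy_grad_fn` are hypothetical names, and the toy gradient function stands in for the Keras `iterate` function so that the example runs without a trained network.

```python
import numpy as np
from scipy.ndimage import median_filter


def regularized_ascent(grad_fn, x0, lr, n_steps, lo, hi, mf_size=0, mf_every=1):
    """Gradient ascent with clipping and an optional median filter.

    grad_fn is any callable returning (loss, gradient) for a batch of
    shape (1, C, H, W). The returned image is the last one obtained after
    clipping but before any other regularization.
    """
    x = x0.copy()
    kept = x
    for i in range(n_steps):
        _, grad = grad_fn(x)
        x = x + lr * grad       # gradient ascent step on the image
        x = np.clip(x, lo, hi)  # clipping is always applied
        kept = x.copy()         # candidate result, taken before the median filter
        if mf_size != 0 and i % mf_every == 0:
            x = median_filter(x, size=(1, 1, mf_size, mf_size))
    return kept


def toy_grad_fn(x):
    # Toy stand-in for the network: the loss is highest when every pixel equals 50
    diff = x - 50.0
    return -float(np.sum(diff ** 2)), -2.0 * diff


rng = np.random.default_rng(1337)
start = rng.normal(0, 10, size=(1, 3, 16, 16))
result = regularized_ascent(toy_grad_fn, start, lr=0.1, n_steps=50,
                            lo=-100.0, hi=150.0, mf_size=3, mf_every=4)
print(result.mean())
```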

# Combining Regularizations and Algorithm Hyperparameters

So we have an optimization algorithm and regularization methods, each with several parameters. These parameters are called hyperparameters because they are not the parameters of the model: in our case, they control how the image is modified (usually, it is the parameters of the model that are modified).

Each of these hyperparameters has an important impact on the generated result. As it is slow and impractical to test each parameter I relied on Random Search for Hyper-Parameter Optimization. The idea is simple: instead of doing a grid search with the hyperparameters, we do a random search, maximizing the chance of finding a good value for one or several very important parameters.
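A random search can be sketched in a few lines. All the ranges and the function name `sample_hyperparameters` below are hypothetical and chosen only for illustration; they are not the ranges used for the results in this article. Each trial draws every hyperparameter independently, and each sampled configuration would then be used for one run of the image generation.

```python
import random


def sample_hyperparameters(rng):
    # All ranges here are hypothetical, for illustration only
    return {
        "learning_rate": 10 ** rng.uniform(2, 5),  # log-uniform in [1e2, 1e5]
        "l2decay": rng.choice([0.0, 0.0001, 0.001, 0.01]),
        "mf_size": rng.choice([0, 3, 5]),          # 0 disables the median filter
        "mf_every": rng.randint(4, 12),            # both bounds are inclusive
    }


rng = random.Random(1337)  # seeded for reproducibility
trials = [sample_hyperparameters(rng) for _ in range(20)]
for params in trials[:3]:
    print(params)
```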

However, in our case this is not that easy, as the parameters have a huge impact on each other: a high learning rate requires a high decay, strong blurring requires a low decay, etc. Moreover, there are many choices to make about how the algorithm itself works:

- How the learning rate evolves during the learning phase;
- How to do the gradient ascent (classic, Nesterov, RMSprop, Adam, etc.);
- How to initialize the first image (uniform random, Gaussian, etc.); often, two similarly generated random images can produce very different results;
- How to define the loss; this question will be very important when we optimize the convolutional filters.

As if that were not enough, the hyperparameters and the choices for the algorithm can work well for a specific neural net but not for others. For this article, I worked on the VGG16 and the CaffeNet Yosinski networks. For these two networks, I found, using both manual and random search, that clipping and median filters alone worked quite well, combined with a constant learning rate and a classic gradient ascent. The starting images were generated with a normal(0,10). You can find the code for this algorithm at the beginning of the post and in the clipping and median filter sections.

# Results

## Classes

### VGG16

These results were found using a learning rate of 8000, clipping, a median filter of size 5 applied every 4 updates and 1000 iterations. The images are strange: the colors seem to be wrong, with a lot of green. Maybe it’s a bug in my code, but the colors of some objects in the images are right, so I would say the problem is elsewhere. Here are the results, in order: goldfish, hen, magpie, scorpion, American lobster, flamingo, German shepherd, starfish, hartebeest, giant panda, abacus, aircraft carrier, assault rifle, broom, feather boa, mountain tent, pool table and teddy bear:

What is interesting is that, although most of the images are hardly distinguishable, some fine details are visible, for instance the magpie head. Maybe the median filter is not enough to regularize the images. Other tests should be done with the VGG16 network, and there is definitely a problem with the green channel!

### CaffeNet Yosinski

These results were found using a learning rate of 30000, clipping, a median filter of size 5 applied every 4 updates and 1000 iterations. I found these results quite amazing, even if the quality could still be improved. I was able to identify most of the classes represented by the images without any clue.

Animals gave the best results. In order: goldfish, hen, magpie, scorpion, American lobster, flamingo, German shepherd, starfish, hartebeest and giant panda:

Man-made objects were a bit more challenging but many are still recognizable. In order: abacus, aircraft carrier, assault rifle, broom, feather boa, mountain tent, pool table and teddy bear:

Using the same technique as on the VGG16, I had far better results with the CaffeNet Yosinski. I don’t know exactly why, but it shows that it is possible to generate human-recognizable images using a trained deep net. It seems, however, that some deep nets are harder to visualise than others.

Concluding this part on classes visualization, here are the first 200 iterations with a learning rate of 30000, clipping and median filter of size 5 applied every 4 updates for the hen class and the CaffeNet Yosinski:

This shows that the convergence is pretty fast and that the 1000 iterations used in the previous results may not be needed for all classes. An early stopping mechanism using the loss value should be added to make the generation faster without losing quality.
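One possible form of such an early stopping rule is sketched below. It is a hypothetical criterion, not something tested in this article: the function name `should_stop` and the default values of `patience` and `min_rel_gain` are assumptions. Because the loss drops after each regularization step, the rule compares the best loss of the most recent steps with the best loss seen before them, rather than comparing consecutive values.

```python
def should_stop(losses, patience=20, min_rel_gain=0.01):
    """Return True when the best loss over the last `patience` steps is not
    at least `min_rel_gain` (relative) better than the best loss before them."""
    if len(losses) <= patience:
        return False
    best_before = max(losses[:-patience])
    best_recent = max(losses[-patience:])
    return best_recent <= best_before + min_rel_gain * abs(best_before)


# Synthetic loss curves, for illustration only
still_improving = [float(v) for v in range(1, 101)]
plateau = [float(v) for v in range(1, 51)] + [50.0] * 30

print(should_stop(still_improving))  # False
print(should_stop(plateau))          # True
```

In the gradient ascent loop, the loss value of each step would be appended to a list and the loop would break as soon as `should_stop` returns True.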

## Filters

So far, we maximized the output of one class, but it is possible to do the same with each layer to understand what it is detecting. The deeper in the network, the more complex the patterns the filters can recognize. The loss is a bit different for filters and you have basically two choices: you can optimize one filter or all filters in a layer. I chose the latter because it allows me to generate bigger images for shallow layers. The loss function is the following:

```python
loss = K.sum(layer_output[:, layer_index, :, :])
```
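To show what this reduction computes, here is a NumPy sketch on a random stand-in for the layer output. The array `activations` and the index `filter_index` are made up for the example. Summing over the batch and spatial axes of one channel rewards activation at every position of the image, whereas picking a single unit only rewards one location.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a layer output of shape (batch, filters, height, width), after ReLU
activations = np.maximum(rng.normal(size=(1, 4, 6, 6)), 0.0)

filter_index = 2
# Same reduction as the Keras expression: sum over batch and all spatial positions
loss_whole_map = activations[:, filter_index, :, :].sum()
# Alternative: a single unit at one spatial position
loss_single_unit = activations[0, filter_index, 3, 3]

print(loss_whole_map, loss_single_unit)
```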

I chose to generate images the same size as the input of the model but it is also possible to remove the fully connected part of the network to generate images of arbitrary sizes, see How convolutional neural networks see the world.

The last step is to choose the optimized layer. Convolutional layers give the best result but you must be careful to optimize the layers AFTER the activation (in our case ReLU). Optimizing before the activation gives very poor results.

In the following, *lr* means learning rate and *mf* means median filter, followed by its size. Clipping is always applied and there are 200 iterations of gradient ascent. Only two typical filters are presented for each layer.

### VGG16

#### relu1_1 (lr 1)

#### relu1_2 (lr 1)

#### relu2_1 (lr 2, mf 3 every 12)

#### relu2_2 (lr 2, mf 3 every 12)

#### relu3_1 (lr 6, mf 3 every 6)

#### relu3_2 (lr 10, mf 3 every 6)

#### relu3_3 (lr 10, mf 3 every 6)

#### relu4_1 (lr 40, mf 5 every 4)

#### relu4_2 (lr 40, mf 5 every 4)

#### relu4_3 (lr 40, mf 5 every 4)

#### relu5_1 (lr 80, mf 5 every 4)

#### relu5_2 (lr 80, mf 5 every 4)

#### relu5_3 (lr 80, mf 5 every 4)

### CaffeNet Yosinski

#### relu1 (lr 10)

#### relu2 (lr 50, mf 3 every 12)

#### relu3 (lr 100, mf 3 every 6)

#### relu4 (lr 200, mf 5 every 6)

#### relu5 (lr 300, mf 5 every 4)

Even if we know how the networks work, it is still very impressive to see how lower layers and filters learn to extract simple features like lines and colors and, using these features, how higher layers and filters learn complex shapes and even classes. Indeed, we can distinguish shellfish, cups, birds, balls and pandas in the images generated for the last layers.

What is interesting too is that the VGG16 and the CaffeNet Yosinski learn the same kind of low-level filters, and we can wonder whether this is also true for high-level filters; see Convergent Learning: Do different neural networks learn the same representations?.

# Conclusion

In this article, I explained how to generate images using backward propagation on deep networks for image classification.

When maximizing class outputs, these generated images can be used to find fooling examples that are unrecognizable to a human but that are given a very high confidence in one class by the deep network. With a bit of tuning, this process can also generate images that can be recognized by humans. It gives us some feedback on what the networks have learnt to be a good example of one class.

When maximizing convolutional layers, these generated images give us some understanding of the inner workings of the network. While many low-level filters only detect edges in some directions, high-level filters can detect very complex shapes.

Finally, note that this work is very similar to Understanding Neural Networks Through Deep Visualization. The difference in the results is mainly due to the fact that I do not normalize the resulting images (see the normalization in Yosinski’s work in gradient_optimizer.py, saveimagesc in image_misc.py and norm01 in image_misc.py). This difference induces large differences in the hyperparameters too. I also use a median filter instead of blur, which gives far better results in my opinion.

# Ideas For Improvements

While the results are quite good, there is room for improvement. Regularization of the generated image during learning may not be sufficient to generate good-looking images (if it is even possible). The way backpropagation is done and the loss function could be tweaked to improve the images. A good idea to start is to compare the activations for a real image and a generated image. It could be possible to find ways to make these activations look the same and hope that the algorithm generates good images. This comparison could be done using Yosinski’s Deep Visualization Toolbox.

# References

- How convolutional neural networks see the world
- Jason Yosinski
- Simonyan 2013: Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
- Research Blog: Inceptionism: Going Deeper into Neural Networks
- Bergstra 2012: Random Search for Hyper-Parameter Optimization
