# Introduction

In this post we will study the MNIST database, which is very popular for testing new models. This database was put together by Yann LeCun et al. using existing data. It consists of 28×28-pixel images of digits, split into 60,000 training examples and 10,000 test examples. Without going into the details, this data has been preprocessed and organised to make it easy to use with various algorithms and to ensure that results are unbiased. If you want more details, you can read the dedicated page.

# The MNIST Database

First things first: when you discover new data, you want to visualize it to understand what your model will be given. There are many ways to visualize the data and its structure. In this post we will use the simplest method for the MNIST database: displaying the images. You might also want to explore the structure of the data with PCA, t-SNE or many other algorithms.
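As a pointer for the PCA option, here is a minimal NumPy sketch that projects flattened images onto their first two principal components. The helper name `pca_project` is ours, and the input is random stand-in data with the same layout as MNIST rather than the real digits, so only the mechanics are illustrated.

```python
import numpy as np

def pca_project(images, n_components=2):
    """Project flattened images onto their first principal components."""
    # Flatten each image to a vector: (n, 28, 28) -> (n, 784)
    flat = images.reshape(images.shape[0], -1).astype("float64")
    # Center the data so the components describe variance around the mean
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Stand-in data shaped like MNIST images (random values, not real digits)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
projected = pca_project(fake_images)
print(projected.shape)  # (100, 2)
```

On the real data you would pass `X_train` (or a subset of it) and scatter-plot the two columns, coloured by label.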

Instead of downloading the data and loading it into our code manually, we use the dedicated function in Keras:

```python
from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()
```

With these lines of code we load all the MNIST data into 4 variables:

- *X_train* is a tensor of shape (60000, 28, 28) containing, for each grayscale training image, the value of each pixel, which is an int in [0;255];
- *y_train* is a tensor of shape (60000,) containing, for each training image, its label, which is an int in [0;9];
- *X_test* is a tensor of shape (10000, 28, 28) containing, for each grayscale test image, the value of each pixel, which is an int in [0;255];
- *y_test* is a tensor of shape (10000,) containing, for each test image, its label, which is an int in [0;9].

We will then reorganize the data so that we have a single tensor containing the pixels of all the images positioned side by side, and use a ready-made function to generate an image from that tensor. This function needs a package named PIL, so we must first install it. In the command line:

```
conda install pillow
```

Then we can code the tensor reorganization and the image generation:

```python
from keras.datasets import mnist
from scipy.misc import imsave  # needs PIL; only available in older SciPy releases
import numpy as np
import math

(X_train, y_train), (X_test, y_test) = mnist.load_data()

#generate a separate image for training and test sets
for (dataset, name) in [(X_train, "mnist_train"), (X_test, "mnist_test")]:

    #We will make a square grid which can contain s*s images
    s = math.ceil(math.sqrt(dataset.shape[0]))
    #Our image will be of size w*h. In the case of MNIST w=h
    w = s*dataset.shape[1]
    h = s*dataset.shape[2]
    #Create a zero-filled tensor so that unused grid cells stay black
    #(np.empty would leave them uninitialized)
    allimgs = np.zeros([w, h])
    #Fill the newly created tensor
    for index in range(dataset.shape[0]):
        iOffset = (index%s)*dataset.shape[1] #remainder of the Euclidean division
        jOffset = (index//s)*dataset.shape[2] #quotient of the Euclidean division
        for i in range(dataset.shape[1]):
            for j in range(dataset.shape[2]):
                allimgs[iOffset+i,jOffset+j] = dataset[index, i, j] #Copy the pixel value

    #Generate the image
    imsave(name+".png", allimgs)
```

This code will give you two images:

As you can see, the data is shuffled and it is not that clean:

- Digits come from different origins (see the 7 and 9);
- Some digits are truncated at the bottom or the top;
- There is some variation in how well the digits are written and in the thickness of the line.
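Coming back to the grid-building code for a moment: the two inner pixel loops can be replaced by a single slice assignment per image. The sketch below is an equivalent NumPy-only rewrite (the helper name `tile_images` is ours); it returns the grid instead of saving it, so it can be tried without Keras or an image library.

```python
import math
import numpy as np

def tile_images(dataset):
    """Arrange images of shape (n, h, w) on a square grid, column by column."""
    n, h, w = dataset.shape
    s = math.ceil(math.sqrt(n))  # the grid holds s*s images
    grid = np.zeros((s * h, s * w), dtype=dataset.dtype)  # unused cells stay black
    for index in range(n):
        row, col = index % s, index // s  # same layout as the pixel-by-pixel loop
        grid[row * h:(row + 1) * h, col * w:(col + 1) * w] = dataset[index]
    return grid

# Five tiny 2x2 "images", each filled with its own index + 1
demo = np.stack([np.full((2, 2), i + 1) for i in range(5)])
print(tile_images(demo).shape)  # (6, 6)
```

Copying a whole 28×28 block at once avoids 784 Python-level assignments per image, which matters with 60,000 images.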

You can also group the digits according to the label as done here.
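One simple way to get such a grouping, not necessarily the one used in the linked example, is to sort the images by label before drawing them. A small sketch with a hypothetical helper `sort_by_label`:

```python
import numpy as np

def sort_by_label(images, labels):
    """Reorder images so that all 0s come first, then all 1s, and so on."""
    # A stable sort keeps the original order within each digit class
    order = np.argsort(labels, kind="stable")
    return images[order], labels[order]

# Toy example: four 1x1 "images" whose pixel value is their original position
toy_images = np.arange(4).reshape(4, 1, 1)
toy_labels = np.array([3, 1, 2, 1])
sorted_images, sorted_labels = sort_by_label(toy_images, toy_labels)
print(sorted_labels)  # [1 1 2 3]
```

With the real data you would call `sort_by_label(X_train, y_train)` and feed the result to the grid-building code.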

# Testing a Multilayer Perceptron

The simplest, oldest and best-known type of neural network is the multilayer perceptron (MLP). However, we will use some tricks discovered in the last few years to improve learning, both in quality and in speed. The first trick is to use the ReLU activation instead of the sigmoid, see here why. The second trick is to use dropout for regularization, see here why.
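One commonly cited argument for ReLU is about gradients: the derivative of the sigmoid never exceeds 0.25 and shrinks towards zero for large inputs, whereas the derivative of ReLU is exactly 1 for any positive input. The NumPy sketch below only illustrates that property from the textbook definitions; it is not taken from Keras.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 for x = 0

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    return (x > 0).astype(x.dtype)  # 1 for positive inputs, 0 otherwise

x = np.array([-6.0, -1.0, 0.5, 6.0])
print(sigmoid_grad(x))  # small everywhere, tiny at -6 and 6
print(relu_grad(x))     # [0. 0. 1. 1.]
```

When several layers are stacked, these small sigmoid derivatives are multiplied together during backpropagation, which slows learning in the first layers.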

This example is based on the example MLP in the Keras GitHub repository. In this post we will only make the neural network learn a good classification. We will see how to improve the results by tuning the hyper-parameters (structure of the model, optimization algorithm, learning rate, etc.) in later posts.

```python
import numpy as np
np.random.seed(1337) # for reproducibility

import os
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import RMSprop
from keras.utils import np_utils

batch_size = 128 #Number of images used in each optimization step
nb_classes = 10 #One class per digit
nb_epoch = 20 #Number of times the whole data is used to learn

(X_train, y_train), (X_test, y_test) = mnist.load_data()

#Flatten the data, MLP doesn't use the 2D structure of the data. 784 = 28*28
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)

#Make the value floats in [0;1] instead of int in [0;255]
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

#Display the shapes to check if everything's ok
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# convert class vectors to binary class matrices (ie one-hot vectors)
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

#Define the model architecture
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10)) #Last layer with one output per class
model.add(Activation('softmax')) #We want a score similar to a probability for each class

#Use rmsprop to do the gradient descent see http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
#and http://cs231n.github.io/neural-networks-3/#ada
rms = RMSprop()
#The function to optimize is the cross entropy between the true label and the output (softmax) of the model
model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=["accuracy"])

#Make the model learn
model.fit(X_train, Y_train,
          batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=2,
          validation_data=(X_test, Y_test))

#Evaluate how the model does on the test set
score = model.evaluate(X_test, Y_test, verbose=0)

print('Test score:', score[0])
print('Test accuracy:', score[1])
```

This code generates this model:
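The size of each layer can also be tallied by hand from the code: a dense layer has one weight per (input, unit) pair plus one bias per unit. The short sketch below does that count for the three `Dense` layers; the helper name `dense_params` is ours.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [784, 512, 512, 10]  # input, two hidden layers, output
pairs = list(zip(layer_sizes, layer_sizes[1:]))
counts = [dense_params(a, b) for a, b in pairs]

for (a, b), count in zip(pairs, counts):
    print(f"Dense {a} -> {b}: {count} parameters")
print("Total:", sum(counts))  # Total: 669706
```

The activation and dropout layers add no parameters, so almost all of the model's capacity sits in the first two dense layers.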

With this model we achieve an accuracy of 98.43%, which is good but not great. You can compare the score to other models on the MNIST page. There are lots of improvements to make to the model, but for a first model, it is more than honorable.
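Alongside the accuracy, the script reports a loss: the categorical cross-entropy between the one-hot label and the softmax output of the network. To make those three pieces concrete, here is a minimal NumPy sketch written from their textbook definitions (the function names are ours, and this is not how Keras implements them internally).

```python
import numpy as np

def one_hot(labels, nb_classes):
    """Turn integer labels into rows with a single 1 at the label's index."""
    encoded = np.zeros((labels.shape[0], nb_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

def softmax(logits):
    """Turn raw scores into positive values that sum to 1 on each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -log(probability assigned to the true class)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-(y_true * np.log(y_pred)).sum(axis=1).mean())

# A network that knows nothing gives every digit the same score
uniform = softmax(np.zeros((1, 10)))
print(categorical_crossentropy(one_hot(np.array([3]), 10), uniform))
```

The printed value is ln(10), about 2.30, which is the loss to expect from a 10-class model that has not learned anything yet.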

During the learning, the loss and accuracy on both the training and the test set will be displayed. Here are those numbers displayed on a graph:

This data is very important for analysing how well the learning went. However, it is not that easy to read. The most important thing to understand is that **you cannot compare training and test loss/accuracy values.** As explained in the Keras FAQ, during training the network uses dropout and regularization when computing the scores, whereas during testing it does not. That is why, on the previous graph, the loss is higher (and the accuracy lower) on the training set than on the test set for the first epochs. Without dropout and regularization, the model would almost always score better on the training set than on the test set.
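To see why the two measurements differ, it helps to look at what a dropout layer does in each mode. The sketch below uses the common "inverted dropout" formulation in plain NumPy; it is an illustration under that assumption, not the Keras source.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units during training only."""
    if not training:
        return x  # at test time the layer is the identity
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)  # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones((1, 8))
print(dropout(activations, 0.2, training=True, rng=rng))   # some units zeroed, the rest scaled up
print(dropout(activations, 0.2, training=False, rng=rng))  # unchanged
```

The training scores are therefore computed by a network with randomly missing units, while the test scores come from the full network.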

What we can say from this graph is that the optimization process is working: the training loss decreases as data is fed to the algorithm. However, the test accuracy seems to reach a plateau after 10 epochs, and it does not seem that adding more epochs would increase it. The model does not overfit, meaning that it does not learn features specific to the training set that are not pertinent for the test set. If that were the case, the test accuracy would decrease while the training accuracy would keep increasing.
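Both readings of the curves (plateau and overfitting) can be turned into small checks on the per-epoch accuracies. The sketch below is one possible way to do it: the helper names and the thresholds are our own choices, and the two lists are made-up numbers for illustration, not the results of this post.

```python
def plateau_epoch(val_accuracy, min_delta=0.001):
    """Return the 1-based epoch of the last improvement larger than min_delta."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc > best + min_delta:
            best, best_epoch = acc, epoch
    return best_epoch

def looks_overfit(train_accuracy, val_accuracy, window=3):
    """True if training accuracy rose while test accuracy fell over the last epochs."""
    return (train_accuracy[-1] > train_accuracy[-window]
            and val_accuracy[-1] < val_accuracy[-window])

# Made-up curves, for illustration only
train_acc = [0.90, 0.95, 0.97, 0.98, 0.985, 0.99]
test_acc = [0.95, 0.97, 0.978, 0.981, 0.9812, 0.9811]

print(plateau_epoch(test_acc))             # 4
print(looks_overfit(train_acc, test_acc))  # False
```

In practice you would fill the two lists from the values printed by `model.fit` at each epoch.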

# Testing a Convolutional Neural Network

MLPs are now rarely used alone to classify images. The state-of-the-art neural networks are now based on convolutions and are called ConvNets. For this model we will again use ReLU activation and dropout. This model is also based on the example CNN in the Keras GitHub repository.

```python
import numpy as np
np.random.seed(1337) # for reproducibility

import os
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils

batch_size = 128
nb_classes = 10
nb_epoch = 12

# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
nb_pool = 2
# convolution kernel size
nb_conv = 3

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

#Add the depth in the input. Only grayscale so depth is only one
#see http://cs231n.github.io/convolutional-networks/#overview
X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)

#Make the value floats in [0;1] instead of int in [0;255]
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

#Display the shapes to check if everything's ok
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# convert class vectors to binary class matrices (ie one-hot vectors)
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

model = Sequential()

#For an explanation on conv layers see http://cs231n.github.io/convolutional-networks/#conv
#By default the stride/subsample is 1
#border_mode "valid" means no zero-padding.
#If you want zero-padding add a ZeroPadding layer or, if stride is 1 use border_mode="same"
model.add(Convolution2D(nb_filters, nb_conv, nb_conv,
                        border_mode='valid',
                        input_shape=(1, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, nb_conv, nb_conv))
model.add(Activation('relu'))

#For an explanation on pooling layers see http://cs231n.github.io/convolutional-networks/#pool
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Dropout(0.25))

#Flatten the 3D output to 1D tensor for a fully connected layer to accept the input
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes)) #Last layer with one output per class
model.add(Activation('softmax')) #We want a score similar to a probability for each class

#The function to optimize is the cross entropy between the true label and the output (softmax) of the model
#We will use adadelta to do the gradient descent see http://cs231n.github.io/neural-networks-3/#ada
model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=["accuracy"])

#Make the model learn
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=1, validation_data=(X_test, Y_test))

#Evaluate how the model does on the test set
score = model.evaluate(X_test, Y_test, verbose=0)

print('Test score:', score[0])
print('Test accuracy:', score[1])
```

This code generates this model:

We can see that, as the convolution layers have a stride/subsample of 1 and no zero-padding, each layer reduces the output width and height by 2. With zero-padding, the output would have the same height and width as the input. Note that the convolution layers all have a depth of 32, but it could be a depth of 32 for the first layer and, for instance, a depth of 64 for the second layer.
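This reduction follows from the usual output-size formula for a convolution, (W - F + 2P) / S + 1, where W is the input size, F the kernel size, P the zero-padding and S the stride. A small sketch (the helper name is ours) applied to the two convolution layers:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

size = 28
for layer in ("conv1", "conv2"):
    size = conv_output_size(size, kernel_size=3)
    print(layer, size)  # conv1 26, then conv2 24

# With one pixel of zero-padding on each side, a 3x3 convolution keeps the size
print(conv_output_size(28, kernel_size=3, padding=1))  # 28
```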

The max pooling layer is of size 2×2 and, by default, non-overlapping (so in our case a stride of 2), so it divides the height and width by two. This is the most used type of max pooling layer. Max pooling layers of 3×3 and larger were often found to be too destructive to give good results.
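To make the operation concrete, here is a minimal NumPy sketch of non-overlapping 2×2 max pooling on a single feature map. It assumes even height and width, and the function name is ours.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a (height, width) array with even sides."""
    h, w = feature_map.shape
    # Split rows and columns into groups of 2, then take the max of each 2x2 block
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.arange(16).reshape(4, 4)
print(max_pool_2x2(feature_map))
# [[ 5  7]
#  [13 15]]
```

Applied to each of the 32 feature maps of size 24×24, this gives the 12×12 maps that are passed on to the rest of the network.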

For the dense (fully connected) layers, the first has 4608 inputs because the flatten layer flattens a tensor of shape (32, 12, 12) which contains 32*12*12=4608 values.
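The same bookkeeping gives the number of trainable parameters per layer. In the sketch below (helper names are ours), a convolution layer has one 3×3 kernel per input channel plus a bias for each filter, and a dense layer has one weight per input plus a bias for each unit.

```python
def conv_params(in_channels, n_filters, kernel_size):
    """(kernel weights per filter + 1 bias) times the number of filters."""
    return (kernel_size * kernel_size * in_channels + 1) * n_filters

def dense_params(n_inputs, n_units):
    """(one weight per input + 1 bias) times the number of units."""
    return (n_inputs + 1) * n_units

flattened = 32 * 12 * 12  # output of the Flatten layer
counts = {
    "conv1": conv_params(1, 32, 3),
    "conv2": conv_params(32, 32, 3),
    "dense1": dense_params(flattened, 128),
    "dense2": dense_params(128, 10),
}
for name, count in counts.items():
    print(name, count)
print("total", sum(counts.values()))  # total 600810
```

Almost all of the parameters are in the first dense layer, not in the convolutions.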

With this model we achieve an accuracy of 99.14%, which is 0.71 percentage points better than the MLP. It is still not as good as the best models but it surpasses many classification models.
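A gain of 0.71 points can look small, so it is worth expressing it in terms of errors. The arithmetic below uses only the two accuracies reported in this post and the 10,000 test images.

```python
mlp_accuracy = 98.43  # percent, from the MLP section
cnn_accuracy = 99.14  # percent, from this section
n_test = 10000

gain_points = cnn_accuracy - mlp_accuracy
mlp_error = 100 - mlp_accuracy
cnn_error = 100 - cnn_accuracy
relative_error_reduction = (mlp_error - cnn_error) / mlp_error

print(f"{gain_points:.2f} percentage points")          # 0.71 percentage points
print(round(mlp_error * n_test / 100), "MLP errors")   # 157 MLP errors
print(round(cnn_error * n_test / 100), "CNN errors")   # 86 CNN errors
print(f"{relative_error_reduction:.0%} fewer errors")  # 45% fewer errors
```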

The evolution of loss and accuracy over the epochs is very similar to that of the MLP: the training loss decreases over the epochs and the test accuracy increases to a plateau after 8 to 10 epochs. In this case it is very obvious that the training and test loss and accuracy cannot be compared by value: the values are worse for the training set than for the test set, again because they are not really the same measures.

# Conclusion

In this post we discovered the MNIST database, which is very useful for testing new models on simple but real-world data. Learning is quite fast on this kind of data, which makes it possible to test many different configurations.

On this data, we applied a simple multilayer perceptron to get a grasp of how to define neural networks in Keras. This model gave decent results despite being very easy to code and to train. We then coded a convolutional network, which is closer to state-of-the-art models. The results were better than the MLP but still far from the best results obtained on the MNIST dataset. There is still room to optimize the architecture of the model and the learning algorithms.

## Ahmed Desoky

Hey,

Lots of thanks for the efforts.

The comments linking the code to the lectures, especially those of CS231n, are so valuable.

However, the code might not work because you do not set the image ordering.

You should add this at the beginning of your code (or more specifically, before training your model):

from keras import backend as K

K.set_image_dim_ordering('th')

Refer to this link on stackoverflow for more details:

http://stackoverflow.com/questions/39815518/keras-maxpooling2d-layer-gives-valueerror

Thanks,

Ahmed Desoky

Algorithms SW Engineer

Avelabs LLC