Machine Learning Experiments


MNIST Database and Simple Classification Networks


In this post we will study the MNIST database, which is very popular for testing new models. This database was put together by Yann LeCun et al. from existing data. It consists of 28×28-pixel images of handwritten digits, split into 60,000 training examples and 10,000 test examples. Without going into the details, this data has been preprocessed and organised to make it easy to use with various algorithms and to ensure that results are unbiased. If you want more details, you can read the dedicated page.

The MNIST Database

First things first: when you discover new data, you want to visualize it to understand what your model will be given. There are many ways to visualize the data and its structure. In this post we will use the simplest method for the MNIST database: displaying the images. You might also want to look at the structure of the data with PCA, t-SNE or many other algorithms.

Instead of downloading the data and loading it into our code manually, we use the dedicated method in Keras:

With these lines of code we load all the MNIST data into 4 variables:

  • X_train is a tensor of shape (60000, 28, 28) containing, for each training grayscale image, the value of each pixel, which is an int in [0;255]
  • y_train is a tensor of shape (60000,) containing, for each training image, its label, which is an int in [0;9]
  • X_test is a tensor of shape (10000, 28, 28) containing, for each test grayscale image, the value of each pixel, which is an int in [0;255]
  • y_test is a tensor of shape (10000,) containing, for each test image, its label, which is an int in [0;9]
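The loading step described above can be sketched as follows. This is a minimal sketch, assuming a Keras installation that provides `keras.datasets.mnist`; the helper names `load_mnist` and `describe` are ours, not part of Keras.

```python
import numpy as np


def load_mnist():
    """Return ((X_train, y_train), (X_test, y_test)) as NumPy arrays."""
    # Imported here so that `describe` below can be used without Keras.
    from keras.datasets import mnist
    return mnist.load_data()  # downloads the data on first use


def describe(name, array):
    """Summarize the shape, dtype and value range of an array."""
    return "{}: shape={}, dtype={}, range=[{}, {}]".format(
        name, array.shape, array.dtype, int(array.min()), int(array.max()))


# Typical use (needs Keras and, the first time, a network connection):
# (X_train, y_train), (X_test, y_test) = load_mnist()
# print(describe("X_train", X_train))
# print(describe("y_train", y_train))
```

Printing the summary for each of the four variables is a quick way to check the shapes and value ranges listed above.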

We will then reorganize the data so that we have a tensor containing the pixels of the images positioned side by side, and use a magic function to generate an image from that tensor. This function needs a package named PIL, so we must first install it. In the command line:

Then we can code the tensor reorganization and the image generation:
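The original listing is not reproduced here, so what follows is only a sketch of one way to do it, assuming NumPy for the reorganization and Pillow (the maintained fork of PIL, usually installed with `pip install Pillow`) for the image. The function names `tile_images` and `save_grid` and the output path are hypothetical.

```python
import numpy as np


def tile_images(images, rows, cols):
    """Arrange the first rows*cols images of shape (h, w) in a single grid."""
    n, h, w = images.shape
    if rows * cols > n:
        raise ValueError("not enough images to fill the grid")
    grid = images[:rows * cols].reshape(rows, cols, h, w)
    # (rows, cols, h, w) -> (rows, h, cols, w), so that the pixels of images
    # in the same grid row end up side by side.
    grid = grid.transpose(0, 2, 1, 3)
    return grid.reshape(rows * h, cols * w)


def save_grid(images, rows, cols, path):
    """Save the grid as a grayscale image file (needs Pillow)."""
    from PIL import Image
    grid = tile_images(images, rows, cols).astype(np.uint8)
    Image.fromarray(grid).save(path)


# Typical use, with X_train loaded as above (the file name is an example):
# save_grid(X_train, rows=10, cols=10, path="mnist_train.png")
```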

This code will give you two images:

As you can see, the data is shuffled and it is not that clean:

  • Digits come from different origins (see the 7 and 9);
  • Some digits are truncated at the bottom or the top;
  • There is some variation on how well the digits are written and the thickness of the line.

You can also group the digits according to the label as done here.
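One possible way to do such a grouping, sketched here with NumPy only, is to pick the indices of the first few examples of each label and then display the corresponding images. The helper name `indices_by_label` is hypothetical and this is not necessarily how the linked example does it.

```python
import numpy as np


def indices_by_label(labels, per_label):
    """Indices of the first `per_label` examples of each label, ordered by label."""
    labels = np.asarray(labels)
    picked = []
    for label in np.unique(labels):
        # Positions of this label, in their original order.
        picked.append(np.flatnonzero(labels == label)[:per_label])
    return np.concatenate(picked)


# Typical use: ten examples of each digit, one digit per grid row.
# grouped = X_train[indices_by_label(y_train, per_label=10)]
```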

Testing a Multilayer Perceptron

The simplest, oldest and best-known type of neural network is the multilayer perceptron (MLP). However, we will use some tricks discovered in the last few years to improve the learning, both in quality and in speed. The first trick is to use the ReLU activation instead of the sigmoid one, see here why. The second trick is to use dropout for regularization, see here why.
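To make the two tricks concrete, here is a small NumPy sketch of the sigmoid, ReLU and dropout operations. The dropout shown is the common "inverted" variant, which rescales the kept units during training so that nothing has to change at test time; the function names and the `rng` parameter are our own choices for illustration.

```python
import numpy as np


def sigmoid(x):
    """Squashes values into (0, 1); saturates for large |x|."""
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    """Keeps positive values unchanged and sets negative ones to zero."""
    return np.maximum(0.0, x)


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return x  # dropout is a no-op at test time
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


# Typical use:
# rng = np.random.default_rng(0)
# hidden = dropout(relu(np.array([-1.0, 0.5, 2.0])), rate=0.2, rng=rng)
```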

This example is based on the MLP example in the Keras GitHub repository. In this post we will only make the neural networks learn a good classification. We will see how to improve the result by tuning the hyper-parameters (structure of the model, optimization algorithm, learning rate, etc.) in later posts.
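The original listing is not reproduced here, so the following is a sketch in the spirit of the Keras MLP example rather than the exact code of this post. The layer sizes (512), dropout rate (0.2), optimizer (RMSprop), batch size and number of epochs are assumptions, as are the helper names `prepare_for_mlp` and `build_mlp`.

```python
import numpy as np


def prepare_for_mlp(X, y, num_classes=10):
    """Flatten images to vectors in [0, 1] and one-hot encode the labels."""
    X = X.reshape(len(X), -1).astype("float32") / 255.0
    Y = np.eye(num_classes, dtype="float32")[y]
    return X, Y


def build_mlp(input_dim=784, hidden=512, drop=0.2, num_classes=10):
    """Two hidden ReLU layers with dropout, then a softmax output."""
    # Imported here so that prepare_for_mlp can be used without Keras.
    import keras
    from keras import layers

    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(hidden, activation="relu"),
        layers.Dropout(drop),
        layers.Dense(hidden, activation="relu"),
        layers.Dropout(drop),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="rmsprop",
                  metrics=["accuracy"])
    return model


# Typical use, with the arrays returned by keras.datasets.mnist.load_data():
# X_tr, Y_tr = prepare_for_mlp(X_train, y_train)
# X_te, Y_te = prepare_for_mlp(X_test, y_test)
# history = build_mlp().fit(X_tr, Y_tr, batch_size=128, epochs=20,
#                           validation_data=(X_te, Y_te))
```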

This code generates this model:


With this model we achieve an accuracy of 98.43%, which is good but not great. You can compare the score to other models on the MNIST page. There are lots of improvements to make to the model, but for a first model, it is more than honorable.

During training, the loss and accuracy on both the training and the test set are displayed. Here are those numbers displayed on a graph:



This data is very important for analysing how well the learning went. However, it’s not that easy to read. The most important thing to understand is that you cannot directly compare training and test loss/accuracy values. As explained in the Keras FAQ, during training the network uses dropout and regularization when computing the scores, whereas during testing it does not. That is why, on the previous graph, for the first epochs the loss is higher (and the accuracy is lower) for the training set than for the test set. Without dropout and regularization, the model would almost always score better on the training set than on the test set.

What we can say using this graph is that the optimization process is working: the training loss decreases as data is fed to the algorithm. However, looking at the test accuracy, it seems to reach a plateau after 10 epochs. It does not seem that adding more epochs would increase the accuracy. The model does not overfit; that is, it does not learn features specific to the training set that are not relevant to the test set. If that were the case, the test accuracy would decrease while the training accuracy kept increasing.
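Reading a plateau off a graph can also be done programmatically. The sketch below is one simple way to find the epoch after which a per-epoch metric stopped improving; the function name `plateau_epoch`, the thresholds and the numbers in the usage comment are all made up for illustration and are not the results of this post.

```python
def plateau_epoch(values, min_delta=0.001, patience=3):
    """Return the 1-based epoch of the last real improvement, or None.

    An improvement only counts if it exceeds the best value so far by more
    than `min_delta`. A plateau is declared once `patience` epochs have
    passed without such an improvement.
    """
    best = float("-inf")
    best_epoch = 0
    for epoch, value in enumerate(values, start=1):
        if value > best + min_delta:
            best, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None


# Typical use with hypothetical test accuracies, one value per epoch:
# plateau_epoch([0.90, 0.95, 0.97, 0.975, 0.9751, 0.9752, 0.9749])
```

The same helper applied to the negated test loss would flag the point where the test loss stops decreasing, which is where overfitting would start to show.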

Testing a Convolutional Neural Network

MLPs are now rarely used alone to classify images. The state-of-the-art neural networks are now based on convolutions and are called ConvNets. For this model we will again use the ReLU activation and dropout. This model is also based on the CNN example in the Keras GitHub repository.
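As for the MLP, the original listing is not reproduced here, so this is a sketch rather than the exact code of this post. The two 32-filter convolutions without padding and the 2×2 max pooling follow the description in this section; the 3×3 kernels, the dense layer size (128), the dropout rates and the optimizer are assumptions. It also uses the channels-last layout (28, 28, 1), so the tensor before flattening has shape (12, 12, 32) rather than (32, 12, 12).

```python
import numpy as np


def prepare_for_cnn(X):
    """Add a channel axis and scale to [0, 1]: (n, 28, 28) -> (n, 28, 28, 1)."""
    return X[..., np.newaxis].astype("float32") / 255.0


def build_cnn(input_shape=(28, 28, 1), num_classes=10):
    """Two convolutions, max pooling, then a small dense classifier."""
    # Imported here so that prepare_for_cnn can be used without Keras.
    import keras
    from keras import layers

    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),  # -> 26 x 26 x 32
        layers.Conv2D(32, (3, 3), activation="relu"),  # -> 24 x 24 x 32
        layers.MaxPooling2D(pool_size=(2, 2)),         # -> 12 x 12 x 32
        layers.Dropout(0.25),
        layers.Flatten(),                              # -> 4608 values
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",  # an assumption, not the post's choice
                  metrics=["accuracy"])
    return model


# Typical use: labels are one-hot encoded as for the MLP.
# model = build_cnn()
# model.fit(prepare_for_cnn(X_train), Y_train, batch_size=128, epochs=12)
```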

This code generates this model:



We can see that, as the convolution layers have a stride/subsample of 1 and no zero-padding, the output width and height are reduced by 2 at each layer. With zero-padding, the output would have the same height and width as the input. Note that the convolution layers all have a depth of 32, but it could be a depth of 32 for the first layer and, for instance, a depth of 64 for the second layer.

The max pooling layer is of size 2×2 and, by default, non-overlapping (so in our case a stride of 2), so it divides the height and width by two. This is the most commonly used type of max pooling layer. Max pooling layers of 3×3 and larger were often found to be too destructive to give good results.

For the dense (fully connected) layers, the first has 4608 inputs because the flatten layer flattens a tensor of shape (32, 12, 12) which contains 32*12*12=4608 values.
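The shape arithmetic of the last three paragraphs can be checked with a few lines of plain Python. This is a small sketch that assumes 3×3 kernels (which is what a reduction of 2 per layer with stride 1 and no padding implies); the helper names are ours.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, pool, stride=None):
    """Output width/height of a pooling layer (stride defaults to pool size)."""
    stride = pool if stride is None else stride
    return (size - pool) // stride + 1


size = 28
size = conv_output_size(size, kernel=3)  # 26
size = conv_output_size(size, kernel=3)  # 24
size = pool_output_size(size, pool=2)    # 12
flat = 32 * size * size                  # 4608 inputs to the first dense layer
print(size, flat)
```

With `padding=1`, `conv_output_size(28, kernel=3, padding=1)` gives 28, which is the zero-padding case mentioned above.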

With this model we achieve an accuracy of 99.14%, which is 0.71 percentage points better than the MLP. It is still not as good as the best models, but it surpasses many classification models.

The evolution of loss and accuracy over the epochs is very similar to the results for the MLP: the training loss decreases over the epochs and the test accuracy increases to a plateau after 8 to 10 epochs. In this case it is very obvious that the training and test loss and accuracy values cannot be compared directly: the values are worse for the training set than for the test set, again because they are not really the same measures.


In this post we discovered the MNIST database, which is very useful for testing new models on simple but real-world data. Learning is quite fast on this kind of data, which allows us to test many different configurations.

On this data, we applied a simple Multilayer Perceptron to get a grasp of how to define neural networks in Keras. This model gave decent results despite being very easy to code and to train. We then coded a Convolutional Network, which is closer to state-of-the-art models. The results were better than the MLP's but still far from the best results obtained on the MNIST dataset. There is still room to optimize the architecture of the model and the learning algorithms.



  1. Hey,
    Lots of thanks for the efforts.

    The comments linking the code to the lectures, especially those of CS231, are so valuable.

    However, the code might not work because you do not set the image ordering.
    You should add this at the beginning of your code (or more specifically, before training your model):

    from keras import backend as K

    Refer to this link on stackoverflow for more details:

    Ahmed Desoky
    Algorithms SW Engineer
    Avelabs LLC

  2. How did you plot the graphs?

    • Fabien Tencé

      Hi Sandhya,

      For the graphs representing the layout of the networks I used the plot function in keras.utils.visualize_util.

      For the loss/accuracy graphs, I used callbacks, here is the code:

  3. domenico

    hi! Is there a way for testing, the same script, with a “single decision tree model” in keras, instead of MLP/CNN?

