
Kaggle First Steps With Julia (Chars74k): First Place using Convolutional Neural Networks

Introduction

In this article, I will describe how to design a Convolutional Neural Network (CNN) with Keras to score over 0.86 accuracy in the Kaggle competition First Steps With Julia. I will explain precisely how to get to this result, from data to submission. All the Python code is, of course, included. This work is inspired by Florian Muellerklein’s Using deep learning to read street signs.

The goal of the Kaggle competition First Steps With Julia is to classify images of characters taken from natural images. These images come from a subset of the Chars74k dataset. This competition normally serves as a tutorial on how to use the Julia language, but a CNN is the tool of choice to tackle this kind of problem.

Data Preprocessing

First things first, you have to get the data from Kaggle.

Image Color

Almost all images in the train and test sets are color images. The first step in the preprocessing is to convert all images to grayscale. It simplifies the data fed to the network and makes it easier to generalize, a blue letter being equivalent to a red letter. This preprocessing should have almost no negative impact on the final accuracy because most text has high contrast with its background.

Image Resizing

As the images have different shapes and sizes, we have to normalize them for the model. This normalization raises two main questions: which size do we choose, and do we keep the aspect ratio?

Initially, I thought keeping the aspect ratio would be better because it would not distort the image arbitrarily; stretching the images could also lead to confusion between O and 0 (capital o and zero). However, after some tests, it seems that the results are better without keeping the aspect ratio. Maybe my filling strategy (see the code below) is not the best one.

Concerning the image size, 16×16 images allow very fast training but don’t give the best results. These small images are perfect to rapidly test ideas. Using 32×32 images makes the training quite fast and gives good accuracy. Finally, using 64×64 images makes the training quite slow and marginally improves the results compared to 32×32 images. I chose to use 32×32 images because it is the best trade-off between speed and accuracy.

Label Conversion

We also have to convert the labels from characters to one-hot vectors. This is mandatory to feed the label information to the network. This is a two-step procedure. First, we have to find a way to convert characters to consecutive integers and back. Second, we have to convert each integer to a one-hot vector.

The Code

Here is the code to convert labels to consecutive integers and back:
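A minimal sketch of such a conversion, assuming the 62 classes are the digits plus the upper- and lower-case Latin letters (the ordering is arbitrary but must be used consistently for encoding and decoding), and that train_labels is the list of character labels read from the competition’s labels file:

import string

import numpy as np
from keras.utils import np_utils

# The 62 classes: digits, then upper-case, then lower-case letters.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase

def label_to_int(char):
    # Map a character label to a consecutive integer in [0, 61].
    return CLASSES.index(char)

def int_to_label(index):
    # Map an integer back to its character label.
    return CLASSES[index]

# One-hot encoding of the training labels.
y_int = np.array([label_to_int(c) for c in train_labels])
Y_train = np_utils.to_categorical(y_int, len(CLASSES))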

Here is the preprocessing code. It assumes the data is in ../data/train and ../data/test:
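A sketch of these steps, assuming Pillow for image handling and Theano dimension ordering; the *.Bmp extension is taken from the competition files, adjust if yours differ:

import glob
import os

import numpy as np
from PIL import Image

def load_images(folder, img_rows=32, img_cols=32):
    # Load every image: grayscale, resize to 32x32 without preserving
    # the aspect ratio, and scale pixel values to [0, 1].
    data, ids = [], []
    for path in sorted(glob.glob(os.path.join(folder, '*.Bmp'))):
        img = Image.open(path).convert('L')
        img = img.resize((img_cols, img_rows), Image.ANTIALIAS)
        data.append(np.asarray(img, dtype=np.float32) / 255.0)
        ids.append(os.path.splitext(os.path.basename(path))[0])
    # Theano dimension ordering: (samples, channels, rows, cols).
    return np.array(data)[:, np.newaxis, :, :], ids

X_train, train_ids = load_images('../data/train')
X_test, test_ids = load_images('../data/test')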

Data Augmentation

Instead of using the training data as it is, we can apply some augmentations to artificially increase the size of the training set with “new” images. Augmentations are random transformations applied to the initial data to produce a modified version of it. These transformations can be a zoom, a rotation, etc., or a combination of these.

Conveniently, there is a class for image augmentation in Keras: ImageDataGenerator.

Using ImageDataGenerator

The ImageDataGenerator constructor takes several parameters to define the augmentations we want to use. I will only go through the parameters useful for our case; see the documentation if you need other modifications to your images:

  • featurewise_center, featurewise_std_normalization and zca_whitening are not used as they don’t increase the performance of the network. If you want to test these options, be sure to compute the relevant quantities with fit and apply these modifications to your test set with standardize.
  • rotation_range: best results for values around 20.
  • width_shift_range: best results for values around 0.15.
  • height_shift_range: best results for values around 0.15.
  • shear_range: best results for values around 0.4.
  • zoom_range: best results for values around 0.3.
  • channel_shift_range: best results for values around 0.1.

Of course, I didn’t test all the combinations, so there must be other values which increase the final accuracy. Be careful though: too much augmentation (high parameter values) will make the learning slow or even impossible. A generator configured with the values above is sketched below.
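As an illustration, here is such a generator plugged into training with the Keras 1-style API (X_train, Y_train and model are assumed to be the arrays and network defined elsewhere in this article):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # degrees
    width_shift_range=0.15,   # fraction of image width
    height_shift_range=0.15,  # fraction of image height
    shear_range=0.4,          # shear intensity
    zoom_range=0.3,
    channel_shift_range=0.1,
)

# Stream augmented batches to the model during training.
model.fit_generator(datagen.flow(X_train, Y_train, batch_size=128),
                    samples_per_epoch=len(X_train), nb_epoch=500)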

I also added the possibility for the ImageDataGenerator to randomly invert the values; the code is below. The parameters are:

  • channel_flip: best set to True.
  • channel_flip_max: should be set to 1.0, as we normalized the data between 0 and 1.

Modifications to ImageDataGenerator

In order to invert the colors of an image in a random fashion, you must edit the file keras/preprocessing/image.py. At the end of the random_transform function, add:
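A plausible version of this addition, assuming random_transform receives and returns a single image array x and that the data lies in [0, channel_flip_max]:

# At the end of random_transform, just before returning x:
# with probability 0.5, invert all pixel values.
if self.channel_flip and np.random.random() < 0.5:
    x = self.channel_flip_max - x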

The constructor of the ImageDataGenerator is also modified to add the new parameters:
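Equivalently, instead of patching the Keras source, the same behavior can be obtained with a small subclass. This is a sketch using the parameter names listed above, not the author’s exact patch:

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

class FlippingImageDataGenerator(ImageDataGenerator):
    """ImageDataGenerator with optional random value inversion."""

    def __init__(self, channel_flip=False, channel_flip_max=1.0, **kwargs):
        super(FlippingImageDataGenerator, self).__init__(**kwargs)
        self.channel_flip = channel_flip
        self.channel_flip_max = channel_flip_max

    def random_transform(self, x):
        x = super(FlippingImageDataGenerator, self).random_transform(x)
        if self.channel_flip and np.random.random() < 0.5:
            x = self.channel_flip_max - x  # invert pixel values
        return x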

Model

I tried two different architectures:

  • Vikesh’s CNN-2, which you can find in his article. It scores 86.52% accuracy on the whole Chars74k dataset. However, this Kaggle competition only uses a subset of Chars74k, and I only managed to get around 80% validation accuracy with it.
  • Florian Muellerklein’s VGG-like network, which you can find here. I had the best scores with this one, with over 85% validation accuracy, so I’ll describe it in detail.

As a picture is worth a thousand words, here is the model:

[Figure: cnn_final_model, diagram of the final CNN architecture]

This model is very similar to Florian Muellerklein’s. I added zero-padding to my convolutional layers and increased the size of the dense layers. Dropout is set to 0.5.

This model gives the best results but it is slow to train. You can also get good results by dividing all the filter counts and dense layer sizes by 2 or 4. Smaller networks are very useful for testing different hyper-parameters. It is worth noting that increasing the size of the network does increase validation accuracy, but it also drastically increases learning time.

I also tried adding more layers, but the resulting networks had difficulty converging and didn’t give good results.
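For reference, here is a Keras 1-style sketch of such a VGG-like network. The filter counts and dense sizes are illustrative assumptions; the exact configuration is the one shown in the figure above:

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
# Zero-padded convolutional blocks (border_mode='same').
model.add(Convolution2D(128, 3, 3, border_mode='same', activation='relu',
                        init='he_normal', input_shape=(1, 32, 32)))
model.add(Convolution2D(128, 3, 3, border_mode='same', activation='relu',
                        init='he_normal'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(256, 3, 3, border_mode='same', activation='relu',
                        init='he_normal'))
model.add(Convolution2D(256, 3, 3, border_mode='same', activation='relu',
                        init='he_normal'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# Large fully-connected layers with 0.5 dropout.
model.add(Flatten())
model.add(Dense(2048, activation='relu', init='he_normal'))
model.add(Dropout(0.5))
model.add(Dense(2048, activation='relu', init='he_normal'))
model.add(Dropout(0.5))
# 62 classes: 0-9, A-Z, a-z.
model.add(Dense(62, activation='softmax', init='he_normal'))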

Learning

Be sure to look at my article on how to improve the performance of Theano before beginning the training, as it can save you a fair amount of time.

For this training, I used a categorical cross-entropy loss function, as the final layer uses a softmax activation.

Algorithm

Instead of using the classic but often slow stochastic gradient descent (SGD) algorithm, I chose to use AdaMax and AdaDelta. I found that AdaMax tends to give better results than AdaDelta for this problem. However, for complex networks with numerous filters and large fully-connected layers, AdaMax struggles to converge in the first epochs or even fails to converge at all. For my network, I run a few epochs (~20) of AdaDelta before switching to AdaMax. This strategy is not necessary if you divided the size of the network by 2 or more.
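A sketch of this two-phase schedule; recompiling swaps the optimizer while keeping the learned weights (X_train, Y_train, X_val and Y_val are assumed to be defined, and the epoch split is illustrative):

from keras.optimizers import Adadelta, Adamax

# Phase 1: ~20 epochs of AdaDelta to get the network converging.
model.compile(loss='categorical_crossentropy', optimizer=Adadelta(),
              metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=128, nb_epoch=20,
          validation_data=(X_val, Y_val))

# Phase 2: switch to AdaMax for the remaining epochs; compiling
# again resets the optimizer state but keeps the weights.
model.compile(loss='categorical_crossentropy', optimizer=Adamax(),
              metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=128, nb_epoch=480,
          validation_data=(X_val, Y_val))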

Batch Size

While keeping the number of epochs constant, I tried changing the batch size. Bigger batches make the algorithm run faster but give worse results. This may come from the fact that, for the same amount of data, bigger batches mean fewer updates. Anyway, the best results were achieved with a batch size of 128.

Layer Initialization

The optimization algorithm can fail to find an optimum if the network is not properly initialized. I found that using the initialization presented by He et al. made the learning much easier. In Keras, you just have to use the init='he_normal' parameter for each layer.

Learning Rate Decay

It is usually a good idea to decrease the learning rate during the training. It allows the algorithm to fine-tune the parameters and get closer to a local minimum. However, I found that with AdaMax, results were better without learning rate decay, so we don’t have to worry about that here.

Number of Epochs

Using a batch size of 128 and no learning rate decay, I tested from 200 to 500 epochs. Even at 500 epochs, the network doesn’t seem to overfit, certainly thanks to dropout. I found that 500 epochs give slightly better results than 300 epochs. I went for 500 epochs, but if you’re in a hurry or running on a CPU, 300 epochs is sufficient.

Cross-Validation

To evaluate the quality of the different models and the influence of the hyper-parameters, I use Monte Carlo cross-validation: I randomly split the initial data, 1/4 for validation and 3/4 for learning. I also use stratification, a splitting technique ensuring that, in our case, 1/4 of the images of each class are present in the validation set. This results in a more stable validation score.
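One way to produce such a split, using scikit-learn as an assumption (y_int are the integer labels from the label-conversion sketch; changing seed at each run gives the Monte Carlo repetitions):

from sklearn.model_selection import train_test_split

# Stratified split: 3/4 learning, 1/4 validation.
X_learn, X_val, Y_learn, Y_val = train_test_split(
    X_train, Y_train, test_size=0.25, stratify=y_int, random_state=seed)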

Submission

In Kaggle competitions, the use of several models to get the final prediction is now widespread. These methods, called ensembling, have proven to be successful. Two of them are often used: averaging and stacking. I went for the averaging method as it is simpler. I will surely try stacking in future competitions. Note that ensembling methods work best with different types of models, but they can also slightly improve the results for neural nets.

There is still a question I asked myself: do I keep a validation set for each of these models, or is it better to learn on the whole training data and pick the last model? I tested these two methods by splitting the data with stratification before learning, to get a held-out test set. The results are clear: with validation sets, the averaged result scores 1% better than without.

For my submission, I generated 18 predictions in about 36 hours using my GTX 960. All those predictions were used in the averaging. I didn’t test the impact of the number of predictions on the final result, but it is obvious that each new prediction has less and less impact on the final result.

The model reaching the highest validation accuracy scored 0.8623, which is lower than the final Kaggle score obtained using averaging. It’s highly likely that averaging increased the final score by around 0.01.

The Code

Augmentation, Model, Learning and Prediction

Averaging
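A minimal sketch of the averaging, assuming each run saved its softmax probabilities for the test set as a pred_*.npy file (hypothetical names) and reusing int_to_label and test_ids from the sketches above; the ID and Class columns follow the competition’s sample submission:

import glob

import numpy as np

# Average the per-class probabilities of all runs, then take the argmax.
probs = np.mean([np.load(f) for f in sorted(glob.glob('pred_*.npy'))], axis=0)
predictions = [int_to_label(i) for i in np.argmax(probs, axis=1)]

# Write one (ID, Class) row per test image.
with open('avg_pred.csv', 'w') as out:
    out.write('ID,Class\n')
    for image_id, label in zip(test_ids, predictions):
        out.write('%s,%s\n' % (image_id, label))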

Conclusion

[Figure: kagglefirstplace, screenshot of the Kaggle leaderboard showing first place]

After submitting to Kaggle, you should have a score around 0.8678 (avg_pred.csv), which should get you to the top of the leaderboard. However, you must ask yourself: how can I score higher?

Well, the CNN makes few mistakes that a human wouldn’t make, so there is little room for improvement without overfitting to the dataset. The main source of errors is confusion between:

  • lower and capital letters
  • 1, l and I (one, lower L and capital i)
  • 0 and O (zero and capital o)
  • Letters rotated by 90° and other letters

Humans would distinguish between all these cases with the help of context. Unfortunately, we do not have any context in the dataset.

However, there might be a way to detect and better classify letters rotated by 90°. By training an algorithm to classify straight letters versus rotated letters, one could preprocess the datasets to unrotate the rotated images. To be tested…

7 Comments

  1. zorro

    channel_flip = True, # You must modify the ImageDataGenerator class for that parameter to work.

    Can you please explain this

    • Fabien Tencé

      Hi,

      See the section “Modifications to ImageDataGenerator” for an explanation.

  2. Siva

    Hi,

    I forked your model and tried to train it. The accuracy is very low (0.07) and it got stuck.

    Can you please explain why ?

    Thanks,
    Siva N B

    • Fabien Tencé

      Hi Siva,

      Did you use the exact same code as in this article?
      The model struggles to converge if you only use AdaMax. If you use AdaDelta at the beginning, it should converge.
      You can try to make the model simpler to see if there is a problem: divide the first parameter of the Convolution2D and Dense layers by 2, 4 or 8. If it does not converge at /8, there is a problem; check the data.

    • Rob

      Were you able to remedy this problem. I am running into the same issue.

      • Siva

        Yes. The problem was with the image dim ordering. If you are using Keras with TensorFlow, set “image_dim_ordering: th” in ~/.keras/keras.json.

        @Fabien, I use the same approach as yours, two different optimizers. I was able to achieve 99.7% accuracy with validation set but when I tried with a new Test set, i got almost 0% accuracy.

        Is the model overfitting ? Is it possible for you to post your predictions for the Kaggle Testing set ?

        Thanks,
        Siva N B

        • Fabien Tencé

          You’re right, I forgot about the dimension ordering! The code works with a Theano backend. With a few modifications it should work with TensorFlow.

          I wouldn’t recommend changing keras.json as you could have issues with other projects: keep image_dim_ordering: tf if you are using TensorFlow. I think the simplest way is to replace data = data[:,np.newaxis,:,:] with data = data[:,:,:,np.newaxis] in the preprocessing code. You also have to change input_shape=(1, img_rows, img_cols) to input_shape=(img_rows, img_cols, 1).

          The nicest way should be to modify X_train_all and input_shape depending on the backend, I’ll do that soon.

          It’s a nice idea to use several optimizers. Getting 0% accuracy is actually a telling sign: even random predictions should give you higher accuracy, so something systematic must be wrong. I hope the modifications resolve your issue.

          99.7% accuracy may indicate overfitting, but you shouldn’t get 0% on the test set; it is most certainly a problem in the data. I’ll add the predictions to the article, as the rules specify that shared information must be public.

          edit: the prediction for the test set is in the conclusion section.
