
Making Theano Faster with CuDNN and CNMeM on Windows 10

Introduction

After installing Theano and CUDA (see my previous article), you can tweak your configuration to substantially increase the speed of your networks. There are few, if any, drawbacks; the only requirement is that you use Theano with CUDA. This article specifically addresses getting rid of the following message each time you load Theano: (CNMeM is disabled, CuDNN not available).

Installing CuDNN

CuDNN is a library for CUDA, developed by NVIDIA, which provides highly tuned implementations of primitives for deep neural networks. CuDNN is said to make deep nets run faster while sometimes using less memory.

Here are the steps to make Theano use CuDNN:

  1. Register at NVIDIA cuDNN.
  2. Download cuDNN v5 Library for Windows 10.
  3. Extract the archive and copy the 3 directories (bin, include and lib) into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5 or C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0. If you're unsure which version of CUDA Theano uses, check your CUDA_PATH environment variable.
  4. Open a Python console and enter import theano. You should see the message (CNMeM is disabled, CuDNN 5004). If you get errors, try upgrading Theano to the bleeding-edge version (see here), especially if you see the message UserWarning: Your CuDNN version is more recent than Theano. If problems persist, try updating Theano or downgrading CuDNN to version 4. You may also check that your .theanorc file does not contain a line excluding the cuDNN optimizations, such as:
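
  [global]
  # example: excluding this optimizer tag disables the cuDNN convolutions
  optimizer_excluding = conv_dnn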

About The .theanorc File

The .theanorc file should be located in your user profile. As you cannot create a file whose name begins with a dot in the file explorer, you have to open a terminal and type:
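
  rem create an empty .theanorc in your user profile (one way among others)
  cd %USERPROFILE%
  type NUL > .theanorc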

Then you can open the .theanorc  file and add the following lines:
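
  # a typical minimal configuration to run Theano on the GPU
  [global]
  device = gpu
  floatX = float32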

Depending on your installation setup, you may also have to add:
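
  # example paths, to be adjusted to your own Anaconda and CUDA installations
  [nvcc]
  flags = -LC:\Anaconda\libs
  compiler_bindir = C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin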

You have to adjust the paths to wherever you installed Anaconda and CUDA. Be careful with the nvcc flags: they do not handle spaces, so you have to replace them with =.

Using CNMeM

CNMeM is a library, developed by NVIDIA, which helps deep learning frameworks manage CUDA memory. CNMeM is already integrated into Theano, so you don't have to install anything. To enable CNMeM in Theano, add the following lines to the .theanorc file:
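
  [lib]
  # example value: reserve 80% of the GPU memory for Theano
  cnmem = 0.8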

The cnmem value specifies the amount of GPU memory allocated for Theano. To quote the documentation:

  • 0: not enabled.
  • 0 < N <= 1: use this fraction of the total GPU memory (clipped to 0.95 for driver memory). The value must be a float, for instance 0.25 or 1.0.
  • > 1: use this number of megabytes (MB) of memory.
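
For instance, each of the following lines is a valid setting (use only one of them):

  # CNMeM disabled
  cnmem = 0
  # use 80% of the total GPU memory
  cnmem = 0.8
  # use 2048 MB of memory
  cnmem = 2048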

In theory, for cards dedicated to deep learning you should set the value to 1.0. For cards that also render to the screen, use a value around 0.8; it depends on which applications you run while Theano is working (web browser, etc.) and on the size of your GPU memory.

In practice, using a graphics card with 4 GB of memory, I couldn't go over exactly 3395 MB, which is approximately 83% of the memory. Whether the graphics card was dedicated to deep learning or plugged into a monitor did not change this value. Of course, all the software I used to monitor GPU memory usage (GPU-Z and nvidia-smi) reported that several hundred MB were left free, so it must be a bug in CNMeM, Theano, my card, or a combination of all three.

Be careful! If you enable CNMeM, you should set the highest possible value, as low values can result in memory fragmentation; see the “Performance Gains” section.

To test whether Theano uses CNMeM, open a Python interpreter and type import theano. You should see (CNMeM is enabled with initial size: 80.0% of memory, CuDNN 5004) if everything's OK.
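
For example, the full startup message looks like the following; the device name depends on your hardware:

  >>> import theano
  Using gpu device 0: GeForce GTX 960 (CNMeM is enabled with initial size: 80.0% of memory, CuDNN 5004)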

About Installing OpenBLAS

OpenBLAS is an optimized Basic Linear Algebra Subprograms (BLAS) library. Usually, Theano uses the default BLAS library through Numpy. It is also possible to link Theano directly to one of the fastest BLAS libraries: OpenBLAS.

Doing so makes Theano run significantly faster on the CPU; from what I measured, it does not change performance when running Theano on the GPU. If you still want to give it a try, here are the steps:

  • Download mingw64_dll.zip
  • Download OpenBLAS-v0.2.14-Win64-int32.zip
  • Create a directory to put OpenBlas,  C:\openblas for instance
  • Copy the DLLs from mingw64_dll  and OpenBlas/bin  into C:\openblas
  • Add the following lines to your .theanorc file (shown after this list)
  • Add C:\openblas to the PATH environment variable
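
Assuming OpenBLAS was copied to C:\openblas as above, the .theanorc lines are:

  [blas]
  ldflags = -LC:\openblas -lopenblas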

You can test whether there is any gain in speed using the check_blas.py script in your Theano installation, under the theano/misc directory. In my case it did improve performance on the CPU, but it didn't change a thing when running on the GPU (which is a lot faster).
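
For instance, you can run it like this (the Anaconda path is an example to adapt):

  python C:\Anaconda2\Lib\site-packages\theano\misc\check_blas.py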

Performance Gains

To evaluate how much performance is improved by CuDNN and CNMeM, I ran several tests. The protocol is the following: time and memory usage are monitored for a single forward pass over random samples, with increasing batch sizes.
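
As an illustration, here is a minimal sketch of this kind of protocol in pure Theano. It is not the exact benchmark code, and the layer sizes are arbitrary:

  import time
  import numpy as np
  import theano
  import theano.tensor as T

  # a single dense layer with ReLU, standing in for a real network
  x = T.matrix("x")
  W = theano.shared(np.random.randn(784, 512).astype("float32"))
  b = theano.shared(np.zeros(512, dtype="float32"))
  forward = theano.function([x], T.nnet.relu(T.dot(x, W) + b))

  for batch_size in [1, 10, 100, 1000]:
      data = np.random.rand(batch_size, 784).astype("float32")
      forward(data)  # warm-up call, not timed
      start = time.time()
      forward(data)
      print("batch %d: %.5f s" % (batch_size, time.time() - start))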

VGG16

Here are the results for the VGG16 network:

[Figure: forward-pass time and memory usage vs. batch size for VGG16, with and without CuDNN and CNMeM]

Speed

Both CuDNN and CNMeM significantly improve the speed of the network, with CuDNN giving the biggest boost. To achieve maximum performance, both CuDNN and CNMeM should be enabled.

Memory Usage

Using CuDNN does not seem to impact memory usage. However, using CNMeM makes Theano use a constant amount of memory, no matter how small the batch size is. If you really want to optimize the amount of memory dedicated to CNMeM, you should adjust it depending on the size of the batches you feed to your network.

Memory Fragmentation

The Theano documentation warns of memory fragmentation if using low values for CNMeM. I tested this behavior using a CNMeM value of 0.4:

[Figure: forward-pass time and memory usage for VGG16 with CNMeM at 0.4, showing fragmentation]

The lines for CNMeM(0.4) are truncated because, for batch sizes over 40, Theano crashes for lack of memory. We can see that when CNMeM runs out of memory and there is enough left on the graphics card, memory usage goes up. However, it increases faster than without CNMeM, which may indicate that memory fragmentation happens. In this case, CNMeM uses much more memory when reallocating than when it has a big enough memory pool from the start.

MLP Gain

The last test checks whether CuDNN and CNMeM can improve performance on very simple networks composed only of dense layers with ReLU and softmax activation functions:

[Figure: forward-pass time and memory usage vs. batch size for the MLP, with and without CuDNN and CNMeM]

In this case, the CuDNN library does not improve speed but the CNMeM library does. So, while CuDNN provides the bigger speed boost on convnets, CNMeM seems to increase performance for most networks. Regarding memory usage, the behavior is similar to that of the VGG16 network.

Other Optimizations

There are other tweaks to make Theano run faster (see here). I did not notice any gain similar in magnitude to CuDNN and CNMeM, so I advise you to stick to these two.

Conclusion

Using Theano on the GPU with CUDA makes it run a lot faster than on the CPU. In this article, I showed that it is also possible to significantly increase performance by making Theano and CUDA use CuDNN and CNMeM. Convnets seem to benefit the most from this performance increase, but classic neural networks can be sped up too.

Enabling CuDNN does not seem to have any drawbacks. Regarding CNMeM, you should try to enable it with as much memory as possible. Exceptions are when you want to run several Theano sessions in parallel, or when your graphics card is plugged into a monitor and you use graphics-heavy applications.


10 Comments

  1. Jan de Lange

    Thanks Fabien, great help!

    I can now choose to use cnmem=1.0 with my monitor connected to the internal graphics card, or cnmem=0.8 if it is connected to the GPU (Gigabyte GTX 960 4GB).

    Your trick to deal with the space in “Program Files” is also helpful. Before, I had to set the flags via the THEANO_FLAGS environment variable.

    Jan

    • Fabien Tencé

      Hi Jan,

      You’re welcome. Like you, I connected my monitor to the internal graphics card, but I couldn’t use cnmem=1. It must be a problem with my setup, because I have a Zotac GTX 960 4 GB which is very similar to your card. I will try to upgrade all my drivers and software to see if it fixes the problem; thanks for the info.

  2. duy cuong

    Awesome instructions, thank you so much!!!
    Anyway, I followed your instructions, but when I check the numpy config, why does it still show that the openblas package is “NOT AVAILABLE”?
    >>>import numpy as np
    >>>np.__config__.show()

    • Fabien Tencé

      Hi Cuong,

      By default, Theano uses the BLAS linked with Numpy. But if you link Theano with OpenBLAS, Numpy is still linked to its default BLAS library. You have to rebuild Numpy to link it against another BLAS library. I’m not much help here because I have never done so.

  3. This is great and the improvements are significant!

    I’ve gone from 77secs/epoch to 23secs/epoch by enabling CNMeM and CuDNN on an AWS g2.2xlarge instance. I’ve been using the CNN here: https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py

    Thanks!

  4. Vicky

    I followed your instructions to install CuDNN. But when I ‘import theano’, it still says cuDNN not available.

    • Fabien Tencé

      Hi Vicky,

      Check that the version of cuDNN you downloaded matches your CUDA version. You can also open a command line and type:
      if exist "%CUDA_PATH%\include\cudnn.h" (echo OK) else (echo NOK)
      if exist "%CUDA_PATH%\lib\x64\cudnn.lib" (echo OK) else (echo NOK)
      if exist "%CUDA_PATH%\bin\cudnn64_5.dll" (echo OK) else (echo NOK)

      These lines should all print “OK”. If not, check your CUDA_PATH variable and make sure you copied the directories into the right version of CUDA, in case you have installed several versions. You can use echo %CUDA_PATH% to check that.

  5. Jeroen Devreese

    Following your instructions gave me a factor of 4 speed boost, fantastic!

  6. Le Dac Liem

    Big thanks from Vietnam!

  7. masa

    Hi, I got an error saying that Theano cannot find pygpu.
    Could you help me solve the problem?
