After installing Theano and CUDA (see my previous article) you can tweak your configuration to substantially increase the speed of your networks. There are few or no drawbacks; the only requirement is that you use Theano with CUDA. This article specifically addresses the problem of seeing the following message each time you load Theano: (CNMeM is disabled, CuDNN not available).
CuDNN is a library for CUDA, developed by NVIDIA, which provides highly tuned implementations of primitives for deep neural networks. CuDNN is said to make deep nets run faster and sometimes use less memory.
Here are the steps to make Theano use CuDNN:
- Register at NVIDIA cuDNN.
- Download cuDNN v5 Library for Windows 10.
- Extract the archive and copy the 3 directories ( bin , include and lib ) in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5 or C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0 . If you’re unsure which version of CUDA Theano uses, check your CUDA_PATH environment variable.
- Open a Python console and enter import theano. You should see the message (CNMeM is disabled, CuDNN 5004). If you get errors, try upgrading Theano to the bleeding-edge version (see here), especially if you see the message UserWarning: Your CuDNN version is more recent than Theano. If you see problems, try updating Theano or downgrading CuDNN to version 4. You may also check that your .theanorc file does not contain:
[dnn]
enabled = False
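If you want to check the result programmatically rather than by eye, you can parse the startup banner Theano prints on import. The helper below is a hypothetical illustration (Theano itself only prints the message; it exposes no such function):

```python
import re

def cudnn_status(banner):
    """Extract the CuDNN state from a Theano startup banner such as
    '(CNMeM is disabled, CuDNN 5004)' or
    '(CNMeM is disabled, CuDNN not available)'.

    Returns the CuDNN version string, or None if CuDNN is unavailable.
    """
    match = re.search(r"CuDNN ([^)]+)\)", banner)
    if match is None:
        return None
    version = match.group(1)
    return None if version == "not available" else version

# cudnn_status("(CNMeM is disabled, CuDNN 5004)") -> "5004"
# cudnn_status("(CNMeM is disabled, CuDNN not available)") -> None
```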
About The .theanorc File
The .theanorc file should be located in your user profile. As you cannot create a file beginning with a dot in the file explorer, you have to open a terminal and type:
type NUL > .theanorc
Then you can open the .theanorc file and add the following lines (Theano's configuration file uses INI-style sections, so these go under [global]):
[global]
floatX = float32
device = gpu
Depending on your installation setup, you may also have to add, under an [nvcc] section:
[nvcc]
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin
You have to adjust the paths to match where you installed Anaconda and CUDA. Be careful with the nvcc flags, as they do not handle spaces; you have to replace them with =.
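Putting the pieces together, a complete .theanorc might look like the sketch below. The paths are examples to adjust for your machine, and the [nvcc] section is only needed if Theano cannot locate your compiler on its own:

```ini
[global]
floatX = float32
device = gpu

[nvcc]
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin
```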
CNMeM is a library, developed by NVIDIA, which helps deep learning frameworks manage CUDA memory. CNMeM is already integrated into Theano, so you don't have to install anything. To enable CNMeM in Theano, add the following lines to the .theanorc file, under a [lib] section:
[lib]
cnmem = 0.8
The cnmem value specifies the amount of GPU memory allocated for Theano. To quote the documentation:
- 0: not enabled.
- 0 < N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory). [Note: This should be a float value, for instance 0.25 or 1.0]
- > 1: use this number in megabytes (MB) of memory.
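The quoted rules can be made concrete with a small helper that converts a cnmem setting into the number of megabytes reserved. This is a hypothetical function for illustration, not part of Theano:

```python
def cnmem_to_mb(cnmem, total_mb):
    """Return the GPU memory (in MB) CNMeM would reserve, following
    the rules quoted from the Theano documentation above."""
    if cnmem == 0:
        return 0.0  # CNMeM disabled
    if 0 < cnmem <= 1:
        # Fraction of total GPU memory, clipped to 95% to leave
        # room for driver memory.
        return min(cnmem, 0.95) * total_mb
    # Values above 1 are interpreted directly as megabytes.
    return float(cnmem)

# On a 4096 MB card:
# cnmem_to_mb(0.8, 4096)  -> 3276.8  (80% of the card)
# cnmem_to_mb(1.0, 4096)  -> 3891.2  (clipped to 95%)
# cnmem_to_mb(2048, 4096) -> 2048.0  (absolute MB)
```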
In theory, for cards dedicated to deep learning you should set the value to 1.0. For cards that also render to the screen you should use a value around 0.8; it depends on which applications you run while Theano is working (web browser, etc.) and on the size of your GPU memory.
In practice, using a graphics card with 4 GB of memory, I couldn't go over exactly 3395 MB, which is approximately 83% of the memory. Whether the graphics card was dedicated to deep learning or plugged into a monitor did not change this value. Of course, all the software I used to monitor GPU memory usage (GPU-Z and nvidia-smi) reported that several hundred MB were still free, so it must be a bug in CNMeM, Theano, my card, or the combination of all three.
Be careful! If you enable CNMeM, you should use the highest possible value, as low values can result in memory fragmentation; see the "Performance Gains" section.
To test whether Theano uses CNMeM, open a Python interpreter and type import theano. You should see (CNMeM is enabled with initial size: 80.0% of memory, CuDNN 5004) if everything is OK.
About Installing OpenBLAS
OpenBLAS is an optimized Basic Linear Algebra Subprograms (BLAS) library. Usually, Theano uses the default BLAS library through NumPy. It is also possible to link Theano directly to one of the fastest BLAS libraries: OpenBLAS.
Doing so makes Theano run significantly faster on CPU; from what I measured, it does not change performance when you run Theano on GPU. If you still want to give it a try, here are the steps:
- Download mingw64_dll.zip
- Download OpenBLAS-v0.2.14-Win64-int32.zip
- Create a directory to put OpenBlas, C:\openblas for instance
- Copy the DLLs from mingw64_dll and OpenBlas/bin into C:\openblas
- Add the following lines to your .theanorc file
- Add C:\OpenBlas to the PATH environment variable
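The .theanorc lines for this step typically go in a [blas] section. Assuming the C:\openblas directory created above, a common configuration looks like this (check the Theano configuration documentation for your version before relying on it):

```ini
[blas]
ldflags = -LC:\openblas -lopenblas
```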
You can test whether there is any gain in speed using the check_blas.py script in your Theano installation, under the Theano/misc directory. In my case, it indeed improved performance on the CPU, but it didn't change a thing when running on the GPU (which is a lot faster).
To evaluate how much the performance is improved by CuDNN and CNMeM, I ran several tests. The protocol is the following: time and memory usage is monitored for a single forward pass using random samples and increasing batch size.
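The protocol above can be sketched as a small timing harness. Here forward_fn is a hypothetical stand-in for a compiled Theano function, and the batch sizes and shapes are illustrative:

```python
import time

import numpy as np

def benchmark(forward_fn, input_shape, batch_sizes, repeats=10):
    """Time a single forward pass on random samples for each batch size.

    forward_fn: any callable taking a (batch, *input_shape) array;
    in the article's setup this would be a compiled Theano function.
    Returns a dict mapping batch size to mean seconds per forward pass.
    """
    results = {}
    for batch in batch_sizes:
        x = np.random.rand(batch, *input_shape).astype("float32")
        forward_fn(x)  # warm-up call, excluded from timing
        start = time.perf_counter()
        for _ in range(repeats):
            forward_fn(x)
        results[batch] = (time.perf_counter() - start) / repeats
    return results

# Example with a trivial NumPy "network" standing in for VGG16:
timings = benchmark(lambda x: np.tanh(x).sum(), (3, 32, 32), [1, 8, 32])
```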
Here are the results for the VGG16 network:
Both CuDNN and CNMeM significantly improve the speed of the network, with CuDNN giving the highest speed boost. To achieve maximum performance, both CuDNN and CNMeM should be enabled.
Using CuDNN does not seem to impact memory usage. However, using CNMeM makes Theano use a constant amount of memory, no matter how small the batch size is. If you really want to optimize the amount of memory dedicated to CNMeM, you should adjust it depending on the size of the batches you are feeding to your network.
The Theano documentation warns of memory fragmentation if using low values for CNMeM. I tested this behavior using a CNMeM value of 0.4:
The lines for CNMeM (0.4) are truncated because, for batch sizes over 40, Theano crashes: it needs too much memory. We can see that when CNMeM runs out of memory and enough is still available on the graphics card, memory usage goes up. However, it increases faster than without CNMeM, which may indicate that memory fragmentation is happening. In this case, CNMeM uses much more memory when reallocating than when it has a big enough memory pool from the beginning.
The last test is to see if CuDNN and CNMeM can improve the performance on very simple networks only composed of dense layers with ReLU and softmax activation functions:
In this case, the CuDNN library does not improve the speed, but the CNMeM library does. So, while CuDNN provides the highest speed boost in convnets, CNMeM seems to increase the performance of most networks. Regarding memory usage, the behavior is similar to the one observed for the VGG16 network.
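For reference, the kind of network used in this last test, dense layers with ReLU activations and a softmax output, can be sketched in plain NumPy. This is a hypothetical stand-in for the Theano graph actually benchmarked, with illustrative layer sizes:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dense_forward(x, weights):
    """Forward pass through dense layers: ReLU on hidden layers,
    softmax on the final (output) layer."""
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        x = softmax(x) if i == len(weights) - 1 else relu(x)
    return x

rng = np.random.default_rng(0)
weights = [
    (rng.standard_normal((784, 128)) * 0.01, np.zeros(128)),
    (rng.standard_normal((128, 10)) * 0.01, np.zeros(10)),
]
probs = dense_forward(rng.standard_normal((32, 784)), weights)
# probs has shape (32, 10), and each row sums to 1
```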
There are other tweaks to make Theano run faster (see here). I did not notice any gain similar in magnitude to CuDNN and CNMeM, so I advise you to stick to these two.
Using Theano on the GPU with CUDA makes it run a lot faster than on the CPU. In this article, I showed that it is also possible to significantly increase performance by making Theano and CUDA use CuDNN and CNMeM. Convnets seem to benefit the most from this speed-up, but classic neural networks can be sped up too.
Enabling CuDNN does not seem to have any drawback. Regarding CNMeM, you should try to enable it with the maximum amount of memory. The exceptions are when you want to run several Theano sessions in parallel, or when your graphics card is plugged into a monitor and you use graphics-heavy applications.