When training a Faster-RCNN model using py-faster-rcnn, I noticed that it would randomly crash with this error:
out of memory
invalid argument
an illegal memory access was encountered
F0919 18:01:51.657281 21310 math_functions.cu:81] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
On closer investigation, I found that a training process would crash when I started a second or third training on the same GPU. How can running a second training kill the first one? The only scenario I could think of, based on the above error message, was that a cudaMalloc to get more memory was failing. But why was Caffe trying to get more memory in the middle of training? Shouldn't all the reshaping be done and finished at the beginning of training?
Anyway, the first problem was to reliably reproduce the error, since the training did not always crash when a second training was started. It crashed only once every few attempts. Since I suspected cudaMalloc, I wrote a small adversary CUDA program that would try to grab as much GPU memory as possible. I ran this program a while after training had started and it reliably crashed the training every time!
A core dump file was being generated on crash, but it was useless at first since I was running Caffe compiled in release mode with no debugging symbols. I recompiled Caffe in debug mode with debugging symbols, reproduced the crash and opened the new core dump in GDB:
$ gdb /usr/bin/python core
After the core dump was loaded in GDB, I got its backtrace using bt. It was interesting, but did not point to anything suspicious.
Next, I continuously monitored the GPU memory occupied by the training, using watch and nvidia-smi:
$ watch -n 0.1 nvidia-smi
I noticed that the GPU memory used by the training kept going up and down by around 18MB all the time. If my adversary CUDA program went and grabbed the 18MB that was released, then the training would crash when it tried to allocate that same memory the next time.
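The failure mode described above can be illustrated with a toy model. This is only a sketch: ToyGpu and run_training are hypothetical names, the sizes (a 4096MB card with 4000MB held by long-lived buffers) are made up, and the real CUDA allocator is of course more complicated than a single counter.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of one GPU's memory shared by several processes (sizes in MB).
// Hypothetical stand-in for the real CUDA allocator, for illustration only.
struct ToyGpu {
    std::size_t capacity;
    std::size_t used = 0;

    explicit ToyGpu(std::size_t cap) : capacity(cap) {}

    // Mimics cudaMalloc: succeeds only if enough memory is still free.
    bool alloc(std::size_t mb) {
        if (used + mb > capacity) return false;
        used += mb;
        return true;
    }

    // Mimics cudaFree: the memory goes back to the shared pool.
    void release(std::size_t mb) {
        assert(mb <= used);
        used -= mb;
    }

    std::size_t free_mb() const { return capacity - used; }
};

// Runs `iters` iterations that each allocate and then free `scratch_mb`.
// If `adversary` is true, another process grabs all free memory after
// every iteration. Returns the index of the iteration whose allocation
// failed, or -1 if every iteration succeeded.
int run_training(ToyGpu& gpu, std::size_t scratch_mb, int iters, bool adversary) {
    for (int i = 0; i < iters; ++i) {
        if (!gpu.alloc(scratch_mb)) return i;  // the real training crashes here
        gpu.release(scratch_mb);               // scratch memory is given up
        if (adversary) gpu.alloc(gpu.free_mb());  // someone else takes it
    }
    return -1;
}
```

With 4000MB already allocated on a 4096MB ToyGpu, run_training(gpu, 18, 100, false) returns -1, while the same call with the adversary enabled fails at iteration 1: the first iteration frees its 18MB, the adversary takes everything that is free, and the next allocation has nowhere to go.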
So, who is allocating and releasing memory all the time in py-faster-rcnn? Since I had recently ported the proposal layer from Python to C++, I remembered NMS. There are both CPU and GPU versions of NMS in py-faster-rcnn. The GPU NMS is used by default, though this can be changed in config.py. By switching it to CPU, I found that the crash no longer happened.
But the problem was that the CPU NMS, at 0.2s, was 10 times slower than the GPU NMS, at 0.02s, on my setup.
Once I saw that the GPU NMS code in lib/nms/nms_kernel.cu was calling cudaFree continuously, fixing it was easy. I changed the allocated memory pointers to static and changed the code to hold on to the memory allocated last time. Only when more memory was required would the old buffer be freed and a new, larger one allocated. I basically used the same strategy as std::vector, that is, doubling the memory. A better solution would be to allocate the maximum required memory, based on the box numbers set in config.py, and use it during training.
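The hold-and-grow strategy can be sketched roughly as follows. This is not the actual patch: GrowOnlyBuffer, host_alloc, host_free and nms_mask_bytes are hypothetical names, and malloc/free stand in for cudaMalloc/cudaFree so that the sketch runs without a GPU. The size helper assumes the overlap mask stores one bit per box pair packed into 64-bit words, which is how the mask in nms_kernel.cu appears to be laid out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocation hooks. On the GPU these would wrap cudaMalloc and cudaFree;
// malloc/free stand in here so the sketch runs on the host.
inline void* host_alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void host_free(void* p) { std::free(p); }

using AllocFn = void* (*)(std::size_t);
using FreeFn = void (*)(void*);

// Hypothetical helper (not part of py-faster-rcnn): a scratch buffer that is
// kept between calls and reallocated only when a larger size is requested.
class GrowOnlyBuffer {
public:
    explicit GrowOnlyBuffer(AllocFn a = host_alloc, FreeFn f = host_free)
        : alloc_(a), free_(f) {}
    ~GrowOnlyBuffer() { if (ptr_) free_(ptr_); }
    GrowOnlyBuffer(const GrowOnlyBuffer&) = delete;
    GrowOnlyBuffer& operator=(const GrowOnlyBuffer&) = delete;

    // Returns a buffer of at least `bytes` bytes, or nullptr if allocation
    // fails. The old contents are NOT preserved when the buffer grows.
    void* reserve(std::size_t bytes) {
        if (bytes <= capacity_) return ptr_;   // reuse, no allocator call
        std::size_t new_cap = capacity_ * 2;   // std::vector-style doubling
        if (new_cap < bytes) new_cap = bytes;
        if (ptr_) free_(ptr_);                 // scratch data, nothing to copy
        ptr_ = alloc_(new_cap);
        capacity_ = ptr_ ? new_cap : 0;
        if (ptr_) ++alloc_count_;
        return ptr_;
    }

    std::size_t capacity() const { return capacity_; }
    int alloc_count() const { return alloc_count_; }  // successful allocations

private:
    AllocFn alloc_;
    FreeFn free_;
    void* ptr_ = nullptr;
    std::size_t capacity_ = 0;
    int alloc_count_ = 0;
};

// Assumed mask layout: boxes x ceil(boxes / 64) words of 64 bits each.
// Gives the size to reserve once up front for a known maximum box count.
inline std::size_t nms_mask_bytes(std::size_t boxes) {
    const std::size_t bits_per_word = 64;
    const std::size_t col_blocks = (boxes + bits_per_word - 1) / bits_per_word;
    return boxes * col_blocks * sizeof(std::uint64_t);
}
```

A function-local static GrowOnlyBuffer would play the role of the static pointers: after the first few calls, reserve() returns the same memory without touching the allocator, so nothing is released for another process to grab. For the "allocate the maximum" variant, calling reserve(nms_mask_bytes(max_boxes)) once at startup is enough. If the pre-NMS box limit is 12000 (which, as far as I recall, is the training default in config.py), the helper gives 18,048,000 bytes, which is in line with the roughly 18MB swings seen in nvidia-smi.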
Tried with: CUDA 7.5 and Ubuntu 14.04