Out of memory error with py-faster-rcnn

Problem

When training a Faster-RCNN model using py-faster-rcnn, I noticed that it would randomly crash with this error:

out of memory
invalid argument
an illegal memory access was encountered
F0919 18:01:51.657281 21310 math_functions.cu:81] Check failed: error == cudaSuccess (77 vs. 0)  an illegal memory access was encountered
*** Check failure stack trace: ***

Investigation

On closer investigation, I found that a training process would crash when I started a second or third training on the same GPU. How can starting a second training kill the first one? The only scenario I could think of, based on the above error message, was that a cudaMalloc call to get more memory was failing. But why was Caffe trying to get more memory in the middle of training? Shouldn't all the reshaping be finished at the beginning of training?

Anyway, the first problem was to reproduce the error reliably, since starting a second training did not always cause a crash. It only crashed once in a few attempts. Since I suspected cudaMalloc, I wrote a small adversary CUDA program that would try to grab as much GPU memory as possible. I ran this program a while after training had started and it reliably crashed the training every time!

A core dump file was being generated on crash, but it was useless at first since I was running Caffe compiled in release mode with no debugging symbols. I recompiled Caffe in debug mode with debugging symbols, reproduced the crash and opened the resulting core dump in GDB:

$ gdb /usr/bin/python core

After the core dump was loaded in GDB, I got its backtrace using bt. It was interesting, but did not point to anything suspicious.

Next, I continuously monitored the GPU memory occupied by the training, using watch and nvidia-smi:

$ watch -n 0.1 nvidia-smi

I noticed that the GPU memory used by the training kept going up and down by around 18MB all the time. If my adversary CUDA program grabbed the 18MB that had just been released, then the training would crash the next time it tried to allocate that same memory.

So, who is allocating and releasing memory all the time in py-faster-rcnn? Since I had ported the proposal layer recently from Python to C++, I remembered NMS. There is both a CPU and GPU version of NMS in py-faster-rcnn. The GPU NMS is used by default, though this can be changed in config.py. By switching it to CPU, I found that the crash no longer happened.

But the problem is that CPU NMS at 0.2s was 10 times slower than the GPU NMS at 0.02s for my setup.

Solution

Once I saw that the GPU NMS code in lib/nms/nms_kernel.cu was calling cudaMalloc and cudaFree continuously, fixing it was easy. I changed the allocated memory pointers to static and changed the code to hold on to the memory allocated last time. Only when more memory is required is the old allocation freed and a new, larger one allocated. I basically used the growth strategy commonly used by std::vector implementations, that is, doubling the capacity. A better solution would be to allocate the maximum required memory up front, based on the box numbers set in config.py, and use it throughout training.
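
The hold-and-grow strategy described above can be sketched in a few lines. This is only an illustration in Python, not the actual change to nms_kernel.cu: the class name GrowOnlyBuffer and its members are hypothetical, and a bytearray stands in for the device memory that cudaMalloc and cudaFree would manage.

```python
class GrowOnlyBuffer:
    """Keeps one allocation alive and reallocates only when it is too small.

    Hypothetical illustration: a bytearray stands in for GPU memory.
    """

    def __init__(self):
        self.capacity = 0
        self.data = None
        self.alloc_count = 0  # number of real (re)allocations performed

    def reserve(self, needed):
        # Reuse the existing allocation whenever it is large enough.
        if needed <= self.capacity:
            return self.data
        # Otherwise grow to at least double the old capacity.
        new_capacity = max(needed, 2 * self.capacity)
        # In the CUDA code this is where the old block would be freed
        # and a larger one allocated.
        self.data = bytearray(new_capacity)
        self.capacity = new_capacity
        self.alloc_count += 1
        return self.data


buf = GrowOnlyBuffer()
for n in [1000, 800, 1200, 1100, 1900, 2000]:
    buf.reserve(n)
print(buf.alloc_count, buf.capacity)  # prints: 2 2000
```

Six requests result in only two real allocations, so there are far fewer moments at which another process can grab the memory in between.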

Tried with: CUDA 7.5 and Ubuntu 14.04

Kubuntu install stuck with unmet dependencies

Problem

I tried to install Kubuntu on an existing Ubuntu system using this command:

$ sudo apt install kubuntu-desktop

And I got this package dependency error:

You might want to run 'apt-get -f install' to correct these:
The following packages have unmet dependencies:
 kde-telepathy-minimal : Depends: kde-config-telepathy-accounts (>= 15.04.0) but it is not going to be installed
 unity-scope-gdrive : Depends: account-plugin-google but it is not going to be installed
E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).

However, running sudo apt-get -f install would stop with the same dependency problem.

Solution

The key here is to realize that apt itself cannot resolve this cyclic dependency. So, to fix it, we need to use a lower-level tool to explicitly take out the offending packages. We can do that by using dpkg:

$ sudo dpkg --purge unity-scope-gdrive
$ sudo dpkg --purge account-plugin-google
$ sudo apt-get -f install

Tried with: Ubuntu 15.10

SSH unprotected private key file error

Problem

I tried to SSH to a server using a private key file and got this error:

$ ssh -i myprivate.key 10.0.0.100
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0664 for '/home/joe/myprivate.key' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /home/joe/myprivate.key

Solution

Mode 0664 means that the key file was readable and writable by me and my group, and readable by everyone else. SSH refuses to use a private key file that other users can access, since it could be compromised. I reduced the access permissions to read-only for me:

$ chmod 0400 myprivate.key

SSH worked after this change 🙂
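
The same check and fix can be scripted. The following is a minimal sketch, assuming that "too open" means any permission bit is set for group or others, which matches the spirit of the SSH warning but is not taken from the SSH source. The function names is_too_open and restrict_to_owner_read are hypothetical.

```python
import os
import stat


def is_too_open(path):
    """Return True if group or others have any permission on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def restrict_to_owner_read(path):
    """Same effect as: chmod 0400 path"""
    os.chmod(path, stat.S_IRUSR)
```

For example, calling is_too_open("myprivate.key") on a file with mode 0664 returns True, and it returns False once restrict_to_owner_read has been applied.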

Tried with: SSH 6.6 and Ubuntu 14.04

Missing argument to exec

Problem

I ran a find command in the Fish shell and got this error:

$ find . -type f -exec sed -i 's/foo/bar/g' {} +
find: missing argument to `-exec`

Solution

There is nothing wrong with this command; it works under Bash. The problem is that Fish expands curly braces by default, so find never receives the {} argument. To make it work, enclose the curly braces in single quotes, that is, write '{}' instead of {}, so that Fish does not expand them. More details here.
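
When the same command is launched from a script, another way to sidestep shell expansion is to pass the arguments as a list, so that no shell parses the braces at all. This is a minimal sketch, assuming GNU find and GNU sed are on the PATH. The helper name replace_in_tree is hypothetical.

```python
import subprocess
from pathlib import Path


def replace_in_tree(root, old, new):
    """Replace old with new in every regular file under root.

    The command is passed as an argument list, so no shell is involved
    and "{}" reaches find unchanged. Note that old and new are inserted
    into a sed expression as-is, so they must not contain "/" or other
    characters that are special to sed.
    """
    cmd = [
        "find", str(Path(root)), "-type", "f",
        "-exec", "sed", "-i", f"s/{old}/{new}/g", "{}", "+",
    ]
    subprocess.run(cmd, check=True)
```

Calling replace_in_tree(".", "foo", "bar") runs the same find command as above, independent of which interactive shell is in use.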

Tried with: Fish 2.2.0 and Ubuntu 14.04

Cannot import name _tkagg

Problem

I had Matplotlib installed on a computer. I tried to set the backend for a plot to TkAgg. Tk was not installed, so I installed the required packages. When I tried to set the backend again, I got this error: ImportError: cannot import name _tkagg

Solution

Check if your Matplotlib is installed using pip. If so, then you need to reinstall Matplotlib so that it picks up links to the python-tk files correctly:

$ sudo pip uninstall matplotlib
$ sudo pip install matplotlib

The plot was displayed correctly after this.

Tried with: Ubuntu 14.04

Matplotlib plot is not displayed in window

Problem

I created a plot using the Matplotlib library in a Python script. But the call to show does not display the plot in a GUI window.

Solution

The rendering of a plot to a file or display is controlled by the backend that is set in Matplotlib. You can check the current backend using:

import matplotlib
matplotlib.get_backend()

The default backend turned out to be Agg. The possible GUI backends on Linux are Qt4Agg, GTKAgg, WXAgg, TkAgg and GTK3Agg. Since Agg is not a GUI backend, nothing was displayed.

I wanted to use the simple Tcl-Tk backend. So, I installed the necessary packages for Python:

$ sudo apt install tcl-dev tk-dev python-tk python3-tk

The backend is not set automatically after this. In my Python script, I set it explicitly:

import matplotlib
matplotlib.rcParams["backend"] = "TkAgg"

The plot was displayed after this change.

However, this needs to be set immediately after the import line of Matplotlib and before importing matplotlib.pyplot. Doing this in the import region of a Python script is quite ugly.

Instead, I prefer to switch the backend through the matplotlib.pyplot module itself:

import matplotlib.pyplot as mplot
mplot.switch_backend("TkAgg")

This too worked fine for me! 🙂

Reference: Matplotlib figures not showing up or displaying

Tried with: Ubuntu 14.04

MATLAB parallel pool error

Problem

I ran a MATLAB script that uses a parallel pool of workers. It failed with this error:

The client lost connection to worker 4. This might be due to network problems, or the interactive communicating job might have errored

The corresponding matlab_crash_dump file had this stack trace starting from mkl.so:

[  0] 0x00007f995128a1c5        /usr/local/MATLAB/R2014a/bin/glnxa64/mkl.so+16769477 mkl_blas_avx_sgemm_mscale+00001253
[  1] 0x00007f995115ba1c        /usr/local/MATLAB/R2014a/bin/glnxa64/mkl.so+15530524 mkl_blas_avx_xsgemm+00000204
[  2] 0x00007f99505a1a5c        /usr/local/MATLAB/R2014a/bin/glnxa64/mkl.so+03234396 mkl_blas_xsgemm+00000316
[  3] 0x00007f9950529720        /usr/local/MATLAB/R2014a/bin/glnxa64/mkl.so+02742048
[  4] 0x00007f9950525d5a        /usr/local/MATLAB/R2014a/bin/glnxa64/mkl.so+02727258 mkl_blas_sgemm+00001386
[  5] 0x00007f99503b8ba5        /usr/local/MATLAB/R2014a/bin/glnxa64/mkl.so+01231781 sgemm+00000377
[  6] 0x00007f991b803154    /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so+01315156 cblas_sgemm+00000372

Solution

To run a parallel pool of workers, we need to use the libmkl_rt.so provided by Intel. I added the directory containing this library, which was /opt/intel/mkl/lib/intel64 in my case, to my LD_LIBRARY_PATH. I also set the BLAS_VERSION environment variable to libmkl_rt.so.

After this I checked if everything worked fine by going to Parallel -> Manage cluster profiles -> Local -> Validation profiles -> Validate. All the parallel test tasks were validated successfully. My script too worked fine after this.

Tried with: MATLAB R2014a and Ubuntu 14.04

Unable to parse command history line in MATLAB

Problem

Every time I start MATLAB, it reports an error in my command history with this message: Unable to parse command history line

Solution

Open the file History.xml in the MATLAB Preferences directory. You can find this directory using the command prefdir. Either find the offending entry in this XML file and delete it or just delete the entire file itself.

Tried with: MATLAB R2014a and Ubuntu 14.04

AC_PROG_LIBTOOL error

Problem

I was trying to build GPerfTools by running its autogen.sh script when I got this error:

$ ./autogen.sh 
configure.ac:154: error: possibly undefined macro: AC_PROG_LIBTOOL
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
autoreconf: /usr/bin/autoconf failed with exit status: 1

Solution

I was mistaken in thinking that AC_PROG_LIBTOOL referred to some generic library tool. It is actually a macro provided by GNU Libtool, a script used to build libraries, which was not installed on my system.

To install it:

$ sudo apt install libtool

The build worked fine after installing this package.

Tried with: Ubuntu 14.04

Menu not visible in Zeal

Problem

I installed Zeal as described here. When I opened Zeal, the main menu was not visible!

Solution

This problem is caused by the integrated menu bar used by Qt5. As described here, removing the appmenu-qt5 package solves it immediately:

$ sudo apt-get remove appmenu-qt5

Tried with: Ubuntu 14.04