Error building Caffe with Python 3 support

Caffe can be built with support for Python 2.x. This allows you to invoke Caffe easily from Python code. However, I wanted to call Caffe from Python 3.x code.

  • I built Boost with Python 3.x support. I could see that libboost_python3 library files were generated.

  • I added this to the normal CMake command that I use to build Caffe: -Dpython_version=3

Sadly, this popped up errors of this type:

libboost_python.so: undefined reference to `PyClass_Type'

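This type of error indicates that the build was linking the Python 2.x Boost library (libboost_python.so) against the Python 3.x libraries. PyClass_Type exists only in Python 2.x, so the Python 3.x library cannot satisfy that reference.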


Loader not finding shared library in same directory

Problem

I had an executable file that ran correctly on one computer. On a different computer with the same setup, the dynamic loader failed to load one of the shared libraries required by this executable, with this error:

error while loading shared libraries: libmatrixio.so: cannot open shared object file: No such file or directory

The surprising part is that the libmatrixio.so shared library file was present in the same directory as the executable on both computers! It executed on one, but failed to find that file on the other!

Solution

It turns out that the dynamic loader ld.so on Linux does not search the current working directory, or the directory of the executable, by default. To make it look in the current directory, add . to LD_LIBRARY_PATH and run the executable from the directory that holds the library. More information can be found in the ld.so manpage under LD_LIBRARY_PATH. After making this change to the shell environment variable, the executable loaded correctly on the other computer.
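
If the executable is launched from a script, the same fix can be applied per process instead of in the shell. This is only a minimal sketch: the function names and the example path are hypothetical, and it adds the directory that contains the executable rather than a literal dot, so it does not depend on the current working directory.

```python
import os
import subprocess


def env_with_exe_dir(exe_path, base_env=None):
    """Return a copy of the environment whose LD_LIBRARY_PATH starts with
    the directory that contains exe_path."""
    env = dict(os.environ if base_env is None else base_env)
    exe_dir = os.path.dirname(os.path.abspath(exe_path))
    old = env.get("LD_LIBRARY_PATH", "")
    # Prepend so that libraries next to the executable are found first.
    # Avoid a trailing separator: an empty entry means "current directory".
    env["LD_LIBRARY_PATH"] = exe_dir + (os.pathsep + old if old else "")
    return env


def run_with_local_libs(exe_path, *args):
    """Launch exe_path so that ld.so also searches the executable's directory."""
    return subprocess.run([exe_path, *args], env=env_with_exe_dir(exe_path))
```

For example, run_with_local_libs("/opt/app/matrixtool") would start a (hypothetical) program with /opt/app at the front of LD_LIBRARY_PATH.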

Tried with: Ubuntu 14.04

fish_pwd error

Problem

I updated the Fish shell using the apt command:

$ sudo apt install fish

After the update, Fish threw up this error:

fish: __fish_pwd
      ^
in command substitution
    called on standard input


in command substitution
    called on standard input

Solution

This turned out to be because the /usr/share/fish/functions/__fish_pwd.fish function file it was looking for could not be found! The apt update had screwed up and had deleted some essential configuration files of Fish.

I removed and reinstalled Fish shell:

$ sudo apt purge fish
$ sudo apt install fish

I noticed that the above function file was correctly installed and available now. Fish worked fine after this 🙂

Tried with: Fish 2.4.0 and Ubuntu 14.04

Selenium error on geckodriver

Problem

I updated Ubuntu packages, which included Firefox, and I updated Selenium using pip3 because it also depends on the Firefox version. But running my existing Python scripts that use Selenium produced this error:

FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver'

Solution

Firefox now provides the geckodriver as a separate binary. You can download the version matching your OS and CPU here. Unzip the file and place the binary anywhere that is in your PATH. Your Python scripts should work now.
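
Before rerunning a Selenium script, it can help to confirm that the binary really is reachable through PATH. A small sketch using only the standard library follows; the function names are my own and not part of Selenium.

```python
import shutil


def find_geckodriver(search_path=None):
    """Return the full path of geckodriver, or None if it is not found.
    search_path defaults to the PATH environment variable."""
    return shutil.which("geckodriver", path=search_path)


def describe_geckodriver(search_path=None):
    """Return a one-line status message about the geckodriver binary."""
    location = find_geckodriver(search_path)
    if location is None:
        return "geckodriver not found: put the binary in a directory listed in PATH"
    return "geckodriver found at " + location
```

Calling print(describe_geckodriver()) at the top of a script gives a clearer message than the FileNotFoundError above.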

Reference

Tried with: Firefox 49, Selenium 3.0.1 and Ubuntu 16.04

Out of memory error with py-faster-rcnn

Problem

When training a Faster-RCNN model using py-faster-rcnn I noticed that it would randomly crash with this error:

out of memory
invalid argument
an illegal memory access was encountered
F0919 18:01:51.657281 21310 math_functions.cu:81] Check failed: error == cudaSuccess (77 vs. 0)  an illegal memory access was encountered
*** Check failure stack trace: ***

Investigation

On closer investigation, I found that a training process would crash when I ran a second or third training on the same GPU. How can running a second training kill the first training? The only scenario I could think of, based on the above error message, was that a cudaMalloc to get more memory was failing. But why was Caffe trying to get more memory in the middle of training? Shouldn't all the reshaping be done at the beginning of training?

Anyway, the first problem was to reliably reproduce the error, since it did not always crash on running a second training. It crashed only once every few attempts in these scenarios. Since I suspected cudaMalloc, I wrote a small adversary CUDA program that would try to grab as much GPU memory as possible. I ran this program a while after training had started and it reliably crashed the training every time!

A core dump file was being generated on crash, but it was useless at first since I was running Caffe compiled in release mode with no debugging symbols. I recompiled Caffe in debug mode with debugging symbols, reproduced the crash and opened the core dump in GDB:

$ gdb /usr/bin/python core

After the core dump was loaded in GDB, I got its backtrace using bt. It was interesting, but did not point to anything suspicious.

I next monitored the GPU memory occupied by the training continuously, using watch and nvidia-smi:

$ watch -n 0.1 nvidia-smi

I noticed that the GPU memory used by training kept going up and down by around 18MB all the time. If my adversary CUDA program went and grabbed the 18MB that was released, then the training would crash when it tried to allocate that same memory the next time.
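
Watching the numbers by eye works, but the swing can also be measured from a script. The sketch below is an assumption-laden variant of what I did: it uses nvidia-smi's --query-gpu=memory.used output, which reports used memory for the whole GPU (in MiB) rather than for one process, and the helper names are hypothetical.

```python
import subprocess

# nvidia-smi query mode: one line per GPU, plain number of MiB in use.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output):
    """Parse the query output into a list of used-memory values, one per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def sample_used_mib(gpu_index=0):
    """Take one reading for the given GPU. Needs an NVIDIA driver to run."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_used_mib(output)[gpu_index]


def swing_mib(samples):
    """Difference between the largest and smallest reading in a series."""
    return max(samples) - min(samples)
```

Collecting sample_used_mib() in a loop with a short time.sleep() between readings and passing the list to swing_mib() gives the size of the repeated allocation.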

So, who is allocating and releasing memory all the time in py-faster-rcnn? Since I had ported the proposal layer recently from Python to C++, I remembered NMS. There is both a CPU and GPU version of NMS in py-faster-rcnn. The GPU NMS is used by default, though this can be changed in config.py. By switching it to CPU, I found that the crash no longer happened.

But the problem is that CPU NMS at 0.2s was 10 times slower than the GPU NMS at 0.02s for my setup.

Solution

Once I saw that the GPU NMS code in lib/nms/nms_kernel.cu was calling cudaMalloc and cudaFree repeatedly, fixing it was easy. I changed the allocated memory pointers to static and changed the code to hold on to the memory allocated last time. Only if more memory was required would the old block be freed and a new, larger one allocated. I basically used the same growth strategy as std::vector, that is, doubling the memory. A better solution would be to allocate the maximum required memory, based on the box numbers set in config.py, and use it during training.
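
The hold-and-grow strategy is easier to see outside CUDA. The following is only an illustration in Python, not the code from nms_kernel.cu: a bytearray stands in for the device memory, and the class and attribute names are made up for this sketch.

```python
class GrowOnlyBuffer:
    """Keeps its storage between calls and only reallocates when a request
    does not fit, at least doubling the capacity each time it grows."""

    def __init__(self):
        self.storage = bytearray()  # stands in for the static device pointer
        self.allocations = 0        # how many times we had to (re)allocate

    def reserve(self, nbytes):
        """Return storage of at least nbytes, reusing the old block if it fits."""
        capacity = len(self.storage)
        if nbytes > capacity:
            new_capacity = max(nbytes, 2 * capacity)
            # The old block is dropped here and a larger one takes its place.
            self.storage = bytearray(new_capacity)
            self.allocations += 1
        return self.storage
```

With requests of 100, 80, 150 and 200 bytes, this allocates only twice (100 bytes, then 200 bytes), whereas allocating and freeing per call would allocate four times.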

Tried with: CUDA 7.5 and Ubuntu 14.04

Kubuntu install stuck with unmet dependencies

Problem

I tried to install Kubuntu on an existing Ubuntu system using this command:

$ sudo apt install kubuntu-desktop

And I got this package dependency error:

You might want to run 'apt-get -f install' to correct these:
The following packages have unmet dependencies:
 kde-telepathy-minimal : Depends: kde-config-telepathy-accounts (>= 15.04.0) but it is not going to be installed
 unity-scope-gdrive : Depends: account-plugin-google but it is not going to be installed
E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).

However, running sudo apt-get -f install would stop with the same dependency problem.

Solution

The key here is to realize that apt itself cannot resolve this cyclic dependency. So, to fix it we need to use a lower-level tool to explicitly take out the offending package. We can do that by using dpkg:

$ sudo dpkg --purge unity-scope-gdrive
$ sudo dpkg --purge account-plugin-google
$ sudo apt-get -f install

Tried with: Ubuntu 15.10

SSH unprotected private key file error

Problem

I tried to SSH to a server using a private key file and got this error:

$ ssh -i myprivate.key 10.0.0.100
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0664 for '/home/joe/myprivate.key' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /home/joe/myprivate.key

Solution

As the error message shows, this key file had permissions 0664: read and write for me and my group, and read for everyone else. SSH was complaining that such a file is too open and could be compromised. I reduced the access permissions to just read by me:

$ chmod 0400 myprivate.key

SSH worked after this change 🙂
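
The same check and fix can be done from Python, for example in a deployment script. This is a sketch with hypothetical function names; the test for "too open" (any permission bit set for group or others) is my understanding of what SSH objects to, not something taken from its documentation.

```python
import os
import stat


def is_too_open(path):
    """True if group or others have any permission bit set on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & 0o077) != 0


def restrict_to_owner_read(path):
    """Equivalent of `chmod 0400 path`: owner may read, nobody else anything."""
    os.chmod(path, stat.S_IRUSR)
```

A file with mode 0664, as in the error above, is reported as too open; after restrict_to_owner_read() it is not.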

Tried with: SSH 6.6 and Ubuntu 14.04

Missing argument to exec

Problem

I ran a find command in the Fish shell and got this error:

$ find . -type f -exec sed -i 's/foo/bar/g' {} +
find: missing argument to `-exec`

Solution

There is nothing wrong with this command; it works under Bash. Fish, however, expands the curly braces by default, so find never sees the {} placeholder. For this to work, enclose the curly braces in single quotes ('{}') so that they are not expanded. More details here.
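
The point is that find must receive the two characters {} unchanged. One way to see this is to run the command without any shell, where no expansion can happen. A small sketch, with function names of my own choosing:

```python
import subprocess


def find_sed_argv(root=".", expr="s/foo/bar/g"):
    """Build the argument list for the find command from this post.
    "{}" and "+" are ordinary strings here, so find receives them unchanged."""
    return ["find", root, "-type", "f", "-exec", "sed", "-i", expr, "{}", "+"]


def replace_in_tree(root=".", expr="s/foo/bar/g"):
    """Run the command without a shell. Like the original, this edits files in place."""
    return subprocess.run(find_sed_argv(root, expr), check=True)
```

Because the arguments are passed as a list, neither Fish nor Bash gets a chance to rewrite the braces.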

Tried with: Fish 2.2.0 and Ubuntu 14.04

Cannot import name _tkagg

Problem

I had Matplotlib installed on a computer. I tried to set the backend for a plot using TkAgg. There was no Tk installed, so I installed the required packages. When I tried to set backend again, I got this error: ImportError: cannot import name _tkagg

Solution

Check if your Matplotlib is installed using pip. If so, then you need to reinstall Matplotlib so that it picks up links to the python-tk files correctly:

$ sudo pip uninstall matplotlib
$ sudo pip install matplotlib

The plot was displayed correctly after this.

Tried with: Ubuntu 14.04

Matplotlib plot is not displayed in window

Problem

I created a plot using the Matplotlib library in a Python script. But the call to show does not display the plot in a GUI window.

Solution

The rendering of a plot to a file or display is controlled by the backend that is set in Matplotlib. You can check the current backend using:

import matplotlib
matplotlib.get_backend()

I got the default backend as Agg. The possible values for GUI backends on Linux are Qt4Agg, GTKAgg, WXAgg, TkAgg and GTK3Agg. Since Agg is not a GUI backend, nothing was displayed.

I wanted to use the simple Tcl-Tk backend. So, I installed the necessary packages for Python:

$ sudo apt install tcl-dev tk-dev python-tk python3-tk

The backend is not set automatically after this. In my Python script, I set it explicitly:

import matplotlib
matplotlib.rcParams["backend"] = "TkAgg"

The plot was displayed after this change.

However, this needs to be set immediately after the import line of Matplotlib and before importing matplotlib.pyplot. Doing this in the import region of a Python script is quite ugly.

Instead, I like to switch the backend through the matplotlib.pyplot module itself:

import matplotlib.pyplot as mplot
mplot.switch_backend("TkAgg")

This too worked fine for me! 🙂
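
If the same script also has to run on machines without Tk or without a display, the backend name can be chosen at runtime. This is a rough sketch under a few assumptions: Python 3 (where the module is named tkinter), an X11 session that sets DISPLAY, and helper names that are mine rather than Matplotlib's.

```python
import os


def tk_available():
    """True if the tkinter module (from python3-tk) can be imported."""
    try:
        import tkinter  # noqa: F401
    except ImportError:
        return False
    return True


def pick_backend(environ=None):
    """Return "TkAgg" when a GUI looks usable, otherwise the non-GUI "Agg"."""
    environ = os.environ if environ is None else environ
    has_display = bool(environ.get("DISPLAY"))  # X11 assumption
    return "TkAgg" if has_display and tk_available() else "Agg"
```

The result can be passed straight to the call shown above, as in mplot.switch_backend(pick_backend()).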

Reference: Matplotlib figures not showing up or displaying

Tried with: Ubuntu 14.04