Stub library warning on libnvidia-ml.so

Problem

I tried to run a program compiled with CUDA 9.0 inside a Docker container and got this error:

WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).

Solution

Let us first try to understand the error and where it is coming from. The program compiled with CUDA 9.0 has been linked to libnvidia-ml.so. This is the shared library file of the NVIDIA Management Library (NVML). During execution, libnvidia-ml.so is throwing this error. Why?

From the error message, we get an indication that there are two libnvidia-ml.so files. One is a stub that is used during compilation and linking. I guess it just provides the necessary function symbols and signatures. But that library cannot be used to execute the compiled executable. If we do try to execute with that stub shared library file, it will throw this warning.

So, there is a second libnvidia-ml.so, the real shared library file. It turns out that the management library is provided by the NVIDIA display driver. So, every version of the display driver has its own libnvidia-ml.so file. I had NVIDIA display driver 384.66 on my machine and I found libnvidia-ml.so under /usr/lib/nvidia-384. The stub library file allows you to compile on machines where the NVIDIA display driver is not installed. In our case, for some reason, the loader is picking up the stub instead of the real library file during execution.
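
One way to see which file the loader actually resolves is to filter the output of ldd. The following is a minimal sketch and not part of the original investigation: the helper name nvml_path_from_ldd and the binary name ./some_binary are placeholders chosen for illustration, and the filter assumes ldd's usual "name => path (address)" line layout.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the path that
# libnvidia-ml.so resolves to. It relies on ldd's usual line layout:
#   libname => /resolved/path (load address)
nvml_path_from_ldd() {
  awk '/libnvidia-ml\.so/ && $2 == "=>" { print $3 }'
}

# Intended use (./some_binary is a placeholder for the CUDA program):
#   ldd ./some_binary | nvml_path_from_ldd
```

If the printed path lies under a stubs directory, the loader is using the stub rather than the library installed by the driver.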

By using the chrpath tool, described here, I found that the compiled binary did indeed have the stub library directory in its path: /usr/local/cuda/lib64/stubs. That directory did have a libnvidia-ml.so. Running the strings tool on that shared library confirmed that it was the origin of the above message:

$ strings libnvidia-ml.so | grep "You should always run with"
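
The embedded search path can also be cross-checked with readelf -d, which prints the dynamic section of an ELF file. The sketch below assumes the usual binutils output layout, in which the entry type appears in parentheses and the path list in square brackets. The helper name rpath_entries is made up for this example.

```shell
# Hypothetical helper: read `readelf -d` output on stdin and print one line
# per directory found in an RPATH or RUNPATH entry, e.g. "RPATH: /some/dir".
rpath_entries() {
  awk '
    match($0, /\((RPATH|RUNPATH)\)/) {
      # Text between the parentheses is the entry type.
      kind = substr($0, RSTART + 1, RLENGTH - 2)
      if (match($0, /\[.*\]/)) {
        # Text between the brackets is the colon-separated directory list.
        n = split(substr($0, RSTART + 1, RLENGTH - 2), dirs, ":")
        for (i = 1; i <= n; i++) print kind ": " dirs[i]
      }
    }'
}

# Intended use (./some_binary is a placeholder):
#   readelf -d ./some_binary | rpath_entries
```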

Since the binary has an RPATH, described here, that contains the stubs path, the stub library was getting picked up in preference to the actual libnvidia-ml.so installed by the display driver. The solution I came up with for this problem was to add a command to the docker run invocation to delete the stubs directory:

$ rm -rf /usr/local/cuda/lib64/stubs

That way, it was still available outside Docker for compilation. It would just appear deleted inside the Docker container, thus forcing the loader to pick up the real libnvidia-ml.so during execution.
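
The same idea can be packaged as a small wrapper that runs inside the container before the program starts. This is only a sketch under assumptions: the function name run_without_stubs and the STUBS_DIR override are hypothetical, added so the snippet can be tried on a scratch directory, and the default path is the one from this post.

```shell
# Hypothetical wrapper: delete the stubs directory, then run the given command.
# STUBS_DIR is an illustrative override; by default the CUDA stubs path is used.
run_without_stubs() {
  stubs_dir="${STUBS_DIR:-/usr/local/cuda/lib64/stubs}"
  # Delete the stubs directory only if it exists.
  if [ -d "$stubs_dir" ]; then
    rm -rf "$stubs_dir"
  fi
  # Run the command passed as arguments.
  "$@"
}

# Intended use inside the container (./my_program is a placeholder):
#   run_without_stubs ./my_program
```

Without a wrapper, an equivalent command to pass to docker run would be sh -c 'rm -rf /usr/local/cuda/lib64/stubs && ./my_program', where ./my_program again stands in for the real executable.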

How to change RPATH or RUNPATH of executable

RPATH or RUNPATH is a colon-separated list of directories embedded in an executable. This list of directories plays an important role in determining shared library file locations when the executable is loaded for running. This process is described in this post. Note that RPATH has a higher priority in the shared library search than RUNPATH: RPATH directories are searched before LD_LIBRARY_PATH, whereas RUNPATH directories are searched after it. We can change the RPATH or RUNPATH of a binary file by using the chrpath tool.

  • Installing this tool is easy:
$ sudo apt install chrpath
  • To check whether the binary has an RPATH or RUNPATH and to list its colon-separated directories:
$ chrpath ./some_binary
  • To remove RPATH or RUNPATH from the binary:
$ chrpath -d ./some_binary
  • To convert RPATH of a binary to a RUNPATH:
$ chrpath -c ./some_binary

Note that you cannot convert a RUNPATH back to RPATH.

  • To replace RPATH or RUNPATH paths with a different set of paths:
$ chrpath -r /home/joe:/home/foobar/lib64 ./some_binary

Note that the new path string must be no longer than the path string that was previously stored in the binary.
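
The length rule can be checked before calling chrpath -r. The helper below is a sketch: the name rpath_fits is invented here, and it simply compares string lengths, which is the constraint stated above.

```shell
# Hypothetical helper: succeed if the new path string is no longer than the
# old one, so that it can be written over the old string in the binary.
rpath_fits() {
  old_paths=$1
  new_paths=$2
  [ "${#new_paths}" -le "${#old_paths}" ]
}

# Example with placeholder paths and binary name:
#   if rpath_fits "/usr/local/cuda/lib64/stubs" "/usr/lib/nvidia-384"; then
#     chrpath -r "/usr/lib/nvidia-384" ./some_binary
#   fi
```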

Tried with: chrpath 0.14 and Ubuntu 16.04