I tried to run a program compiled with CUDA 9.0 inside a Docker container and got this error:
WARNING: You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in GDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
Let us first try to understand the error and where it is coming from. The program compiled with CUDA 9.0 has been linked to libnvidia-ml.so. This is the shared library file of the NVIDIA Management Library (NVML). During execution, libnvidia-ml.so is throwing this error. Why?
From the error message, we get an indication that there are two libnvidia-ml.so files. One is a stub that is used during compilation and linking. I guess it just provides the necessary function symbols and signatures. But that library cannot be used to execute the compiled executable. If we do try to execute with that stub shared library file, it will throw this warning.
So, there is a second libnvidia-ml.so, the real shared library file. It turns out that the management library is provided by the NVIDIA display driver. So, every version of display driver will have its own libnvidia-ml.so file. I had NVIDIA display driver 384.66 on my machine and I found libnvidia-ml.so under
/usr/lib/nvidia-384. The stub library file allows you to compile on machines where the NVIDIA display driver is not installed. In our case, for some reason, the loader is picking up the stub instead of the real library file during execution.
By using the chrpath tool, described here, I found that the compiled binary did indeed have the stub library directory in its path:
/usr/local/cuda/lib64/stubs. That directory did have a libnvidia-ml.so. Using the strings tool on that shared library, confirmed that it was the origin of the above message:
$ strings libnividia-ml.so | grep "You should always run with"
Since the binary has an RPATH, described here, with the stubs path, the stub library was getting picked up with high preference over the actual libnvidia-ml.so, which was present. The solution I came up with for this problem was to add a command to the docker run invocation to delete the stubs directory:
$ rm -rf /usr/local/cuda/lib64/stubs
That way, it was still available outside Docker for compilation. It would just appeared deleted inside the Docker container, thus forcing the loader to pick up the real libnvidia-ml.so during execution.