nvidia-smi cheatsheet

nvidia-smi (NVIDIA System Management Interface) is a tool to query, monitor and configure NVIDIA GPUs. It is installed along with the NVIDIA driver and is tied to that specific driver version. It is written using the NVIDIA Management Library (NVML).

  • To query the usage of all your GPUs:
$ nvidia-smi

I use this default invocation to check:

  • Version of the driver.
  • Names of the GPUs.
  • Index of each GPU, based on PCI bus order. This can differ from the CUDA device order.
  • Amount of memory each GPU has.
  • Whether persistence mode is enabled on each GPU.
  • Utilization of each GPU (if I’m running something on them).
  • List of processes executing on the GPUs.
  • To query the configuration parameters of all the GPUs:
$ nvidia-smi -q

I use this to check:

  • Default clocks (listed under Default Application Clocks).
  • Current clocks (listed under Application Clocks).
  • To query the configuration parameters of a particular GPU, use its index:
$ nvidia-smi -q -i 0
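
The -q output is quite long. Two variations I find handy: the -d option restricts -q to a single section (CLOCK, MEMORY and so on), and --query-gpu prints selected fields as CSV, which is convenient for scripts. The fields below are just the ones I use most often; nvidia-smi --help-query-gpu lists all of them:

$ nvidia-smi -q -d CLOCK -i 0
$ nvidia-smi --query-gpu=index,name,memory.total,memory.used,utilization.gpu --format=csv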

Stub library warning on libnvidia-ml.so

Problem

I tried to run a program compiled with CUDA 9.0 inside a Docker container and got this error:

WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).

Solution

Let us first try to understand the warning and where it is coming from. The program compiled with CUDA 9.0 has been linked against libnvidia-ml.so, the shared library of the NVIDIA Management Library (NVML). During execution, libnvidia-ml.so is throwing this warning. Why?

From the warning message, we get a hint that there are two libnvidia-ml.so files. One is a stub that is used during compilation and linking; I guess it just provides the necessary function symbols and signatures. That stub cannot be used when actually running the compiled executable. If we do run with the stub shared library, it throws this warning.

So, there is a second libnvidia-ml.so, the real shared library file. It turns out that the management library is provided by the NVIDIA display driver, so every version of the display driver has its own libnvidia-ml.so file. I had NVIDIA display driver 384.66 on my machine and found libnvidia-ml.so under /usr/lib/nvidia-384. The stub library exists so that you can compile on machines where the NVIDIA display driver is not installed. In our case, for some reason, the loader was picking up the stub instead of the real library file during execution.
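
To check which libnvidia-ml.so the loader actually resolves, ldd on the binary and ldconfig are handy. Here ./my_program is a placeholder for the actual executable:

$ ldd ./my_program | grep libnvidia-ml
$ ldconfig -p | grep libnvidia-ml    # libraries known to the loader cache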

Using the chrpath tool, described here, I found that the compiled binary did indeed have the stub library directory in its RPATH: /usr/local/cuda/lib64/stubs. That directory did have a libnvidia-ml.so. Running the strings tool on that shared library confirmed that it was the origin of the above message:

$ strings libnvidia-ml.so | grep "You should always run with"
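
For reference, listing the RPATH of the binary with chrpath is just this, with ./my_program again standing in for the actual binary:

$ chrpath -l ./my_program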

Since the binary has an RPATH, described here, containing the stubs path, the stub library was getting picked up in preference to the actual libnvidia-ml.so in /usr/lib/nvidia-384. The solution I came up with was to add a command to the docker run invocation to delete the stubs directory:

$ rm -rf /usr/local/cuda/lib64/stubs
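
Wired into the docker run invocation, it looks roughly like this, with the image and binary names being placeholders:

$ docker run my-cuda-image bash -c "rm -rf /usr/local/cuda/lib64/stubs && ./my_program"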

That way, the stubs directory was still available outside Docker for compilation. It just appeared deleted inside the Docker container, forcing the loader to pick up the real libnvidia-ml.so during execution.