nvidia-smi cheatsheet

nvidia-smi (NVIDIA System Management Interface) is a tool to query, monitor and configure NVIDIA GPUs. It is installed along with the NVIDIA driver and is tied to that specific driver version. It is written using the NVIDIA Management Library (NVML).

  • To query the usage of all your GPUs:
$ nvidia-smi

I use this default invocation to check:

  • Version of the driver.
  • Names of the GPUs.
  • Index of the GPUs, based on PCI Bus Order. This is different from the CUDA order.
  • Amount of memory each of the GPUs has.
  • Whether persistence mode is enabled on each of the GPUs.
  • Utilization of each of the GPUs (if I’m running something on them).
  • List of processes executing on the GPUs.
  • To query the configuration parameters of all the GPUs:
$ nvidia-smi -q

I use this to check:

  • Default clocks (listed under Default Application Clocks).
  • Current clocks (listed under Application Clocks).
  • To query the configuration parameters of a particular GPU, use its index:
$ nvidia-smi -q -i 0
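
The -q report is long; if you only want a particular section of it, such as the clock information above, the -d option can narrow the output:
$ nvidia-smi -q -d CLOCK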

How to view GPU topology

The NVIDIA System Management Interface tool (nvidia-smi), installed as part of the NVIDIA display driver, is the easiest way to explore the GPU topology on your system. GPU topology describes how the GPUs in the system are connected to each other, to the CPUs and to other devices. Knowing the topology is important to understand how data is copied between GPUs, or between a GPU and the CPU or another device.

  • To view the available commands related to GPU topology:
$ nvidia-smi topo -h
  • To view the connection matrix between the GPUs and the CPUs they are close to (CPU affinities):
$ nvidia-smi topo -m

Some examples of GPU topologies can be seen here.
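
The CPU affinity shown by this matrix tells you which CPU cores are local to each GPU. As a rough sketch of how to use that information (the core range and program name below are hypothetical and will differ on your machine), you can pin a process to a GPU's local cores with taskset:
$ taskset -c 0-11 ./my_gpu_program  # hypothetical core range and program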

How to make CUDA and nvidia-smi use same GPU ID

One of the irritating problems I encounter while working with CUDA programs is the GPU ID. This is the integer identifier assigned to each GPU in the system. It is just 0 if you have a single GPU in the computer. But on a system with multiple GPUs, the GPU ID used by CUDA and the GPU ID used by non-CUDA programs like nvidia-smi can be different! CUDA tries to assign the lowest ID to the fastest GPU, while non-CUDA tools assign IDs based on the PCI Bus ID of the GPUs.
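
To see the IDs that nvidia-smi assigns, along with each GPU's name and UUID, you can list the GPUs:
$ nvidia-smi -L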

One solution I was using earlier was cuda-smi, which shows GPU information using the CUDA GPU IDs.

There is a better solution: requesting CUDA to use the same PCI Bus ID enumeration order as non-CUDA programs. To do this, set the CUDA_DEVICE_ORDER environment variable to PCI_BUS_ID in your shell. The default value of this variable is FASTEST_FIRST. More info on this can be found here. Note that this is available only in CUDA 7 and later.
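
For example, in a bash-like shell (my_cuda_program below is just a placeholder for whatever CUDA binary you run):
$ export CUDA_DEVICE_ORDER=PCI_BUS_ID
$ ./my_cuda_program  # placeholder program; CUDA device 0 now matches GPU 0 in nvidia-smi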

Reference: CUDA Environment Variables

watch

The greatest power of Unix-like systems comes from their myriad tools that are built for composability. watch is one such excellent tool: it repeatedly runs a given program and refreshes the console with its latest output, so you can monitor it periodically. Some programs, like top, have a built-in capability to rewrite the console and show progress with new data. With watch, you can bring this ability to bear on any tool that writes to the console!

  • This tool should be available on most systems. If not, installing it is easy:
$ sudo apt install procps
  • Example usage: nvidia-smi prints the memory, temperature and frequency of the GPUs on a machine. To watch this usage in real time, updated once every 0.1 seconds:
$ watch -n 0.1 nvidia-smi
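
watch can also highlight what changed between refreshes with its -d (differences) option, which makes it easy to spot GPU utilization or memory numbers that are moving:
$ watch -n 1 -d nvidia-smi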

Tried with: ProcPs 3.3.9 and Ubuntu 14.04