Code Yarns 👨‍💻

nvidia-smi cheatsheet

2019-Apr-26 · Ashwin Nanjappa · cheatsheet, nvidia-smi

nvidia-smi (NVIDIA System Management Interface) is a tool to query, monitor, and configure NVIDIA GPUs. It is installed along with the NVIDIA driver and is tied to that specific driver version. It is written using the NVIDIA Management Library (NVML).

Query status of GPUs

$ nvidia-smi

This outputs a summary table showing, for each GPU, the driver version, GPU name, fan speed, temperature, power draw and power limit, memory usage, utilization, and the processes running on it.

Query parameters of GPUs

$ nvidia-smi -q

This prints detailed parameters for every GPU, including the product name, VBIOS version, ECC mode, temperature, power readings, and clocks.

To query the parameters of a particular GPU, use its index:

$ nvidia-smi -q -i 9

Query utilization and clocks

$ nvidia-smi -q -d UTILIZATION,CLOCK

Query supported clock values

To list the supported pairs of memory and graphics clock values:

$ nvidia-smi -q -d SUPPORTED_CLOCKS
$ nvidia-smi -q -d SUPPORTED_CLOCKS -i 3

Typically, there are only a few supported memory clock values, while the number of supported graphics clock values is high with a fine granularity.
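As a sketch of how that output might be post-processed, the following groups supported graphics clocks under each memory clock. The abbreviated sample text and its "Memory : N MHz" / "Graphics : N MHz" field labels are assumptions modeled on the -q text layout, not captured output:

```python
# Hypothetical, abbreviated SUPPORTED_CLOCKS dump (layout is an assumption).
clocks_sample = """\
Supported Clocks
    Memory                    : 877 MHz
        Graphics              : 1380 MHz
        Graphics              : 1372 MHz
    Memory                    : 810 MHz
        Graphics              : 1230 MHz
"""

def parse_supported_clocks(text):
    """Map each supported memory clock (MHz) to its supported graphics clocks."""
    pairs = {}
    current_mem = None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "Memory":
            current_mem = int(value.split()[0])
            pairs[current_mem] = []
        elif key == "Graphics" and current_mem is not None:
            pairs[current_mem].append(int(value.split()[0]))
    return pairs

print(parse_supported_clocks(clocks_sample))  # {877: [1380, 1372], 810: [1230]}
```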

Query current properties of GPUs

$ nvidia-smi --query-gpu=property1,property2,property3 --format=csv

To list all the properties that can be queried:

$ nvidia-smi --help-query-gpu

You should be able to find pretty much every detail of your GPU categorized under one of these properties.

To query repeatedly at a fixed interval (here every 100 ms):

$ nvidia-smi --query-gpu=property1,property2 --format=csv --loop-ms=100

For example, querying the PCI device ID:
$ nvidia-smi --query-gpu=pci.device_id --format=csv
pci.device_id
0x1D8110DE

In the above output, 0x1D81 is the device ID and 0x10DE is the NVIDIA vendor ID.
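That split can be done mechanically. A minimal Python sketch, using the sample value from the output above:

```python
def split_pci_device_id(raw):
    """Split nvidia-smi's pci.device_id value into (device ID, vendor ID).

    The 32-bit value is the device ID in the high 16 bits and the vendor ID
    in the low 16 bits, e.g. 0x1D8110DE -> device 0x1D81, vendor 0x10DE (NVIDIA).
    """
    value = int(raw, 16)
    return hex(value >> 16), hex(value & 0xFFFF)

device, vendor = split_pci_device_id("0x1D8110DE")
print(device, vendor)  # 0x1d81 0x10de
```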

Set power limit of GPU

To set the power limit of a GPU (in watts):

$ sudo nvidia-smi -i 0 -pl 250

If you try to set an invalid power limit, the command complains and does not apply it. This command also seems to disable persistence mode, so you will need to enable that again.
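A requested limit can also be validated up front against the bounds the GPU reports. A small sketch, assuming the bounds were obtained by querying the power.min_limit and power.max_limit properties listed below:

```python
def validate_power_limit(requested, min_limit, max_limit):
    """Return True if `requested` watts lies within the settable range
    reported by power.min_limit / power.max_limit for this GPU."""
    return min_limit <= requested <= max_limit

# e.g. a hypothetical GPU reporting min 100 W and max 300 W:
print(validate_power_limit(250, 100, 300))  # True
print(validate_power_limit(400, 100, 300))  # False
```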

Some properties I find useful with --query-gpu:

gpu_name: Name of GPU
vbios_version: Version of VBIOS

# Power
power.min_limit: Minimum power limit that can be set for GPU (watts)
power.max_limit: Maximum power limit that can be set for GPU (watts)
power.draw: Power being consumed by GPU at this moment (watts)

# Temperature
temperature.gpu: GPU temperature (C)
temperature.memory: HBM memory temperature (C)

# Current clock values
clocks.current.sm
clocks.current.memory
clocks.current.graphics
clocks.current.video

utilization.gpu: Percent of sampling interval time that GPU was being used.
utilization.memory: Percent of sampling interval time that device memory was being used.

clocks_throttle_reasons.sw_power_cap: Clock is below requested clock because GPU is using too much power.
clocks_throttle_reasons.sw_thermal_slowdown: Clock is below requested clock because GPU temperature is too high. (Indicates better cooling needed.)
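These two reasons can be turned into a quick diagnosis. A sketch, assuming the reasons are queried via --query-gpu and that each reported value reads "Active" or "Not Active" (the exact strings are an assumption):

```python
def diagnose_throttling(sw_power_cap, sw_thermal_slowdown):
    """Interpret the two software throttle-reason flags described above.
    Each argument is assumed to be "Active" or "Not Active"."""
    reasons = []
    if sw_power_cap == "Active":
        reasons.append("power capped: GPU is using too much power")
    if sw_thermal_slowdown == "Active":
        reasons.append("thermal slowdown: better cooling needed")
    return reasons or ["not throttled"]

print(diagnose_throttling("Not Active", "Active"))
# ['thermal slowdown: better cooling needed']
```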

Display GPU statistics in scrolling format

$ nvidia-smi dmon

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    27    47    44     0     0     0     0   850   135
    1    25    33     -     0     1     0     0   405   300
    0    27    47    44     0     0     0     0   850   135
    1    25    33     -     0     1     0     0   405   300
$ nvidia-smi dmon -s pc

# gpu   pwr gtemp mtemp  mclk  pclk
# Idx     W     C     C   MHz   MHz
    0    27    46    44   850   135
    1    25    33     -   405   300
    0    27    47    44   850   135
    1    30    33     -   405   300
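The scrolling output is whitespace-delimited and easy to post-process, for example when logging with -f. A sketch that parses dmon -s pc lines like the sample above into records:

```python
# Sample lines in the format of `nvidia-smi dmon -s pc` shown above.
dmon_sample = """\
# gpu   pwr gtemp mtemp  mclk  pclk
# Idx     W     C     C   MHz   MHz
    0    27    46    44   850   135
    1    25    33     -   405   300
"""

def parse_dmon(text):
    """Parse dmon -s pc scrolling output, skipping '#'-prefixed header lines."""
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        gpu, pwr, gtemp, mtemp, mclk, pclk = line.split()
        rows.append({
            "gpu": int(gpu),
            "pwr_w": int(pwr),
            "gtemp_c": int(gtemp),
            # '-' means no memory temperature sensor (e.g. no HBM) on that GPU
            "mtemp_c": None if mtemp == "-" else int(mtemp),
            "mclk_mhz": int(mclk),
            "pclk_mhz": int(pclk),
        })
    return rows

print(parse_dmon(dmon_sample)[1]["mtemp_c"])  # None
```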
$ nvidia-smi dmon -h

    GPU statistics are displayed in scrolling format with one line
    per sampling interval. Metrics to be monitored can be adjusted
    based on the width of terminal window. Monitoring is limited to
    a maximum of 4 devices. If no devices are specified, then up to
    first 4 supported devices under natural enumeration (starting
    with GPU index 0) are used for monitoring purpose.
    It is supported on Tesla, GRID, Quadro and limited GeForce products
    for Kepler or newer GPUs under x64 and ppc64 bare metal Linux.

    Usage: nvidia-smi dmon [options]

    Options include:
    [-i | --id]:          Comma separated Enumeration index, PCI bus ID or UUID
    [-d | --delay]:       Collection delay/interval in seconds [default=1sec]
    [-c | --count]:       Collect specified number of samples and exit
    [-s | --select]:      One or more metrics [default=puc]
                          Can be any of the following:
                              p - Power Usage and Temperature
                              u - Utilization
                              c - Proc and Mem Clocks
                              v - Power and Thermal Violations
                              m - FB and Bar1 Memory
                              e - ECC Errors and PCIe Replay errors
                              t - PCIe Rx and Tx Throughput
    [-o | --options]:     One or more from the following:
                              D - Include Date (YYYYMMDD) in scrolling output
                              T - Include Time (HH:MM:SS) in scrolling output
    [-f | --filename]:    Log to a specified file, rather than to stdout
    [-h | --help]:        Display help information

Toggle ECC mode

To enable ECC:

$ nvidia-smi -e 1

To disable ECC:

$ nvidia-smi -e 0

Set GPU clocks

Note that no matter what clock you lock the GPU at (even the maximum), GPU Boost might lower the clocks to stay within the power cap and thermal limits of the GPU.

To reset application clocks (on all GPUs, or on a specific GPU):

$ sudo nvidia-smi -rac
$ sudo nvidia-smi -i 9 -rac

To reset the locked graphics clock:

$ sudo nvidia-smi -rgc
$ sudo nvidia-smi -i 9 -rgc

To disable or enable persistence mode:

$ sudo nvidia-smi -pm 0
$ sudo nvidia-smi -i 9 -pm 0
$ sudo nvidia-smi -pm 1
$ sudo nvidia-smi -i 9 -pm 1

It is recommended to enable persistence mode before locking clocks. It is also recommended to disable auto boost before locking clocks:

$ sudo nvidia-smi --auto-boost-default=DISABLED

To set the application clocks of a GPU to a supported memory,graphics pair (here 1215 MHz memory, 900 MHz graphics):

$ sudo nvidia-smi -i 9 -ac 1215,900

To lock just the graphics clock:

$ sudo nvidia-smi -i 9 -lgc 900

To reset a GPU:

$ nvidia-smi -r

To allow non-root users to set application clocks:

$ nvidia-smi -acp UNRESTRICTED

MIG

nvidia-smi commands to deal with MIG can be found in my post How to use MIG.