2019-Apr-26 ⬩ Ashwin Nanjappa ⬩ cheatsheet, nvidia-smi
nvidia-smi (NVIDIA System Management Interface) is a tool to query, monitor and configure NVIDIA GPUs. It is installed along with the NVIDIA driver and is tied to that specific driver version. It is written using the NVIDIA Management Library (NVML).
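Since nvidia-smi is built on NVML, the same information can also be read programmatically. Here is a minimal sketch using the pynvml Python bindings (from the nvidia-ml-py package); the post itself only mentions the NVML C library, so the choice of bindings is an assumption:

# Query name, temperature and power draw of GPU 0 through NVML.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older bindings return bytes
        name = name.decode()
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)  # degrees C
    power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)                            # milliwatts
    print(f"{name}: {temp} C, {power_mw / 1000.0:.1f} W")
finally:
    pynvml.nvmlShutdown()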
To get a quick overview of all the GPUs, run it without any options:
$ nvidia-smi
This outputs a summary table showing the driver and CUDA versions and, for every GPU: its name, fan speed, temperature, power draw and limit, memory usage, utilization and the compute processes running on it.
To query the full details of all GPUs:
$ nvidia-smi -q
Some of the information reported here includes the product name, VBIOS version, persistence mode, PCI details, clocks, temperature, power readings, utilization and ECC mode.
To query the parameters of a particular GPU, use its index:
$ nvidia-smi -q -i 9
To view only specific sections of this detailed information, pass the section names to -d:
$ nvidia-smi -q -d UTILIZATION,CLOCK
To list the supported pairs of memory and graphics clock values:
$ nvidia-smi -q -d SUPPORTED_CLOCKS
$ nvidia-smi -q -d SUPPORTED_CLOCKS -i 3
Typically, there are only a few supported memory clock values, while the number of supported graphics clock values is high with a fine granularity.
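As a rough illustration of that, the sketch below parses the human-readable SUPPORTED_CLOCKS output and counts how many graphics clock values are offered for each memory clock. The "Memory : N MHz" / "Graphics : N MHz" line layout is an assumption about the output format, so treat it as illustrative only:

# Count supported (memory, graphics) clock pairs for GPU 0.
import subprocess
from collections import defaultdict

out = subprocess.run(
    ["nvidia-smi", "-q", "-d", "SUPPORTED_CLOCKS", "-i", "0"],
    capture_output=True, text=True, check=True,
).stdout

pairs = defaultdict(list)   # memory clock (MHz) -> supported graphics clocks (MHz)
mem_clock = None
for line in out.splitlines():
    key, _, value = line.partition(":")
    key, value = key.strip(), value.strip()
    if key == "Memory" and value.endswith("MHz"):
        mem_clock = int(value.split()[0])
    elif key == "Graphics" and value.endswith("MHz") and mem_clock is not None:
        pairs[mem_clock].append(int(value.split()[0]))

for mem_clock, graphics_clocks in sorted(pairs.items()):
    print(f"{mem_clock} MHz memory: {len(graphics_clocks)} graphics clock values")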
To query specific properties of the GPUs and get the output in CSV format:
$ nvidia-smi --query-gpu=property1,property2,property3 --format=csv
To list all the properties that can be queried:
$ nvidia-smi --help-query-gpu
You should be able to find pretty much every detail of your GPU categorized under one of the properties.
To query repeatedly at a fixed interval (here, every 100 milliseconds):
$ nvidia-smi --query-gpu=property1,property2 --format=csv --loop-ms=100
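Instead of --loop-ms, the querying can also be driven from a script, which makes it easy to act on the values. A minimal sketch, assuming the csv,noheader,nounits variant of --format and a few of the properties listed later in this post:

# Poll a few --query-gpu properties once per second and parse the CSV output.
import subprocess, time

PROPS = ["gpu_name", "temperature.gpu", "power.draw", "utilization.gpu"]

def query_gpus():
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(PROPS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    # One CSV line per GPU, values in the same order as PROPS.
    return [dict(zip(PROPS, (v.strip() for v in line.split(","))))
            for line in out.strip().splitlines()]

while True:  # Ctrl-C to stop
    for idx, gpu in enumerate(query_gpus()):
        print(idx, gpu)
    time.sleep(1)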
For example, to query the PCI device ID of the GPU:
$ nvidia-smi --query-gpu=pci.device_id --format=csv
pci.device_id
0x1D8110DE
In the above output, 0x1D81 is the device ID and 0x10DE is the NVIDIA vendor ID.
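In other words, the device ID sits in the upper 16 bits and the vendor ID in the lower 16 bits of that 32-bit value, which a couple of lines of Python can confirm:

# Split pci.device_id into its device ID and vendor ID halves.
raw = 0x1D8110DE
device_id = raw >> 16       # 0x1d81
vendor_id = raw & 0xFFFF    # 0x10de (NVIDIA)
print(hex(device_id), hex(vendor_id))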
To query the PCI bus IDs of all the GPUs:
$ nvidia-smi --query-gpu=pci.bus_id --format=csv
pci.bus_id
00000000:01:00.0
00000000:47:00.0
00000000:81:00.0
00000000:C2:00.0
To set the power limit of a GPU (in watts):
$ nvidia-smi -i 0 -pl 250
If you try to set a power limit outside the supported range, the command will complain and refuse to apply it. This command also seems to disable persistence mode, so you will need to enable that again. You may also need to set the GPU clocks again after this change.
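A small sketch that checks a requested limit against the GPU's reported range before applying it; power.min_limit and power.max_limit are the properties listed below, and the final nvidia-smi -pl command is only printed here since it needs root:

# Validate a requested power limit against the GPU's min/max limits.
import subprocess

def power_limits(gpu_index=0):
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.min_limit,power.max_limit",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    lo, hi = (float(v) for v in out.strip().split(","))
    return lo, hi

requested = 250.0  # watts
lo, hi = power_limits(0)
if lo <= requested <= hi:
    print(f"sudo nvidia-smi -i 0 -pl {requested:.0f}")
else:
    print(f"{requested} W is outside the supported range {lo}-{hi} W")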
Some of the --query-gpu properties that I find useful:
gpu_name: Name of GPU
vbios_version: Version of VBIOS
compute_cap: CUDA compute capability
# Power
power.min_limit: Minimum power limit that can be set for GPU (watts)
power.max_limit: Maximum power limit that can be set for GPU (watts)
power.draw: Power being consumed by GPU at this moment (watts)
# Temperature
temperature.gpu: GPU temperature (C)
temperature.memory: HBM memory temperature (C)
# Current clock values
clocks.current.sm: Current SM clock frequency (MHz)
clocks.current.memory: Current memory clock frequency (MHz)
clocks.current.graphics: Current graphics (shader) clock frequency (MHz)
clocks.current.video: Current video encoder/decoder clock frequency (MHz)
utilization.gpu: Percent of sampling interval time that GPU was being used.
utilization.memory: Percent of sampling interval time that device memory was being used.
clocks_throttle_reasons.sw_power_cap: Clock is below requested clock because GPU is using too much power.
clocks_throttle_reasons.sw_thermal_slowdown: Clock is below requested clock because GPU temperature is too high. (Indicates better cooling needed.)
persistence_mode: Current persistence mode
ecc.mode.current: Current ECC mode
mig.mode.current: Current MIG mode
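The two throttle-reason properties above are handy for diagnosing why the clocks are lower than requested. A quick sketch that reports them per GPU; I am assuming the CSV values come back as "Active" / "Not Active", which is worth verifying on your driver version:

# Report software power-cap and thermal-slowdown throttling for each GPU.
import subprocess

REASONS = ["clocks_throttle_reasons.sw_power_cap",
           "clocks_throttle_reasons.sw_thermal_slowdown"]

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=" + ",".join(REASONS), "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for idx, line in enumerate(out.strip().splitlines()):
    power_cap, thermal = (v.strip() for v in line.split(","))
    print(f"GPU {idx}: power cap throttling={power_cap}, thermal slowdown={thermal}")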
To continuously monitor the GPUs, with one line of statistics printed per GPU at every sampling interval, use the dmon subcommand:
$ nvidia-smi dmon
# gpu   pwr  gtemp  mtemp    sm   mem   enc   dec   mclk   pclk
# Idx     W      C      C     %     %     %     %    MHz    MHz
    0    27     47     44     0     0     0     0    850    135
    1    25     33      -     0     1     0     0    405    300
    0    27     47     44     0     0     0     0    850    135
    1    25     33      -     0     1     0     0    405    300
To monitor only specific metrics, for example power and temperature (p) and clocks (c):
$ nvidia-smi dmon -s pc
# gpu   pwr  gtemp  mtemp   mclk   pclk
# Idx     W      C      C    MHz    MHz
    0    27     46     44    850    135
    1    25     33      -    405    300
    0    27     47     44    850    135
    1    30     33      -    405    300
$ nvidia-smi dmon -h
GPU statistics are displayed in scrolling format with one line
per sampling interval. Metrics to be monitored can be adjusted
based on the width of terminal window. Monitoring is limited to
a maximum of 4 devices. If no devices are specified, then up to
first 4 supported devices under natural enumeration (starting
with GPU index 0) are used for monitoring purpose.
It is supported on Tesla, GRID, Quadro and limited GeForce products
for Kepler or newer GPUs under x64 and ppc64 bare metal Linux.
Usage: nvidia-smi dmon [options]
Options include:
[-i | --id]: Comma separated Enumeration index, PCI bus ID or UUID
[-d | --delay]: Collection delay/interval in seconds [default=1sec]
[-c | --count]: Collect specified number of samples and exit
[-s | --select]: One or more metrics [default=puc]
Can be any of the following:
p - Power Usage and Temperature
u - Utilization
c - Proc and Mem Clocks
v - Power and Thermal Violations
m - FB and Bar1 Memory
e - ECC Errors and PCIe Replay errors
t - PCIe Rx and Tx Throughput
[-o | --options]: One or more from the following:
D - Include Date (YYYYMMDD) in scrolling output
T - Include Time (HH:MM:SS) in scrolling output
[-f | --filename]: Log to a specified file, rather than to stdout
[-h | --help]: Display help information
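Several of these options can be combined. As a sketch, the snippet below runs dmon with power and utilization metrics (-s pu), a time column (-o T), a 2-second interval (-d 2) and a fixed sample count (-c 5), and consumes the scrolling output line by line:

# Stream a bounded dmon run and skip its header lines.
import subprocess

proc = subprocess.Popen(
    ["nvidia-smi", "dmon", "-s", "pu", "-o", "T", "-d", "2", "-c", "5"],
    stdout=subprocess.PIPE, text=True,
)
for line in proc.stdout:
    if line.startswith("#"):
        continue            # header lines begin with '#'
    print(line.rstrip())
proc.wait()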
To enable ECC:
$ nvidia-smi -e 1
To disable ECC:
$ nvidia-smi -e 0
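ECC mode changes usually take effect only after the next reboot, so it is worth checking whether the pending mode differs from the current one. A sketch of that check; ecc.mode.pending as a property name is my assumption from --help-query-gpu, so verify it on your driver version:

# Compare current and pending ECC modes to see if a reboot is still needed.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=ecc.mode.current,ecc.mode.pending",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for idx, line in enumerate(out.strip().splitlines()):
    current, pending = (v.strip() for v in line.split(","))
    status = "reboot needed" if current != pending else "in effect"
    print(f"GPU {idx}: ECC current={current}, pending={pending} ({status})")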
Note that no matter what clock you lock the GPU to (even the maximum), GPU Boost might lower the clocks to stay within the power cap and thermal limits of the GPU.
To reset the application clocks to their default values, for all GPUs or for a specific GPU:
$ sudo nvidia-smi -rac
$ sudo nvidia-smi -i 9 -rac
To reset the locked GPU clocks to their default values:
$ sudo nvidia-smi -rgc
$ sudo nvidia-smi -i 9 -rgc
To disable persistence mode, for all GPUs or for a specific GPU:
$ sudo nvidia-smi -pm 0
$ sudo nvidia-smi -i 9 -pm 0
To enable persistence mode:
$ sudo nvidia-smi -pm 1
$ sudo nvidia-smi -i 9 -pm 1
It is recommended to enable persistence mode before locking clocks.
To disable auto boost of clocks by default:
$ sudo nvidia-smi --auto-boost-default=DISABLED
It is recommended to do this before locking clocks.
To set the application clocks, specify the memory and graphics clock values (in MHz) in mem,sm format:
$ sudo nvidia-smi -i 9 -ac 1215,900
To lock the GPU graphics clock to a particular value (in MHz):
$ sudo nvidia-smi -i 9 -lgc 900
To reset a GPU:
$ nvidia-smi -r
To allow application clocks to be set without root permissions:
$ nvidia-smi -acp UNRESTRICTED
nvidia-smi commands to deal with MIG can be found in my post How to use MIG.