Check NVLink® in Linux
Before verifying NVLink® support in the operating system, install the NVIDIA® drivers by following our guide Install the NVIDIA® driver on Linux. You also need to install the CUDA® toolkit in order to compile the sample applications. In this short guide, we have collected some useful commands you can use.
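Before continuing, you can quickly confirm that both components are in place. This assumes the driver is loaded and nvcc from the CUDA® toolkit is on your PATH:
nvidia-smi
nvcc --version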
Basic commands
Check the physical topology of your system. The following command shows all GPUs and their interconnections:
nvidia-smi topo -m
If you want to view the status of the links, run the following command:
nvidia-smi nvlink -s
The command displays the speed of each link. You can also view the capabilities of a specific GPU by specifying its index:
nvidia-smi nvlink -i 0 -c
Without this option, information about the connections of all GPUs is displayed:
nvidia-smi nvlink -c
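If you want a quick scripted check, you can also capture the status output from the -s command above and flag links that are reported as down. This is a minimal sketch, assuming your driver marks such links as "inactive" in that output (the exact wording can vary between driver versions):
# Capture the NVLink status of all GPUs and warn about links reported as inactive
status=$(nvidia-smi nvlink -s)
echo "$status"
if echo "$status" | grep -qi inactive; then
    echo "WARNING: at least one NVLink link is reported as inactive"
fi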
Install the CUDA samples
A good way to test bandwidth is to use NVIDIA®'s sample applications. The source code of these samples is published on GitHub and is available to everyone. Start by cloning the repository onto the server:
git clone https://github.com/NVIDIA/cuda-samples.git
Change to the directory of the downloaded repository:
cd cuda-samples
Select the appropriate branch via its tag, according to the installed CUDA® version. For example, if you have CUDA® 12.2:
git checkout tags/v12.2
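If you are not sure which tags exist, you can list them first; this assumes the cuda-samples releases keep following the vMAJOR.MINOR naming used above:
git tag -l 'v12.*'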
Install some prerequisites that will be used in the compilation process:
sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
Now you can compile any sample. Go to the Samples directory:
cd Samples
Take a quick look at the contents:
ls -la
total 40
drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
drwxrwxr-x 6 usergpu usergpu 4096 Sep 13 14:54 ..
drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
drwxrwxr-x 6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
drwxrwxr-x 6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM
Let's test the GPU bandwidth. Change the directory:
cd 1_Utilities/bandwidthTest
Compile the app:
make
Run the tests
Start the tests by running the application by its name:
./bandwidthTest
The output might look like this:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA RTX A6000
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 6.0
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 6.6
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 569.2
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
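The sample also accepts command-line options. The flags below (--device, --memory, --mode, --csv) are the ones commonly exposed by bandwidthTest, but run ./bandwidthTest --help to confirm which are supported by your build:
./bandwidthTest --device=all --memory=pinned --mode=shmoo --csv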
Alternatively, you can compile and launch the p2pBandwidthLatencyTest:
cd 5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
This application shows detailed information about the bandwidth of your GPUs in P2P mode. Example output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 590.51 6.04
1 6.02 590.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 589.40 52.75
1 52.88 592.53
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 593.88 8.55
1 8.55 595.32
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 595.69 101.68
1 101.97 595.69
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.61 28.66
1 18.49 1.53
CPU 0 1
0 2.27 6.06
1 6.12 2.23
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.62 1.27
1 1.17 1.55
CPU 0 1
0 2.27 1.91
1 1.90 2.34
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
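If you want to limit the test to a subset of GPUs, you can use the CUDA_VISIBLE_DEVICES environment variable, which the CUDA® runtime honors for any of these samples. For example, to test only the first two devices:
CUDA_VISIBLE_DEVICES=0,1 ./p2pBandwidthLatencyTest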
With a configuration that has more GPUs, the output of p2pBandwidthLatencyTest might look like this:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1629.83 38.43 38.39 37.66 38.51 38.19 38.09 37.92
1 38.22 1637.04 35.52 35.59 38.15 38.38 38.08 37.55
2 37.76 35.62 1635.32 35.45 38.59 38.21 38.77 37.94
3 37.88 35.50 35.60 1639.45 38.49 37.43 38.72 38.49
4 36.87 37.03 37.00 36.90 1635.86 34.48 38.06 37.22
5 37.27 37.06 36.92 37.06 34.51 1636.18 37.80 37.50
6 37.05 36.95 37.45 37.15 37.51 37.96 1630.79 34.94
7 36.98 36.91 36.95 36.87 37.83 38.02 34.73 1633.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1635.22 34.42 33.84 256.54 27.74 28.68 28.00 28.41
1 34.66 1636.93 256.16 17.97 71.58 71.64 71.65 71.61
2 34.78 256.81 1655.79 30.29 70.34 70.42 70.37 70.33
3 256.65 30.65 70.67 1654.53 70.66 70.69 70.70 70.73
4 28.26 30.80 69.99 70.04 1630.36 256.45 69.97 70.02
5 28.10 31.08 71.60 71.59 256.47 1654.31 71.62 71.54
6 28.37 30.96 70.99 70.93 70.91 70.96 1632.12 257.11
7 27.66 30.87 70.30 70.40 70.30 70.39 256.72 1649.57
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1673.16 51.88 51.95 51.76 51.61 51.44 52.07 51.30
1 52.04 1676.28 39.06 39.21 51.62 51.62 51.98 51.36
2 52.11 39.27 1674.62 39.16 51.42 51.21 51.72 51.71
3 51.74 39.70 39.22 1672.77 51.50 51.27 51.70 51.24
4 52.14 52.41 51.38 52.14 1671.54 38.81 46.76 45.72
5 51.82 52.65 52.30 51.67 38.57 1676.33 46.90 45.96
6 52.92 52.66 53.02 52.68 46.23 46.31 1672.74 38.91
7 52.61 52.74 52.79 52.64 45.90 46.35 39.07 1673.16
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1670.31 52.41 140.69 508.68 139.85 141.88 141.71 140.55
1 141.69 1673.30 509.23 141.22 139.91 143.28 141.71 140.61
2 140.64 508.90 1669.67 140.68 139.93 140.61 140.67 140.50
3 509.14 141.36 140.61 1682.65 139.93 141.45 141.45 140.67
4 140.01 140.03 140.07 139.94 1670.68 508.37 140.01 139.90
5 141.92 143.17 140.50 141.19 508.92 1670.73 141.72 140.52
6 141.72 141.72 140.60 141.31 139.66 141.85 1671.51 510.03
7 140.62 140.71 140.66 140.63 140.02 140.72 509.77 1668.28
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.35 17.23 17.13 13.38 12.86 21.15 21.39 21.12
1 17.54 2.32 12.95 13.78 21.05 21.23 21.31 21.37
2 16.85 14.83 2.35 16.07 12.71 12.80 21.23 12.79
3 14.98 16.06 14.64 2.41 13.35 12.81 13.60 21.36
4 21.31 21.31 20.49 21.32 2.62 12.33 12.66 12.98
5 20.36 21.22 20.17 12.79 16.74 2.58 12.41 12.93
6 17.51 12.84 12.79 12.70 17.63 18.78 2.36 13.69
7 21.23 12.71 19.41 21.09 14.69 13.79 15.52 2.59
CPU 0 1 2 3 4 5 6 7
0 1.73 4.99 4.88 4.85 5.17 5.18 5.18 5.33
1 5.04 1.71 4.74 4.82 5.04 5.14 5.10 5.19
2 4.86 4.75 1.66 4.78 5.08 5.09 5.11 5.17
3 4.80 4.72 4.73 1.63 5.09 5.11 5.06 5.10
4 5.07 5.00 5.03 4.96 1.77 5.33 5.34 5.38
5 5.12 4.94 5.00 4.96 5.31 1.77 5.38 5.41
6 5.09 4.97 5.09 5.01 5.35 5.39 1.80 5.42
7 5.18 5.09 5.02 5.00 5.39 5.40 5.40 1.76
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.33 2.15 2.11 2.76 2.07 2.11 2.07 2.12
1 2.07 2.30 2.77 2.07 2.12 2.06 2.06 2.10
2 2.09 2.75 2.34 2.12 2.09 2.08 2.08 2.12
3 2.78 2.10 2.13 2.40 2.13 2.14 2.14 2.13
4 2.18 2.23 2.23 2.17 2.59 2.82 2.15 2.16
5 2.15 2.17 2.15 2.20 2.82 2.56 2.17 2.16
6 2.13 2.18 2.21 2.17 2.15 2.17 2.36 2.85
7 2.19 2.21 2.19 2.22 2.19 2.19 2.86 2.61
CPU 0 1 2 3 4 5 6 7
0 1.78 1.32 1.29 1.40 1.33 1.34 1.34 1.33
1 1.32 1.69 1.34 1.35 1.35 1.34 1.40 1.33
2 1.38 1.37 1.73 1.36 1.36 1.35 1.35 1.34
3 1.34 1.42 1.35 1.66 1.34 1.34 1.35 1.33
4 1.53 1.41 1.40 1.40 1.77 1.43 1.48 1.47
5 1.46 1.43 1.43 1.42 1.47 1.84 1.51 1.56
6 1.53 1.45 1.45 1.45 1.45 1.44 1.85 1.47
7 1.54 1.47 1.47 1.47 1.45 1.44 1.50 1.84
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
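As an additional correctness check, you can also build and run the simpleP2P sample, which copies data between two GPUs over the peer-to-peer path and verifies the result. The path below assumes the sample sits under 0_Introduction in the release you checked out:
cd ../../0_Introduction/simpleP2P
make
./simpleP2P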
Updated: 28.03.2025
Published: 06.05.2024