NVIDIA NCCL multi-GPU optimization
NCCL summary
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes.
Here are some key points about NCCL:
- Function: Enables high-speed communication between GPUs residing on a single node or across multiple nodes.
- Communication methods: Provides optimized routines for collective communication (all-gather, all-reduce, broadcast, etc.) and point-to-point communication (send/receive).
- Performance: Leverages high-bandwidth interconnects like PCI-Express, NVLink, and InfiniBand to achieve low latency communication.
- Usability: Can be integrated into various applications, including single-process and multi-process (MPI) programs.
NVLink is a wire-based communications protocol for near-range semiconductor communication, developed by NVIDIA, that can be used for data and control transfers in processor systems between CPUs and GPUs, and directly between GPUs.
Reference page: Accelerate distributed deep learning with OCI
Install
To install NCCL on the system, add the NVIDIA package repository for your distribution, then install the NCCL packages as root.
Debian/Ubuntu
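A typical sequence looks like the following (a sketch, assuming Ubuntu 22.04 on x86_64 and NVIDIA's CUDA network repository; adjust the distro path and keyring version for your system):

```shell
# Add NVIDIA's CUDA network repository (assumes Ubuntu 22.04, x86_64)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# Runtime library plus headers for building against NCCL
sudo apt-get install libnccl2 libnccl-dev
```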
RedHat/CentOS
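The equivalent steps with yum might look like this (a sketch, assuming RHEL/CentOS 8 on x86_64; adjust the repo path for your release):

```shell
# Add NVIDIA's CUDA network repository (assumes RHEL/CentOS 8, x86_64)
sudo yum-config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
# Runtime library, headers, and static library
sudo yum install libnccl libnccl-devel libnccl-static
```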
Tests
Tests for NCCL are maintained separately at https://github.com/nvidia/nccl-tests.
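Building and running the tests might look like this (a sketch, assuming CUDA and NCCL are already installed and the machine has 8 GPUs; lower `-g` for fewer GPUs):

```shell
# Build the NCCL performance tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
# Run all-reduce on 8 GPUs, message sizes from 8 bytes to 128 MB, doubling each step
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

The output reports per-size bus bandwidth, which is the usual figure of merit when comparing interconnects.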
NCCL Analysis
We can use NVIDIA Nsight Systems to analyze NCCL communication performance.
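For example, a profiling run might look like this (a sketch, assuming `nsys` is on the PATH and the nccl-tests binary from the previous section; the report name `nccl_report` is arbitrary):

```shell
# Profile an NCCL test run with Nsight Systems:
# --trace=cuda,nvtx captures CUDA kernel launches and NVTX ranges
nsys profile --trace=cuda,nvtx -o nccl_report ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# Summarize the collected report on the command line
nsys stats nccl_report.nsys-rep
```

The resulting `.nsys-rep` file can also be opened in the Nsight Systems GUI to inspect the communication timeline.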