NVIDIA NCCL multi-GPU optimization

NCCL

NCCL summary

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides collective routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive. These routines are optimized for high bandwidth and low latency over PCIe and NVLink interconnects within a node, and over NVIDIA Mellanox networking across nodes.

Here are some key points about NCCL:

  • Function: Enables high-speed communication between GPUs residing on a single node or across multiple nodes.
  • Communication methods: Provides optimized routines for collective communication (all-gather, all-reduce, broadcast, etc.) and point-to-point communication (send/receive).
  • Performance: Leverages interconnects such as PCI Express, NVLink, and InfiniBand to achieve high bandwidth and low latency.
  • Usability: Can be integrated into various applications, including single-process and multi-process (MPI) programs.
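To make the collective routines listed above more concrete, here is a toy model in plain shell. It does not call NCCL at all; each argument stands for the buffer held by one rank (GPU), and the function names `all_reduce_sum` and `all_gather` are hypothetical helpers chosen for illustration. It is only a sketch of what the two collectives compute.

```shell
# Toy model only: each argument stands for the value held by one rank (GPU).
# Nothing here calls NCCL; it only shows what the collectives compute.

# all_reduce_sum v0 v1 ...: every rank ends up with the sum of all inputs.
all_reduce_sum() {
  total=0
  for v in "$@"; do total=$((total + v)); done
  out=""
  for v in "$@"; do out="$out${out:+ }$total"; done
  echo "$out"
}

# all_gather v0 v1 ...: every rank ends up with the full list of inputs.
all_gather() {
  out=""
  for v in "$@"; do out="$out${out:+ }[$*]"; done
  echo "$out"
}

all_reduce_sum 1 2 3 4   # prints: 10 10 10 10
all_gather 1 2 3         # prints: [1 2 3] [1 2 3] [1 2 3]
```

In real NCCL the same result is produced on the GPUs themselves, with the data moving over the interconnects described above rather than through a single process.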

NVLink is a wire-based, short-range communication protocol developed by NVIDIA. It carries data and control traffic between CPUs and GPUs, or directly between GPUs.

NCCL RDMA

Reference page: Accelerate distributed deep learning with OCI

Install

To install NCCL system-wide, build a package from the root of the NCCL source tree, then install that package as root.

Debian/Ubuntu

$ # Install tools to create debian packages
$ sudo apt install build-essential devscripts debhelper fakeroot
$ # Build NCCL deb package
$ make pkg.debian.build
$ ls build/pkg/deb/

RedHat/CentOS

$ # Install tools to create rpm packages
$ sudo yum install rpm-build rpmdevtools
$ # Build NCCL rpm package
$ make pkg.redhat.build
$ ls build/pkg/rpm/
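The listings above stop after building the packages. As a minimal sketch of the remaining install step, the script below prints a candidate install command for each package it finds under `build/pkg`. The helper name `install_cmd` is hypothetical, and the choice of `dpkg -i` and `rpm -ivh` is an assumption; your distribution's package manager (`apt`, `yum`) may be preferable. The script only prints commands and runs nothing as root.

```shell
# Hypothetical helper: print an install command for one package file.
# Assumption: dpkg handles .deb files and rpm handles .rpm files.
install_cmd() {
  case "$1" in
    *.deb) printf 'sudo dpkg -i %s\n' "$1" ;;
    *.rpm) printf 'sudo rpm -ivh %s\n' "$1" ;;
    *) echo "unsupported package: $1" >&2; return 1 ;;
  esac
}

# Print one command per package found under build/pkg.
# Prints nothing if no packages have been built yet.
find build/pkg -type f \( -name '*.deb' -o -name '*.rpm' \) 2>/dev/null |
while read -r pkg; do
  install_cmd "$pkg"
done
```

Review the printed commands before running them, since they modify system libraries.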

Tests

Tests for NCCL are maintained separately at https://github.com/nvidia/nccl-tests.

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>
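In the command above, `-b` and `-e` set the smallest and largest message sizes, `-f` is the multiplication factor between successive sizes, and `-g` is the number of GPUs per thread. The sketch below fills in `<ngpus>` automatically and turns on NCCL's debug logging through the `NCCL_DEBUG` environment variable. The helper `detect_ngpus` is hypothetical, and falling back to one GPU when `nvidia-smi` is unavailable is an assumption made so the script degrades gracefully.

```shell
# Hypothetical helper: count GPUs listed by nvidia-smi, falling back to 1.
detect_ngpus() {
  n=0
  if command -v nvidia-smi >/dev/null 2>&1; then
    # nvidia-smi -L prints one "GPU <index>: ..." line per device.
    n="$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ')"
  fi
  [ "$n" -ge 1 ] 2>/dev/null || n=1
  echo "$n"
}

ngpus="$(detect_ngpus)"
echo "Using $ngpus GPU(s)"

# Run from the nccl-tests checkout; skip cleanly if the binary is not built.
if [ -x ./build/all_reduce_perf ]; then
  # NCCL_DEBUG=INFO makes NCCL log its version, topology, and transport choices.
  NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 256M -f 2 -g "$ngpus"
else
  echo "all_reduce_perf not found; run make in nccl-tests first" >&2
fi
```

The debug log is a quick way to check which interconnect NCCL actually selected before reading the bandwidth figures.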

NCCL Analysis

nccl analysis

We can use NVIDIA Nsight Systems to analyze NCCL performance.

nvidia nsight system
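As a minimal sketch of such an analysis, the script below profiles the `all_reduce_perf` test from the Tests section with the Nsight Systems command-line tool `nsys`. The report name `nccl_allreduce` and the `NGPUS` variable are assumptions introduced here. NVTX tracing is requested on the assumption that the NCCL build emits NVTX ranges, and the `.nsys-rep` extension applies to recent Nsight Systems releases (older ones may write a different extension).

```shell
report="nccl_allreduce"   # hypothetical report name
ngpus="${NGPUS:-1}"       # assumption: GPU count supplied via NGPUS, default 1

if command -v nsys >/dev/null 2>&1 && [ -x ./build/all_reduce_perf ]; then
  # Trace CUDA API calls, kernels, and NVTX ranges while the test runs.
  nsys profile --trace=cuda,nvtx --output="$report" --force-overwrite=true \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g "$ngpus"
  # Print summary tables from the generated report.
  nsys stats "$report.nsys-rep"
else
  echo "nsys or ./build/all_reduce_perf not found; skipping profile run" >&2
fi
```

The generated report can also be opened in the Nsight Systems GUI to inspect the timeline of NCCL kernels alongside other GPU activity.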
