Run NCCL tests on 16 H100 GPUs with MPIJob

Submit the MPIJob:

kubectl apply -f - <<EOF
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-tests-h100
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: ghcr.io/coreweave/nccl-tests:12.2.2-cudnn8-devel-ubuntu20.04-nccl2.19.3-1-3e0fbc3
              name: nccl
              env:
                - name: OMPI_ALLOW_RUN_AS_ROOT
                  value: "1"
                - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                  value: "1"
              # Uncomment to exec into the launcher pod for interactive testing
              # command: ['sleep', '86400']
              command: ["/bin/bash", "-c"]
              args: [
                  "mpirun \
                  -np 16 \
                  -bind-to none \
                  -x LD_LIBRARY_PATH \
                  -x NCCL_SOCKET_IFNAME=eth0 \
                  -x NCCL_IB_HCA=mlx5 \
                  -x NCCL_ALGO=Ring \
                  -x NCCL_IB_QPS_PER_CONNECTION=4 \
                  -x NCCL_CROSS_NIC=1 \
                  /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1 \
                  ",
                ]
 
              resources:
                limits:
                  cpu: 2
                  memory: 4Gi
          enableServiceLinks: false
          automountServiceAccountToken: false
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            job: nccl-test
        spec:
          containers:
            - image: ghcr.io/coreweave/nccl-tests:12.2.2-cudnn8-devel-ubuntu20.04-nccl2.19.3-1-3e0fbc3
              name: nccl
              resources:
                limits:
                  cpu: 160
                  memory: 1920Gi
                  nvidia.com/gpu: 8
                  rdma/hca: 1
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          volumes:
            - emptyDir:
                medium: Memory
              name: dshm
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: nvidia.com/gpu
                        operator: In
                        values:
                          - h100-nvlink-80gb
          enableServiceLinks: false
          automountServiceAccountToken: false
EOF
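
The `-np 16` passed to `mpirun` has to agree with the slots the job exposes: 2 worker replicas times `slotsPerWorker: 8`. A minimal Python sketch of that sanity check follows; the function name `total_ranks` is made up for illustration and is not part of any MPIJob tooling.

```python
def total_ranks(worker_replicas: int, slots_per_worker: int) -> int:
    """Number of MPI ranks the job can host, assuming one rank per GPU slot."""
    return worker_replicas * slots_per_worker


# Values copied from the manifest above: Worker replicas and slotsPerWorker.
np_flag = 16
assert total_ranks(worker_replicas=2, slots_per_worker=8) == np_flag
```

If the worker count or `slotsPerWorker` changes, `-np` should be updated to the new product.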

H100 GPUs are not currently available on this platform.

Check the MPIJob status:

kubectl get mpijob
kubectl describe mpijob nccl-tests-h100
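
For scripted checks, the same status can be read from `kubectl get mpijob nccl-tests-h100 -o json`. The sketch below assumes the job reports its state under `status.conditions` with `type` and `status` fields, and that the last condition set to `"True"` reflects the current state. The helper names and the sample payload are hypothetical.

```python
import json
import subprocess


def current_condition(mpijob: dict) -> str:
    """Return the type of the last condition whose status is "True"."""
    conditions = mpijob.get("status", {}).get("conditions", [])
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else "Unknown"


def fetch_mpijob(name: str) -> dict:
    """Read the MPIJob object as JSON through kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "mpijob", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Hypothetical status payload, used only to show the shape being read.
sample = {"status": {"conditions": [
    {"type": "Created", "status": "True"},
    {"type": "Running", "status": "False"},
    {"type": "Succeeded", "status": "True"},
]}}
print(current_condition(sample))
```

Against a live cluster, `current_condition(fetch_mpijob("nccl-tests-h100"))` would replace the sample payload.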

View the NCCL test results:

kubectl logs nccl-tests-h100-launcher-xxxxx
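
nccl-tests closes its output with a summary line of the form `# Avg bus bandwidth    : <value>`. To compare runs automatically, that value can be pulled out of the saved launcher log. This is a rough sketch; the function name `avg_bus_bandwidth` is an illustrative choice, and the sample text is copied from the reference output in this document.

```python
import re
from typing import Optional


def avg_bus_bandwidth(log_text: str) -> Optional[float]:
    """Extract the average bus bandwidth (GB/s) from nccl-tests output."""
    match = re.search(
        r"^#\s*Avg bus bandwidth\s*:\s*([0-9.]+)", log_text, re.MULTILINE
    )
    return float(match.group(1)) if match else None


# Tail of the reference output from this document.
sample_log = "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 388.893 \n#\n"
print(avg_bus_bandwidth(sample_log))
```

The function returns `None` when the summary line is missing, for example if the job failed before the test finished.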

Reference test results:

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
536870912     134217728     float     sum      -1   2622.9  204.69  383.79      0   2622.4  204.72  383.86      0
1073741824     268435456     float     sum      -1   5200.4  206.47  387.14      0   5199.4  206.51  387.21      0
2147483648     536870912     float     sum      -1    10308  208.32  390.61      0    10389  206.70  387.57      0
4294967296    1073741824     float     sum      -1    20528  209.22  392.29      0    20530  209.21  392.27      0
8589934592    2147483648     float     sum      -1    41145  208.77  391.45      0    41008  209.47  392.76      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 388.893 
#
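
To read the table: `algbw` is the message size divided by the elapsed time, and for all-reduce nccl-tests reports `busbw` as `algbw * 2 * (n - 1) / n`, where `n` is the number of ranks (16 here). The sketch below recomputes the first row under those definitions. The function names are illustrative, and 1 GB is taken as 1e9 bytes.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes divided by elapsed seconds."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbps(algbw: float, ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw * 2 * (ranks - 1) / ranks


# First row of the table above: 536870912 B in 2622.9 us across 16 ranks.
algbw = algbw_gbps(536870912, 2622.9)
busbw = allreduce_busbw_gbps(algbw, 16)
print(f"{algbw:.2f} {busbw:.2f}")  # prints "204.69 383.79"
```

The recomputed values agree with the out-of-place `algbw` and `busbw` columns of that row after rounding.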

Run NCCL tests on eight 3090 GPUs with MPIJob