Verify MPIJob

Create an MPIJob that runs a Pi computation:

kubectl apply -f - << EOF
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
spec:
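  slotsPerWorker: 1                  # MPI slots (processes) per Worker Pod
  runPolicy:
    cleanPodPolicy: Running          # when the job ends, delete Pods that are still running
    ttlSecondsAfterFinished: 3600    # delete the finished MPIJob after 1 hour
  sshAuthMountPath: /home/mpiuser/.ssh   # SSH key mount path for the non-root user (uid 1000)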
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - -n
            - "2"                    # total ranks = Worker replicas (2) x slotsPerWorker (1)
            - /home/mpiuser/pi
            resources:
              limits:
                cpu: 1
                memory: 1Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            # Workers only run sshd; the launcher starts the MPI ranks on them over SSH
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                cpu: 1
                memory: 1Gi
EOF
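The mpirun process count (-n 2) has to fit in the slots the Worker Pods provide, that is Worker replicas (2) x slotsPerWorker (1). With Open MPI, asking for more processes than there are slots typically makes mpirun fail unless oversubscription is allowed. The sketch below is a hypothetical helper (check_np is not part of kubectl or the MPI Operator) that does this arithmetic before you edit either value.

```shell
# Hypothetical helper: check that an mpirun process count (-n) fits in the
# slots provided by the Worker Pods (Worker replicas x slotsPerWorker).
# Usage: check_np WORKER_REPLICAS SLOTS_PER_WORKER NP
check_np() {
  replicas=$1
  slots=$2
  np=$3
  total=$((replicas * slots))
  if [ "$np" -gt "$total" ]; then
    # Report on stderr and signal failure to the caller.
    echo "mpirun -n $np exceeds $total available slots" >&2
    return 1
  fi
  echo "ok: $np of $total slots used"
}
```

For the manifest above, check_np 2 1 2 succeeds; raising -n to 4 without adding Worker replicas or slots would be flagged.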

Check the MPIJob status:

kubectl get mpijob
kubectl describe mpijob pi
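Instead of polling kubectl get mpijob, you can block until the job finishes. The sketch below wraps kubectl wait in a hypothetical helper (wait_mpijob is not a kubectl command); it assumes the operator reports a condition of type Succeeded under .status.conditions, which you can confirm in the kubectl describe mpijob pi output. The 300s default timeout is an arbitrary choice.

```shell
# Hypothetical helper: block until the MPIJob reports the Succeeded condition.
# Assumes the operator sets .status.conditions with type "Succeeded".
# Usage: wait_mpijob NAME [TIMEOUT]
wait_mpijob() {
  name="${1:?usage: wait_mpijob NAME [TIMEOUT]}"
  timeout="${2:-300s}"
  kubectl wait --for=condition=Succeeded "mpijob/$name" --timeout="$timeout"
}
```

For example, wait_mpijob pi 120s returns non-zero if the job has not succeeded within two minutes.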

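Check the Pod status (replace pi-launcher-xxxxx with the actual launcher Pod name listed by kubectl get pods):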

kubectl get pods
kubectl describe pod pi-launcher-xxxxx
kubectl logs pi-launcher-xxxxx
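The launcher Pod name carries a generated suffix (pi-launcher-xxxxx), which suggests the operator creates the launcher as a batch Job named pi-launcher. Under that assumption, kubectl logs can address the Job directly, so you do not have to copy the suffix. The helper name launcher_logs is hypothetical.

```shell
# Hypothetical helper: print the launcher's logs without looking up the
# generated Pod suffix. Assumes the launcher is a batch Job named
# <MPIJob name>-launcher (kubectl logs accepts job/NAME).
# Usage: launcher_logs MPIJOB_NAME
launcher_logs() {
  name="${1:?usage: launcher_logs MPIJOB_NAME}"
  kubectl logs "job/${name}-launcher"
}
```

If kubectl reports that the Job is not found, fall back to the Pod name shown by kubectl get pods.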

Delete the MPIJob:

kubectl delete mpijob pi

More usage guides: