Verifying PyTorchJob

Submit a PyTorchJob that runs a training job on the MNIST dataset, using one Master replica and one Worker replica.

kubectl apply -f - << EOF
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
EOF

Check the PyTorchJob status:

kubectl get pytorchjobs
kubectl describe pytorchjobs
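
To block until training finishes instead of polling the status by hand, `kubectl wait` can watch the job's conditions. The following is a minimal sketch, assuming the training operator sets a `Succeeded` condition on the PyTorchJob when it completes. The function name `wait_for_pytorchjob` and the 600s default timeout are illustrative and not part of the original steps.

```shell
# Hypothetical helper: wait until a PyTorchJob reports the Succeeded condition.
# Assumes the operator publishes a condition of type "Succeeded" in .status.conditions.
wait_for_pytorchjob() {
  job="${1:-pytorch-simple}"   # job name, defaults to the example job
  timeout="${2:-600s}"         # illustrative default; tune to your cluster
  kubectl wait --for=condition=Succeeded "pytorchjob/${job}" --timeout="${timeout}"
}
```

Call it as `wait_for_pytorchjob pytorch-simple 10m`. A non-zero exit status means the condition was not reached within the timeout, in which case `kubectl describe pytorchjobs` shows the recorded events.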

Check the Pod status:

kubectl get pods
kubectl describe pod pytorch-simple-master-0
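
In a busy namespace, `kubectl get pods` returns many unrelated Pods. A label selector narrows the list to this job. The sketch below assumes the operator labels its Pods with `training.kubeflow.org/job-name`; the label key differs between operator releases, so confirm it first with `kubectl get pods --show-labels`. The function name `list_job_pods` is illustrative.

```shell
# Hypothetical helper: list only the Pods that belong to one PyTorchJob.
# Assumption: Pods carry the label training.kubeflow.org/job-name=<job>.
list_job_pods() {
  job="${1:-pytorch-simple}"
  kubectl get pods -l "training.kubeflow.org/job-name=${job}" -o wide
}
```

With the example job, `list_job_pods` should show the Master Pod and the Worker Pod together with the nodes they were scheduled on.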

View the Pod logs:

kubectl logs pytorch-simple-master-0
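
The command above prints the Master logs once. To stream logs while training runs, or to read the Worker's output, a small wrapper can build the Pod name. This sketch assumes Pods are named `<job>-<replica type>-<index>`, a pattern inferred from `pytorch-simple-master-0`; the function name `follow_replica_logs` is illustrative.

```shell
# Hypothetical helper: stream logs from one replica of a PyTorchJob.
# Assumption: Pod names follow <job>-<replica>-<index>, e.g. pytorch-simple-master-0.
follow_replica_logs() {
  job="${1:-pytorch-simple}"
  replica="${2:-master}"   # "master" or "worker"
  index="${3:-0}"          # replica index, starting at 0
  kubectl logs -f "${job}-${replica}-${index}"
}
```

For example, `follow_replica_logs pytorch-simple worker 0` streams the Worker's output until the container exits or you press Ctrl+C.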

Delete the PyTorchJob:

kubectl delete pytorchjobs pytorch-simple
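
To confirm that the cleanup succeeded, you can check that the job object no longer exists. The sketch below relies on `kubectl get --ignore-not-found`, which prints nothing when the object is absent; the function name `job_is_gone` is illustrative.

```shell
# Hypothetical helper: succeed (exit 0) only if the PyTorchJob no longer exists.
job_is_gone() {
  job="${1:-pytorch-simple}"
  # With --ignore-not-found, kubectl prints nothing for a missing object,
  # so an empty result means the job has been deleted.
  [ -z "$(kubectl get pytorchjob "${job}" --ignore-not-found -o name)" ]
}
```

Usage: `job_is_gone pytorch-simple && echo "deleted"`. The check also returns success if the job never existed, so it is only meaningful right after the delete command.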