Verify PyTorchJob
Submit a PyTorchJob that runs a training job on the MNIST dataset.
kubectl apply -f - << EOF
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
EOF
Check the PyTorchJob status:
kubectl get pytorchjobs
kubectl describe pytorchjobs
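Instead of polling the status by hand, it may be convenient to block until the job finishes. The sketch below wraps kubectl wait in a small helper; the function name wait_for_pytorchjob and the default timeout of 600s are illustrative assumptions, and it relies on the job reporting a Succeeded condition in its status, which should be confirmed against the installed operator version.

```shell
# Hypothetical helper (not part of the original guide): block until the named
# PyTorchJob reports the Succeeded condition, or until the timeout expires.
# Assumes the operator sets a "Succeeded" condition in the job status.
wait_for_pytorchjob() {
  job_name="$1"
  timeout="${2:-600s}"   # illustrative default; adjust to the expected run time
  kubectl wait --for=condition=Succeeded "pytorchjob/${job_name}" --timeout="${timeout}"
}
```

For the job above, the call would be: wait_for_pytorchjob pytorch-simple 900s. If the job fails, the command simply runs until the timeout, so kubectl describe is still the place to look for the cause.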
Check the Pod status:
kubectl get pods
kubectl describe pod pytorch-simple-master-0
View the Pod logs:
kubectl logs pytorch-simple-master-0
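The job above also starts one Worker replica, so its logs can be worth checking alongside the master's. The following is a minimal sketch; the helper name pytorchjob_logs is hypothetical, and the worker Pod name is an assumption that follows the same naming pattern as pytorch-simple-master-0. Run kubectl get pods first to confirm the actual names.

```shell
# Hypothetical helper (not part of the original guide): print logs from the
# master Pod and the single worker Pod of a PyTorchJob.
# Assumes Pods are named <job>-master-0 and <job>-worker-0.
pytorchjob_logs() {
  job_name="$1"
  for pod in "${job_name}-master-0" "${job_name}-worker-0"; do
    echo "==> ${pod}"
    kubectl logs "${pod}"
  done
}
```

For the job above, the call would be: pytorchjob_logs pytorch-simple.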
Delete the PyTorchJob:
kubectl delete pytorchjobs pytorch-simple