Deploying an Inference Service with BentoML
This article demonstrates how to deploy an inference service with BentoML, and along the way shows how to build a UI with Gradio for accessing that service.
Application Overview
The application generates text from images, using the Qwen-VL-Chat model. BentoML exposes the model through an externally accessible RESTful API, and Gradio provides a UI on top of it so users can operate it easily. The walkthrough has four parts: building the inference image, building the UI image, deploying to k8s, and using the application.
Building the Inference Image
BentoML is a framework for serving model inference. It covers service API definition, input/output parameter definition and validation, containerization, publishing, testing, client invocation, observability configuration, and more. This article only makes light use of two of these features, service API definition and containerization; for more advanced usage, consult the official documentation.
- First, download the model with the following Python script:
from modelscope import snapshot_download
model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.1.0'
model_dir = snapshot_download(model_id, revision=revision)
print(f"model_dir is {model_dir}")
- Then write a Python script to create a BentoML model from the downloaded files:
import shutil

import bentoml

# Path printed by the download step above
local_model_dir = '/home/zhoujg/.cache/modelscope/hub/qwen/Qwen-VL-Chat'

with bentoml.models.create(
    name='my-local-qwen-vl-chat',  # Name of the model in the Model Store
) as model_ref:
    # Copy the entire model directory into the BentoML Model Store
    shutil.copytree(local_model_dir, model_ref.path, dirs_exist_ok=True)
    print(f"Model saved: {model_ref}")
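Since snapshot_download returns the local model directory, the two scripts can also be merged so the cache path does not have to be hard-coded; a minimal sketch:

import shutil

import bentoml
from modelscope import snapshot_download

# Download (or reuse the cached copy), then copy straight into the
# BentoML Model Store, avoiding the hard-coded cache path above.
model_dir = snapshot_download('qwen/Qwen-VL-Chat', revision='v1.1.0')
with bentoml.models.create(name='my-local-qwen-vl-chat') as model_ref:
    shutil.copytree(model_dir, model_ref.path, dirs_exist_ok=True)
    print(f"Model saved: {model_ref}")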
Run the script, then check that the model landed in BentoML's model store with:
bentoml models list
which should print something like:
Tag Module Size Creation Time
my-local-qwen-vl-chat:rui7y4snt6j6qaaw 18.00 GiB 2024-07-29 19:41:53
- Next, write a Python file that defines the model's service API, i.e. the RESTful interface, as shown below. If you are familiar with invoking the model directly, the code looks very much like calling it from a Jupyter notebook, which is why BentoML is developer-friendly and easy to pick up. The model name used in the code is the alias defined in the Bento build file (shown later).
from __future__ import annotations

import bentoml
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Load the model previously saved into BentoML's model store
model_dir = bentoml.models.get("my-local-qwen-vl-chat")

torch.manual_seed(1234)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)


@bentoml.service(
    resources={"gpu": 1},  # comment this out if no GPU resource is available
    traffic={"timeout": 10},
)
class QwenVLChat:
    # Initialize the backend model
    def __init__(self) -> None:
        self.model = AutoModelForCausalLM.from_pretrained(
            model_dir, device_map="auto", trust_remote_code=True).eval()
        self.model.generation_config = GenerationConfig.from_pretrained(
            model_dir, trust_remote_code=True)

    # RESTful API definition with two parameters. No parameter validation
    # is added here, although BentoML provides annotations for it (see the
    # sketch after this code).
    @bentoml.api
    def generate(self, image_path: str, do_what: str) -> str:
        query = tokenizer.from_list_format([{
            'image': image_path
        }, {
            'text': do_what
        }])
        response, history = self.model.chat(tokenizer,
                                            query=query,
                                            history=None)
        return response
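As noted in the comment above, BentoML can validate inputs declared in the API signature (it builds a pydantic model from it). A minimal hedged sketch, assuming BentoML >= 1.2 with pydantic v2; the service name and constraints here are illustrative, not part of the original project:

from typing import Annotated

import bentoml
from pydantic import Field


@bentoml.service
class ValidatedService:  # hypothetical service, for illustration only
    @bentoml.api
    def generate(
        self,
        # Constraints declared via Annotated/Field are enforced before the
        # handler runs; invalid requests are rejected with a client error.
        image_path: Annotated[str, Field(min_length=1)],
        do_what: Annotated[str, Field(min_length=1, max_length=2000)],
    ) -> str:
        return f"{image_path}: {do_what}"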
- Now build the inference image. First define a YAML file used to generate the Bento (BentoML's bundle of deployment files; each Bento has its own version). The service it references is the one defined above:
service: 'my_bentoml_service:QwenVLChat'
labels:
  owner: paratera-bds-team
  project: scclabs-example
include:
  - 'my_bentoml_service.py'
models:
  - "my-local-qwen-vl-chat:latest"
  - tag: "my-local-qwen-vl-chat:rui7y4snt6j6qaaw"  # the BentoML model tag generated above
    alias: "my-local-qwen-vl-chat"                 # model alias for the service to reference
python:
  requirements_txt: "./requirements.txt"
  lock_packages: false
  index_url: "https://mirrors.bfsu.edu.cn/pypi/web/simple"
  # no_index: False
  trusted_host:
    - "mirrors.bfsu.edu.cn"
docker:
  distro: debian
  python_version: "3.10"
With the YAML in place, first build the Bento with:
bentoml build
When the command succeeds, a Bento is produced:
██████╗ ███████╗███╗ ██╗████████╗ ██████╗ ███╗ ███╗██╗
██╔══██╗██╔════╝████╗ ██║╚══██╔══╝██╔═══██╗████╗ ████║██║
██████╔╝█████╗ ██╔██╗ ██║ ██║ ██║ ██║██╔████╔██║██║
██╔══██╗██╔══╝ ██║╚██╗██║ ██║ ██║ ██║██║╚██╔╝██║██║
██████╔╝███████╗██║ ╚████║ ██║ ╚██████╔╝██║ ╚═╝ ██║███████╗
╚═════╝ ╚══════╝╚═╝ ╚═══╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝╚══════╝
Successfully built Bento(tag="qwen_vl_chat:okjy4hcnnoineaav").
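At this point the Bento can be smoke-tested before containerizing: run bentoml serve qwen_vl_chat:okjy4hcnnoineaav on a machine with the required GPU and dependencies, then call it with the same client API the UI uses later; a minimal sketch (the image path is a placeholder):

import bentoml

# Calls the locally served Bento; /path/to/test.jpg is a placeholder.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    print(client.generate(image_path="/path/to/test.jpg",
                          do_what="Describe this image"))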
Next, generate the container image. *** Before running the build command below, a small trick is needed to speed up image creation: BentoML does not put a pip mirror into its Dockerfile template, so the package pre-install step is very slow. Modify base.j2 (located in ~/miniforge3/envs/[your-conda-env-name]/lib/python3.10/site-packages/bentoml/_internal/container/frontend/dockerfile/templates):
# at line 66 of base.j2, add the pip mirror:
{% call common.RUN(__enable_buildkit__) -%} {{ __pip_cache__ }} {% endcall -%} pip3 install {{value | bash_quote}} -i https://mirrors.bfsu.edu.cn/pypi/web/simple; exit 0
*** If working on a server, docker-buildx may also need to be installed with apt install docker-buildx.
Now run the command below to generate the container image. When it finishes, docker images will show a very large image, since the model weights are baked into it:
bentoml build --containerize
# or bentoml containerize qwen_vl_chat:3x3x6psogkvnkaaw
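Before pushing, the image can optionally be smoke-tested on a GPU host (assuming the NVIDIA container toolkit is installed; the serve arguments mirror the ones used in the k8s manifest below):
docker run --rm --gpus all -p 3000:3000 qwen_vl_chat:3x3x6psogkvnkaaw serve --host 0.0.0.0 --port 3000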
- Then push the image to the private registry:
docker tag qwen_vl_chat:3x3x6psogkvnkaaw cr.zw1.paratera.com/zhoujg/aps-explore-model-infer:latest
docker push cr.zw1.paratera.com/zhoujg/aps-explore-model-infer:latest
Building the UI Image
The UI is an application built with Gradio to make the model easy for users to try.
- The core code is below. It uses BentoML's client-side API, which shows how complete the stack is: both server and client sides are covered. Submitting the request with Python's requests library would also work (a hedged sketch follows the code below).
import os

import bentoml
import gradio as gr

# Store uploads under /data, the volume shared with the inference
# container in the k8s pod, so the inference service can read the
# image path that Gradio passes along.
os.environ['GRADIO_TEMP_DIR'] = '/data/gradio-data'


def generate(img, what):
    # print(f"==========>> {img}")
    with bentoml.SyncHTTPClient("http://localhost:3000") as client:
        result = client.generate(
            # image_path="/workspace/bds/imgs/horse.jpg",
            image_path=img,
            do_what=what)
        return result
    # return ""


# Build the Gradio UI
with gr.Blocks(title="Image-to-Text Demo with BentoML") as demo:
    with gr.Row():
        with gr.Column():
            img_show = gr.Image(label="Upload an image", type='filepath', height=480)
            do_what = gr.Text(label="Image-to-text instruction", placeholder="")
            button = gr.Button("Submit")
        with gr.Column():
            text_area = gr.TextArea(label="Text output")
    button.click(fn=generate, inputs=[img_show, do_what], outputs=text_area)
    # The example prompts ask the model to describe the image in the styles
    # of Cao Xueqin, Lu Xun, and Classical Chinese, each in at least 200
    # characters.
    gr.Examples(
        fn=generate,
        examples=[["./example/horse.jpg", "请用曹雪芹文体描述下图中内容,不少于200字"],
                  ["./example/horse.jpg", "请用鲁迅文体描述下图中内容,不少于200字"],
                  ["./example/horse.jpg", "请用文言文文体描述下图中内容,不少于200字"]],
        inputs=[img_show, do_what],
        outputs=text_area,
        # cache_examples=True
        run_on_click=True)

demo.launch(server_name="0.0.0.0", server_port=3100)
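As mentioned above, plain requests would work as well, since each @bentoml.api method is exposed as an HTTP POST endpoint taking the parameters as a JSON body; a hedged sketch (path and prompt are placeholders):

import requests

# POST to the /generate endpoint exposed by the BentoML service;
# the JSON keys mirror the parameters of the generate() API.
resp = requests.post(
    "http://localhost:3000/generate",
    json={"image_path": "/path/to/test.jpg",
          "do_what": "Describe this image"},
    timeout=60,
)
resp.raise_for_status()
print(resp.text)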
- The Dockerfile:
FROM python:3.10-slim
USER root
RUN pip config set global.index-url https://mirrors.bfsu.edu.cn/pypi/web/simple \
    && pip config set install.trusted-host mirrors.bfsu.edu.cn
RUN mkdir -p /root/code && mkdir -p /root/example
COPY requirements.txt /root
COPY code/*.py /root/code
COPY example/*.jpg /root/example
WORKDIR /root
RUN pip install -r requirements.txt
EXPOSE 3100
ENTRYPOINT [ "sh", "-c", "python code/img2text.py" ]
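The requirements.txt copied into the image is not shown in the original; judging from the imports in the UI script, it needs at least the following (an assumption; pin versions as appropriate):

bentoml
gradio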
Then build the image and push it to the private registry:
docker build --no-cache -t aps-explore-biz-show .
docker tag aps-explore-biz-show:latest cr.zw1.paratera.com/zhoujg/aps-explore-biz-show:latest
docker push cr.zw1.paratera.com/zhoujg/aps-explore-biz-show:latest
k8s Deployment
Here the inference service and the UI service are deployed in a single pod, as separate containers. Because the two containers share the pod's network namespace, the UI can reach the inference service at http://localhost:3000. The deployment manifest (deployment.yaml):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bentoml
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: bentoml
  replicas: 1
  template:
    metadata:
      labels:
        app: bentoml
    spec:
      containers:
        - name: aps-explore-model-infer
          image: cr.zw1.paratera.com/zhoujg/aps-explore-model-infer:latest
          args: ["serve", "--host", "0.0.0.0", "--port", "3000"]
          # command: ["/bin/sh"]
          # args: ["-c", "while true; do echo hello; sleep 10;done"]
          ports:
            - containerPort: 3000
          resources:
            limits:
              cpu: 2
              memory: 6Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: data
              subPath: bentoml
              mountPath: "/data"
        - name: aps-explore-biz-show
          image: cr.zw1.paratera.com/zhoujg/aps-explore-biz-show:latest
          ports:
            - containerPort: 3100
          resources:
            limits:
              cpu: 1
              memory: 2Gi
          volumeMounts:
            - name: data
              subPath: bentoml
              mountPath: "/data"
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: jupyter
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu
                    operator: In
                    values:
                      - rtx-3090
The Service (service.yaml) exposes both ports:
apiVersion: v1
kind: Service
metadata:
  name: bentoml
spec:
  selector:
    app: bentoml
  type: ClusterIP
  ports:
    - protocol: TCP
      name: http-aps-explore-model-infer
      port: 3000
      targetPort: 3000
    - protocol: TCP
      name: http-aps-explore-biz-show
      port: 3100
      targetPort: 3100
The Ingress (ingress.yaml) routes external traffic to the UI port:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bentoml
spec:
  ingressClassName: nginx
  rules:
    - host: aps-explore-img2txt.poc1-be9e3e9b62c8.ing.zw1.paratera.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: bentoml
                port:
                  number: 3100
Deploy everything with kustomize (e.g. kubectl apply -k .); the kustomization.yaml is:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - ingress.yaml
Using the Application
The UI works as follows: upload an image, enter an image-to-text instruction, and click Submit to produce the corresponding text. Clicking an entry under Examples runs that example directly.