
Deploying an Inference Service with BentoML

This article demonstrates how to use BentoML to deploy an inference service, and along the way shows how to build a UI with Gradio for accessing that service.

Application Overview

This is an image-to-text application built on the Qwen-VL-Chat model. BentoML provides an externally accessible RESTful interface for the model, and Gradio supplies a user interface on top of it for convenient operation. The walkthrough is organized into four parts: building the inference image, building the UI image, k8s deployment, and using the application.

Building the Inference Image

BentoML is a framework for serving model inference. It covers service interface definition, input/output parameter definition and validation, containerization, publishing, testing, client invocation, observability configuration, and much more. This article only makes light use of two features, service interface definition and containerization; for more advanced usage, refer to the official documentation.

  • First, download the model to be used, with the following Python script:
download_model.py
from modelscope import snapshot_download
 
model_id = 'qwen/Qwen-VL-Chat'
revision = 'v1.1.0'
model_dir = snapshot_download(model_id, revision=revision)
print(f"model_dir is {model_dir}")
  • Then write a Python script to create the BentoML model, as follows:
create_bentoml_model.py
import shutil
import bentoml
 
local_model_dir = '/home/zhoujg/.cache/modelscope/hub/qwen/Qwen-VL-Chat'
 
with bentoml.models.create(
    name='my-local-qwen-vl-chat', # Name of the model in the Model Store
) as model_ref:
    # Copy the entire model directory to the BentoML Model Store
    shutil.copytree(local_model_dir, model_ref.path, dirs_exist_ok=True)
    print(f"Model saved: {model_ref}")

Run this Python script, then use the following command to check whether the model has been saved into BentoML's model store:

bentoml models list

After it finishes, you should see:

 Tag                                     Module  Size       Creation Time
 my-local-qwen-vl-chat:rui7y4snt6j6qaaw          18.00 GiB  2024-07-29 19:41:53
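Besides the CLI, the stored model can also be looked up programmatically, which is handy for sanity checks in scripts (a minimal sketch using bentoml.models.get, the same API the service code uses below):

import bentoml
 
# Look up the model in the local Model Store; ":latest" resolves to the newest version
model_ref = bentoml.models.get("my-local-qwen-vl-chat:latest")
print(model_ref.tag)   # e.g. my-local-qwen-vl-chat:rui7y4snt6j6qaaw
print(model_ref.path)  # on-disk location of the copied model files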
  • Next, write a Python file that defines the model's service interface, i.e. the RESTful API, as shown below. If you are used to invoking models, the code looks much like calling the model from Jupyter, which is why BentoML is developer-friendly and easy to pick up. The model name used in the code is the model alias defined in the Bentofile.
my_bentoml_service.py
from __future__ import annotations
import bentoml
 
from modelscope import (AutoModelForCausalLM, AutoTokenizer, GenerationConfig)
import torch
 
# Load the model that was stored in BentoML's model store
model_dir = bentoml.models.get("my-local-qwen-vl-chat")
torch.manual_seed(1234)
 
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
 
 
@bentoml.service(
    resources={"gpu": 1}, # 如无GPU资源,可注释掉
    traffic={"timeout": 10},
)
class QwenVLChat:
 
    # Initialize the backend model
    def __init__(self) -> None:
        self.model = AutoModelForCausalLM.from_pretrained(
            model_dir, device_map="auto", trust_remote_code=True).eval()
        self.model.generation_config = GenerationConfig.from_pretrained(
            model_dir, trust_remote_code=True)
 
    # RESTful interface definition with two parameters. No parameter validation is added
    # here; BentoML does provide annotations for it (see the sketch after this file)
    @bentoml.api
    def generate(self, image_path: str, do_what: str) -> str:
 
        query = tokenizer.from_list_format([{
            'image': image_path
        }, {
            'text': do_what
        }])
        response, history = self.model.chat(tokenizer,
                                            query=query,
                                            history=None)
        return response
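As a side note on the validation mentioned in the comment above: below is a hypothetical variant of the generate signature using typing.Annotated with Pydantic Field constraints, which BentoML supports for @bentoml.api parameters. This is a sketch, not the interface actually deployed here:

from typing import Annotated
 
import bentoml
from pydantic import Field
 
 
@bentoml.service()
class QwenVLChatValidated:
 
    @bentoml.api
    def generate(
        self,
        # reject empty strings before the model is ever invoked
        image_path: Annotated[str, Field(min_length=1)],
        do_what: Annotated[str, Field(min_length=1, max_length=500)],
    ) -> str:
        ...  # same body as in QwenVLChat above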
 
  • Next comes building the inference image. First define a YAML file for generating the Bento (BentoML's bundle of deployment files; each Bento carries its own version). The service referenced in it is the service interface defined above. As follows:
bentofile.yaml
service: 'my_bentoml_service:QwenVLChat'
labels:
  owner: paratera-bds-team
  project: scclabs-example
include:
  - 'my_bentoml_service.py'
models:
  - "my-local-qwen-vl-chat:latest"
  - tag: "my-local-qwen-vl-chat:rui7y4snt6j6qaaw"  # 上面生成的BentoML model名
    alias: "my-local-qwen-vl-chat"  # 模型别名,便于service引用
python:
  requirements_txt: "./requirements.txt"
  lock_packages: false
  index_url: "https://mirrors.bfsu.edu.cn/pypi/web/simple"
  # no_index: False
  trusted_host:
    - "mirrors.bfsu.edu.cn"
docker:
  distro: debian
  python_version: "3.10"

With the YAML file defined, first build the Bento with the following command:

bentoml build

When the command succeeds, a Bento is generated, as shown below:

██████╗ ███████╗███╗   ██╗████████╗ ██████╗ ███╗   ███╗██╗
██╔══██╗██╔════╝████╗  ██║╚══██╔══╝██╔═══██╗████╗ ████║██║
██████╔╝█████╗  ██╔██╗ ██║   ██║   ██║   ██║██╔████╔██║██║
██╔══██╗██╔══╝  ██║╚██╗██║   ██║   ██║   ██║██║╚██╔╝██║██║
██████╔╝███████╗██║ ╚████║   ██║   ╚██████╔╝██║ ╚═╝ ██║███████╗
╚═════╝ ╚══════╝╚═╝  ╚═══╝   ╚═╝    ╚═════╝ ╚═╝     ╚═╝╚══════╝
 
Successfully built Bento(tag="qwen_vl_chat:okjy4hcnnoineaav").

Next, use the command below to generate the container image. *** Before running it, a trick is needed to speed up image building: BentoML does not include a pip mirror in its Dockerfile template, so pre-loading packages is very slow. Modify base.j2 (located at ~/miniforge3/envs/[your-conda-env-name]/lib/python3.10/site-packages/bentoml/_internal/container/frontend/dockerfile/templates):

# Add the pip mirror at line 66
{% call common.RUN(__enable_buildkit__) -%} {{ __pip_cache__ }} {% endcall -%} pip3 install {{value | bash_quote}} -i https://mirrors.bfsu.edu.cn/pypi/web/simple; exit 0

*** If building on a server, you may also need to install docker-buildx with: apt install docker-buildx

The command to generate the container image is below. After it completes, docker images shows a very large image file, since the model is baked into it:

bentoml build --containerize
# or bentoml containerize qwen_vl_chat:3x3x6psogkvnkaaw 
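Before pushing, the image can be smoke-tested locally (a hypothetical check, assuming a GPU host with the NVIDIA container runtime; the serve arguments mirror those used in the k8s deployment below):

docker run --rm --gpus all -p 3000:3000 qwen_vl_chat:3x3x6psogkvnkaaw serve --host 0.0.0.0 --port 3000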
  • Next, push the image to the private registry:
docker tag qwen_vl_chat:3x3x6psogkvnkaaw cr.zw1.paratera.com/zhoujg/aps-explore-model-infer:latest
docker push cr.zw1.paratera.com/zhoujg/aps-explore-model-infer:latest

Building the UI Image

This UI is an application built to help users work with the model; it is implemented with Gradio.

  • The core code is below. It uses BentoML's client-side API, which shows how complete the stack is, with support from server through client. Implementing the request with Python's requests library would also work (see the sketch after this file).
img2text.py
import gradio as gr
import os
import bentoml
 
os.environ['GRADIO_TEMP_DIR'] = '/data/gradio-data'
 
 
def generate(img, what):
    # print(f"==========>> {img}")
 
    with bentoml.SyncHTTPClient("http://localhost:3000") as client:
        result = client.generate(
            # image_path="/workspace/bds/imgs/horse.jpg",
            image_path=img,
            do_what=what)
        return result
    # return ""
 
 
# Create the Gradio interface
with gr.Blocks(title="图生文示例-BentoML演示") as demo:
    with gr.Row():
        with gr.Column():
            img_show = gr.Image(label="上传图片", type='filepath', height=480)
            do_what = gr.Text(label="图生文的要求", placeholder="")
            button = gr.Button("提交")
 
        with gr.Column():
            text_area = gr.TextArea(label="文字输出")
 
    button.click(fn=generate, inputs=[img_show, do_what], outputs=text_area)
 
    gr.Examples(
        fn=generate,
        examples=[["./example/horse.jpg", "请用曹雪芹文体描述下图中内容,不少于200字"],
                  ["./example/horse.jpg", "请用鲁迅文体描述下图中内容,不少于200字"],
                  ["./example/horse.jpg", "请用文言文文体描述下图中内容,不少于200字"]],
        inputs=[img_show, do_what],
        outputs=text_area,
        # cache_examples=True
        run_on_click=True)
 
demo.launch(server_name="0.0.0.0", server_port=3100)
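For reference, the same request with plain requests instead of the BentoML client (a minimal sketch, assuming BentoML's default routing, where each @bentoml.api method is exposed as a POST route named after the method):

import requests
 
# POST to the /generate route of the QwenVLChat service
resp = requests.post(
    "http://localhost:3000/generate",
    json={
        "image_path": "./example/horse.jpg",  # assumed path; must be readable by the server
        "do_what": "请用文言文文体描述下图中内容",
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.text)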
  • Below is the Dockerfile:
Dockerfile
FROM python:3.10-slim


USER root

RUN pip config set global.index-url https://mirrors.bfsu.edu.cn/pypi/web/simple \
    && pip config set install.trusted-host mirrors.bfsu.edu.cn

RUN mkdir -p /root/code && mkdir -p /root/example
COPY requirements.txt /root
COPY code/*.py /root/code
COPY example/*.jpg /root/example

WORKDIR /root
RUN pip install -r requirements.txt

EXPOSE 3100
ENTRYPOINT [ "sh", "-c", "python code/img2text.py" ]

Then build the image and push it to the private registry:

docker build --no-cache -t aps-explore-biz-show .
docker tag aps-explore-biz-show:latest cr.zw1.paratera.com/zhoujg/aps-explore-biz-show:latest
docker push cr.zw1.paratera.com/zhoujg/aps-explore-biz-show:latest
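Optionally, the UI image can be smoke-tested locally before deploying (a hypothetical check: the backend call will fail without the inference service running, but the page itself should come up; a host directory is mounted so that GRADIO_TEMP_DIR=/data/gradio-data exists):

docker run --rm -p 3100:3100 -v /tmp/gradio-data:/data/gradio-data aps-explore-biz-show:latest
# then open http://localhost:3100 in a browser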

k8s Deployment

Here both the inference service and the UI service are deployed in one pod; the relevant configuration files are below.

Both services go into a single pod and are distinguished by container. Because the two containers share the pod's network namespace, the UI container reaches the inference service at localhost:3000, which is exactly the address hardcoded in img2text.py.

deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bentoml
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: bentoml
  replicas: 1 
  template:
    metadata:
      labels:
        app: bentoml
    spec:
      containers:
        - name: aps-explore-model-infer
          image: cr.zw1.paratera.com/zhoujg/aps-explore-model-infer:latest
          args: ["serve", "--host", "0.0.0.0", "--port", "3000"]
          # command: ["/bin/sh"]
          # args: ["-c", "while true; do echo hello; sleep 10;done"]
          ports:
          - containerPort: 3000
          resources:
            limits:
              cpu: 2
              memory: 6Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: data
              subPath: bentoml
              mountPath: "/data"
        - name: aps-explore-biz-show
          image: cr.zw1.paratera.com/zhoujg/aps-explore-biz-show:latest
          ports:
          - containerPort: 3100
          resources:
            limits:
              cpu: 1
              memory: 2Gi
          volumeMounts:
            - name: data
              subPath: bentoml
              mountPath: "/data"              
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: jupyter
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu
                operator: In
                values:
                  - rtx-3090
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: bentoml
spec:
  selector:
    app: bentoml
  type: ClusterIP
  ports:
    - protocol: TCP
      name: http-aps-explore-model-infer
      port: 3000
      targetPort: 3000
    - protocol: TCP
      name: http-aps-explore-biz-show
      port: 3100
      targetPort: 3100   
ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bentoml
spec:
  ingressClassName: nginx
  rules:
    - host: aps-explore-img2txt.poc1-be9e3e9b62c8.ing.zw1.paratera.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: bentoml
                port:
                  number: 3100

Deploy via kustomize:

kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - ingress.yaml
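With the kustomization in place, the whole stack can be applied and verified in one step (assuming the three manifests sit next to kustomization.yaml in the current directory):

kubectl apply -k .
kubectl get pods -l app=bentoml   # expect READY 2/2 once both containers are up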

Using the Application

The interface works as follows: upload an image, enter the image-to-text instruction, and click the submit button to produce different text outputs. Clicking an entry under Examples runs that example as a demo.