使用Python探针为vLLM与SGLang推理引擎开启应用监控-应用实时监控服务-阿里云

应用监控的Python探针新增了vLLM/SGLang插件，支持对vLLM/SGLang推理引擎进行可观测。

说明

ARMS目前仅支持对 vLLM/SGLang 框架进行可观测。

PAI-EAS接入

模型在线服务EAS（Elastic Algorithm Service）是PAI产品为实现一站式模型开发部署应用，针对在线推理场景提供的模型在线服务，支持将模型服务部署在公共资源组或专属资源组，实现基于异构硬件（CPU和GPU）的模型加载和数据请求的实时响应。

步骤一：准备环境变量

export ARMS_APP_NAME=xxx   # EAS应用名称。
export ARMS_REGION_ID=xxx   # 对应的阿里云账号的RegionID。
export ARMS_LICENSE_KEY=xxx   # 阿里云 LicenseKey。

步骤二：修改PAI-EAS运行命令

登录PAI控制台，在页面上方选择目标地域，然后进入目标工作空间。
在左侧导航栏选择模型部署 > 模型在线服务（EAS）。
在推理服务页签选择要接入模型观测的应用，然后单击操作列的更新。

修改运行命令。

以接入DeepSeek-R1-Distill-Qwen-7B模型为例。

vLLM原始指令：

gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

接入应用监控对应的vLLM指令：

gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

新增部分说明：

配置 pipy 仓库, 可以根据实际情况调整。

pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;

下载探针安装器。
```
pip3 install aliyun-bootstrap;
```
使用安装器安装探针。
请根据实际地域替换cn-hangzhou。
```
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
```

SGLang原始命令：

python -m sglang.launch_server --model-path /model_dir

接入应用监控对应的SGLang指令：

pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir

新增部分说明：

配置 pipy 仓库, 可以根据实际情况调整。

pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;

下载探针安装器。
```
pip3 install aliyun-bootstrap;
```
使用安装器安装探针。
请根据实际地域替换cn-hangzhou。
```
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
```

单击更新。

通用场景模型接入

ARMS 目前只支持官方提供的 vLLM 版本（V0和V1版本）和 SGLang 版本，具体支持的版本范围请参考LLM（大语言模型）服务，用户修改过的版本不支持接入。

ARMS 支持补全和对话两个场景，如果是非流式请求会采集 2 个 Span，流式请求会采集 3 个 Span。

支持的场景	数据处理	采集内容	vLLM V0	vLLM V1	SGLang
chat 对话 or completion 补全	流式	span	http input/output llm_request: key metrics	http input/output	http input/output key metrics reasoning（思考）
	流式	key metrics TTFT/TPOP	支持	不支持	支持
	非流式	span	http input/output	http input/output	http input/output
	非流式	key metrics TTFT/TPOP	不适用	不适用	不适用
Embedding		http	不支持	不支持	不支持
Rerank		http	不支持	不支持	不支持

监控效果

重要Span及Attributes说明

llm_request相关：

Attribute	描述
gen_ai.latency.e2e	端到端的时间
gen_ai.latency.time_in_queue	进入队列的时间
gen_ai.latency.time_in_scheduler	调度时间
gen_ai.latency.time_to_first_token	首包时间
gen_ai.request.id	请求ID