
Configure Prometheus monitoring for KServe to monitor the performance and health of model services


KServe provides a set of default Prometheus metrics to help you monitor the performance and health of model services. This topic uses the sklearn-iris scikit-learn model as an example to describe how to configure Prometheus monitoring for the KServe framework.

Prerequisites

Step 1: Deploy a KServe application

  1. Run the following command to deploy a scikit-learn-based KServe application.

    arena serve kserve \
        --name=sklearn-iris \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
        --cpu=1 \
        --memory=200Mi \
        --enable-prometheus=true \
        --metrics-port=8080 \
        "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

    Expected output:

    service/sklearn-iris-metric-svc created # The Service named sklearn-iris-metric-svc is created.
    inferenceservice.serving.kserve.io/sklearn-iris created # The KServe InferenceService resource sklearn-iris is created.
    servicemonitor.monitoring.coreos.com/sklearn-iris-svcmonitor created # The ServiceMonitor resource is created so that Prometheus can collect the metrics exposed by the sklearn-iris-metric-svc Service.
    INFO[0004] The Job sklearn-iris has been submitted successfully # The job has been submitted to the cluster.
    INFO[0004] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status

    The output indicates that Arena has started deploying a KServe service that serves a scikit-learn model, with Prometheus monitoring integrated.

  2. Run the following command to write the following JSON content to the ./iris-input.json file to prepare the input for inference requests.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
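The same KServe v1 "instances" payload can also be produced programmatically. A minimal Python sketch (illustrative only, not part of the tutorial tooling) that builds, validates, and writes the same file:

```python
import json

# Two iris samples; each row is sepal length, sepal width,
# petal length, petal width (the 4 features the model expects).
payload = {
    "instances": [
        [6.8, 2.8, 4.8, 1.4],
        [6.0, 3.4, 4.5, 1.6],
    ]
}

# Sanity check: every instance must be a 4-feature numeric row.
assert all(len(row) == 4 for row in payload["instances"])

# Write the same file the tutorial creates with the heredoc.
with open("iris-input.json", "w") as f:
    json.dump(payload, f, indent=2)
```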
  3. Run the following commands to retrieve the IP address of the Nginx Ingress gateway and the hostname portion of the externally accessible URL of the InferenceService from the cluster.

    NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
  4. Run the following command to use the load-testing tool Hey to send requests to the service repeatedly and generate monitoring data.

    Note

    For a detailed introduction to the Hey load-testing tool, see Hey.

    hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict

    Expected output:


    Summary:
      Total:	120.0296 secs
      Slowest:	0.1608 secs
      Fastest:	0.0213 secs
      Average:	0.0275 secs
      Requests/sec:	727.3875
      
      Total data:	1833468 bytes
      Size/request:	21 bytes
    
    Response time histogram:
      0.021 [1]	|
      0.035 [85717]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      0.049 [1272]	|■
      0.063 [144]	|
      0.077 [96]	|
      0.091 [44]	|
      0.105 [7]	|
      0.119 [0]	|
      0.133 [0]	|
      0.147 [11]	|
      0.161 [16]	|
    
    
    Latency distribution:
      10% in 0.0248 secs
      25% in 0.0257 secs
      50% in 0.0270 secs
      75% in 0.0285 secs
      90% in 0.0300 secs
      95% in 0.0315 secs
      99% in 0.0381 secs
    
    Details (average, fastest, slowest):
      DNS+dialup:	0.0000 secs, 0.0213 secs, 0.1608 secs
      DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
      req write:	0.0000 secs, 0.0000 secs, 0.0225 secs
      resp wait:	0.0273 secs, 0.0212 secs, 0.1607 secs
      resp read:	0.0001 secs, 0.0000 secs, 0.0558 secs
    
    Status code distribution:
      [200]	87308 responses

    The output summarizes the performance of the service during the test, including key metrics such as throughput, data volume, and response latency, which help you evaluate the efficiency and stability of the service.
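    The derived figures in the hey summary can be cross-checked against the raw totals. A small Python sketch that reproduces them:

```python
# Raw totals taken from the hey summary above.
total_secs = 120.0296    # Total
responses = 87308        # [200] responses
total_bytes = 1833468    # Total data

# Requests/sec is the response count divided by wall-clock time.
rps = responses / total_secs
print(round(rps, 1))             # ~727.4, matching the reported Requests/sec

# Size/request is total data divided by the response count.
print(total_bytes // responses)  # 21 bytes, as reported
```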

  5. (Optional) Manually fetch the application metrics to confirm that they are exposed as expected.

    The following steps demonstrate how to collect metrics from the sklearn-iris Pod in an ACK cluster and view the data locally, without logging in to the Pod or exposing the Pod's port to an external network.

    1. Run the following commands to forward port 8080 of the Pod whose name contains sklearn-iris (specified by the $POD_NAME variable) to port 8080 on your local host. Requests sent to local port 8080 are then transparently forwarded to port 8080 of the Pod.

      # Obtain the Pod name.
      POD_NAME=`kubectl get po|grep sklearn-iris |awk -F ' ' '{print $1}'`
      # Use port-forward to forward port 8080 of the Pod to the local host.
      kubectl port-forward pod/$POD_NAME 8080:8080

      Expected output:

      Forwarding from 127.0.0.1:8080 -> 8080
      Forwarding from [::1]:8080 -> 8080

      The output indicates that local connections over both IPv4 and IPv6 are correctly forwarded to port 8080 of the Pod.

    2. Enter the following address in your browser to access port 8080 of the Pod and view the metrics.

      http://localhost:8080/metrics

      Expected output:


      # HELP python_gc_objects_collected_total Objects collected during gc
      # TYPE python_gc_objects_collected_total counter
      python_gc_objects_collected_total{generation="0"} 10298.0
      python_gc_objects_collected_total{generation="1"} 1826.0
      python_gc_objects_collected_total{generation="2"} 0.0
      # HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
      # TYPE python_gc_objects_uncollectable_total counter
      python_gc_objects_uncollectable_total{generation="0"} 0.0
      python_gc_objects_uncollectable_total{generation="1"} 0.0
      python_gc_objects_uncollectable_total{generation="2"} 0.0
      # HELP python_gc_collections_total Number of times this generation was collected
      # TYPE python_gc_collections_total counter
      python_gc_collections_total{generation="0"} 660.0
      python_gc_collections_total{generation="1"} 60.0
      python_gc_collections_total{generation="2"} 5.0
      # HELP python_info Python platform information
      # TYPE python_info gauge
      python_info{implementation="CPython",major="3",minor="9",patchlevel="18",version="3.9.18"} 1.0
      # HELP process_virtual_memory_bytes Virtual memory size in bytes.
      # TYPE process_virtual_memory_bytes gauge
      process_virtual_memory_bytes 1.406291968e+09
      # HELP process_resident_memory_bytes Resident memory size in bytes.
      # TYPE process_resident_memory_bytes gauge
      process_resident_memory_bytes 2.73207296e+08
      # HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
      # TYPE process_start_time_seconds gauge
      process_start_time_seconds 1.71533439115e+09
      # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
      # TYPE process_cpu_seconds_total counter
      process_cpu_seconds_total 228.18
      # HELP process_open_fds Number of open file descriptors.
      # TYPE process_open_fds gauge
      process_open_fds 16.0
      # HELP process_max_fds Maximum number of open file descriptors.
      # TYPE process_max_fds gauge
      process_max_fds 1.048576e+06
      # HELP request_preprocess_seconds pre-process request latency
      # TYPE request_preprocess_seconds histogram
      request_preprocess_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_count{model_name="sklearn-iris"} 259709.0
      request_preprocess_seconds_sum{model_name="sklearn-iris"} 1.7146860011853278
      # HELP request_preprocess_seconds_created pre-process request latency
      # TYPE request_preprocess_seconds_created gauge
      request_preprocess_seconds_created{model_name="sklearn-iris"} 1.7153354578475933e+09
      # HELP request_postprocess_seconds post-process request latency
      # TYPE request_postprocess_seconds histogram
      request_postprocess_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_count{model_name="sklearn-iris"} 259709.0
      request_postprocess_seconds_sum{model_name="sklearn-iris"} 1.625360683305189
      # HELP request_postprocess_seconds_created post-process request latency
      # TYPE request_postprocess_seconds_created gauge
      request_postprocess_seconds_created{model_name="sklearn-iris"} 1.7153354578482144e+09
      # HELP request_predict_seconds predict request latency
      # TYPE request_predict_seconds histogram
      request_predict_seconds_bucket{le="0.005",model_name="sklearn-iris"} 259708.0
      request_predict_seconds_bucket{le="0.01",model_name="sklearn-iris"} 259708.0
      request_predict_seconds_bucket{le="0.025",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.05",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.075",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.1",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.25",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="0.75",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="1.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="2.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="5.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="7.5",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="10.0",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_bucket{le="+Inf",model_name="sklearn-iris"} 259709.0
      request_predict_seconds_count{model_name="sklearn-iris"} 259709.0
      request_predict_seconds_sum{model_name="sklearn-iris"} 47.95311741752084
      # HELP request_predict_seconds_created predict request latency
      # TYPE request_predict_seconds_created gauge
      request_predict_seconds_created{model_name="sklearn-iris"} 1.7153354578476949e+09
      # HELP request_explain_seconds explain request latency
      # TYPE request_explain_seconds histogram

      The output shows the performance and status metrics exposed by the application in the Pod, which confirms that the request was forwarded to the application service running inside the Pod.
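The request_*_seconds metrics above are Prometheus histograms, so the mean per-request latency can be derived as _sum divided by _count. A minimal parsing sketch over an excerpt of the exposition text shown above (plain string parsing for illustration; a real scraper would use a Prometheus client library):

```python
# Excerpt from the /metrics output above.
metrics_text = """\
request_predict_seconds_count{model_name="sklearn-iris"} 259709.0
request_predict_seconds_sum{model_name="sklearn-iris"} 47.95311741752084
"""

values = {}
for line in metrics_text.splitlines():
    series, value = line.rsplit(" ", 1)       # 'name{labels}' and the sample value
    values[series.split("{")[0]] = float(value)

# Mean predict latency in seconds: histogram _sum / _count.
avg = values["request_predict_seconds_sum"] / values["request_predict_seconds_count"]
print(f"mean predict latency: {avg * 1000:.2f} ms")   # roughly 0.18 ms per request
```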

Step 2: Query KServe application metrics

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, choose Prometheus Monitoring > Instances to go to the instance list page of Managed Service for Prometheus.

  3. In the top navigation bar, select the region where your Container Service for Kubernetes (ACK) cluster resides, and then click the name of the target instance of the Prometheus for Container Service type to go to the instance page.

  4. In the left-side navigation pane, click Dashboards, and then click the Kubernetes Pod dashboard to go to the Grafana page.

  5. In the left-side navigation pane, click Explore, and enter the query statement request_predict_seconds_bucket to query the application metric values.

    Note

    Metric collection has a delay of about 5 minutes.

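In Explore, the raw request_predict_seconds_bucket series can also be aggregated with PromQL. For example (illustrative queries, assuming the default KServe metric names shown earlier):

```promql
# Per-model predict request rate over the last 5 minutes
sum(rate(request_predict_seconds_count[5m])) by (model_name)

# 99th-percentile predict latency, aggregated across Pods
histogram_quantile(0.99, sum(rate(request_predict_seconds_bucket[5m])) by (le, model_name))
```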

FAQ

Issue

How do I confirm that the request_predict_seconds_bucket metric data has been collected, and what should I do if it has not?

Solution

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, choose Prometheus Monitoring > Instances to go to the instance list page of Managed Service for Prometheus.

  3. In the top navigation bar, select the region where your Container Service for Kubernetes (ACK) cluster resides, and then click the name of the target instance of the Prometheus for Container Service type to go to the instance page.

  4. In the left-side navigation pane, click Service Discovery, and then click the Targets tab. If default/sklearn-iris-svcmonitor/0 (1/1 up) is displayed, the metric data has been collected.

    If the metric data has not been collected, submit a ticket to obtain technical support.

References

For information about the default metrics provided by the KServe framework, see the community documentation KServe Prometheus Metrics.