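To collect metrics from a specific GPU-HPN node or virtual node in an ACS cluster, ACS exposes different types of metrics on several collection endpoints. You can modify your Prometheus monitoring configuration to collect metrics from the target node.
Feature description
By design, multiple virtual nodes in the same ACS cluster share a single IP address. As a result, a request for the data of one virtual node returns the full data of all virtual nodes. Common Prometheus scrape configurations collect metrics from every node through the kubelet Service, so the same metrics are collected once per node and appear duplicated.
To solve this problem, ACS supports filtering metrics by node name. The response then contains only the Pod and node data that belongs to the specified node. Details are as follows.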
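| Collection endpoint | Parameter | Metric type |
| --- | --- | --- |
| /metrics/cadvisor | nodeName: name of the target node | Pod-level usage metrics (CPU, memory, GPU, and so on) for Pods on the target node. |
| /metrics/node | nodeName: name of the target node | Important: supported only on GPU-HPN nodes. Node-level usage metrics (CPU, memory, GPU, and so on). For the specific metrics, see ACS GPU-HPN node-level monitoring metrics. |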
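The effect of the node-name filter can be pictured with a small simulation. This is only a sketch, not ACS code: the node names, Pod names, sample values, and the `scrape` helper are all hypothetical. It models only the behavior described above, where a shared IP returns data for every virtual node unless a node name is given.

```python
# Hypothetical model: three virtual nodes behind one shared IP.
# Each tuple is (node, pod, cpu_usage) and stands in for one metric series.
ALL_SERIES = [
    ("virtual-node-a", "pod-1", 0.20),
    ("virtual-node-a", "pod-2", 0.35),
    ("virtual-node-b", "pod-3", 0.10),
    ("virtual-node-c", "pod-4", 0.50),
]
NODES = ["virtual-node-a", "virtual-node-b", "virtual-node-c"]


def scrape(node_name=None):
    """Return the series that one scrape would see.

    Without node_name, the shared endpoint returns everything.
    With node_name, only the series of that node are returned.
    """
    if node_name is None:
        return list(ALL_SERIES)
    return [series for series in ALL_SERIES if series[0] == node_name]


# One scrape per node without the filter: every series is collected 3 times.
unfiltered = [series for _ in NODES for series in scrape()]
# One scrape per node with the filter: every series is collected once.
filtered = [series for node in NODES for series in scrape(node)]

print("unfiltered:", len(unfiltered), "samples,", len(set(unfiltered)), "distinct")
print("filtered:  ", len(filtered), "samples,", len(set(filtered)), "distinct")
```

In this model the unfiltered scrapes collect each series once per node, while the filtered scrapes collect each series exactly once.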
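Prerequisites
The acs-virtual-node component is v2.12.0-acs.10 or later.
You can check the version of the acs-virtual-node component, or upgrade it, on the Core Components tab, which you reach from the left-side navigation pane of the ACS cluster management page.
Modify the Prometheus monitoring configuration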
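You can modify your Prometheus monitoring configuration to collect metrics from specific virtual nodes. Choose the configuration method that matches your Prometheus setup.
Alibaba Cloud Managed Service for Prometheus
Supported by default. No additional configuration is required.
Upgrade the Prometheus monitoring dashboards and the agent to the latest version so that the complete dashboards are displayed. For upgrade instructions, see How do I upgrade the Prometheus monitoring dashboards of an ACS cluster?
Community Prometheus Operator
If you use the community Prometheus Operator, or ack-prometheus-operator from the ACK marketplace, add the following ServiceMonitor CR.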
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtual-kubelet-acs
  namespace: monitoring
  labels:
    k8s-app: kubelet
    # Add this label so that prometheus-operator manages the resource automatically.
    release: prometheus-operator
spec:
  jobLabel: k8s-app
  selector:
    matchLabels:
      k8s-app: kubelet
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - port: https-metrics-cadvisor
      interval: 15s
      scheme: https
      path: /metrics/cadvisor
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
      relabelings:
        # Add the nodeName query parameter, set to the name of the target node.
        - sourceLabels: [__meta_kubernetes_endpoint_address_target_name]
          targetLabel: __param_nodeName
          replacement: ${1}
          action: replace
    - port: https-metrics-node
      interval: 15s
      scheme: https
      path: /metrics/node
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecureSkipVerify: true
      relabelings:
        # Add the nodeName query parameter, set to the name of the target node.
        - sourceLabels: [__meta_kubernetes_endpoint_address_target_name]
          targetLabel: __param_nodeName
          replacement: ${1}
          action: replace
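The relabeling rule in this configuration relies on a general Prometheus behavior: a target label named `__param_<name>` is sent as the URL query parameter `<name>` when the target is scraped. The following Python sketch models that step in a simplified way. It is not Prometheus code, and the address `10.0.0.8:10250`, the node name `virtual-node-a`, and the helper names `relabel_replace` and `scrape_url` are assumptions made for illustration.

```python
import re
from urllib.parse import urlencode


def relabel_replace(labels, source_labels, target_label,
                    regex="(.*)", replacement="$1", separator=";"):
    """Simplified model of a Prometheus `replace` relabel action."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # Prometheus anchors the regex at both ends
    if match is None:
        return labels  # no match: the labels are left unchanged
    # Expand $1 / ${1} style references with the captured groups.
    new_value = re.sub(r"\$\{?(\d+)\}?",
                       lambda ref: match.group(int(ref.group(1))),
                       replacement)
    result = dict(labels)
    result[target_label] = new_value
    return result


def scrape_url(labels, scheme="https", path="/metrics/cadvisor"):
    """Build the URL that would be scraped from the final target labels."""
    prefix = "__param_"
    params = {key[len(prefix):]: value
              for key, value in labels.items() if key.startswith(prefix)}
    query = urlencode(params)
    url = f"{scheme}://{labels['__address__']}{path}"
    return f"{url}?{query}" if query else url


target = {
    "__address__": "10.0.0.8:10250",  # hypothetical shared IP and port
    "__meta_kubernetes_endpoint_address_target_name": "virtual-node-a",  # hypothetical
}
target = relabel_replace(
    target,
    source_labels=["__meta_kubernetes_endpoint_address_target_name"],
    target_label="__param_nodeName",
    replacement="${1}",
)
print(scrape_url(target))
```

Under these assumptions, the printed URL carries `nodeName=virtual-node-a` as its query string, which is what lets each scrape ask for a single node's data.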
Open-source Prometheus
Locate the Prometheus configuration file (usually /etc/prometheus/prometheus.yml, or a file in your custom configuration directory) and add the following scrape configurations.
scrape_configs:
  # ... other job configurations.
  - job_name: monitoring/acs-virtual-kubelet/cadvisor
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics/cadvisor
    scheme: https
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_k8s_app]
        separator: ;
        regex: kubelet
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https-metrics
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_pod_name]
        separator: ;
        regex: (.*)
        target_label: pod
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_pod_container_name]
        separator: ;
        regex: (.*)
        target_label: container
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_k8s_app]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - separator: ;
        regex: (.*)
        target_label: endpoint
        replacement: https-metrics
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_name]
        separator: ;
        replacement: $1
        action: keep
      # Add the nodeName query parameter, set to the name of the target node.
      - source_labels: [__meta_kubernetes_endpoint_address_target_name]
        separator: ;
        target_label: __param_nodeName
        replacement: ${1}
        action: replace
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - kube-system
  - job_name: monitoring/acs-virtual-kubelet/node
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics/node
    scheme: https
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_k8s_app]
        separator: ;
        regex: kubelet
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https-metrics
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_pod_name]
        separator: ;
        regex: (.*)
        target_label: pod
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_pod_container_name]
        separator: ;
        regex: (.*)
        target_label: container
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_k8s_app]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - separator: ;
        regex: (.*)
        target_label: endpoint
        replacement: https-metrics
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_name]
        separator: ;
        replacement: $1
        action: keep
      # Add the nodeName query parameter, set to the name of the target node.
      - source_labels: [__meta_kubernetes_endpoint_address_target_name]
        separator: ;
        target_label: __param_nodeName
        replacement: ${1}
        action: replace
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - kube-system
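
Several rules in this configuration join two source labels with the `;` separator before matching, and the `keep` rules drop every discovered target that does not match. The sketch below models only these two behaviors for the first three rules of a job. It is a simplified illustration rather than Prometheus code, and all target label values, as well as the `match_labels` helper, are hypothetical.

```python
import re

KIND = "__meta_kubernetes_endpoint_address_target_kind"
NAME = "__meta_kubernetes_endpoint_address_target_name"
APP = "__meta_kubernetes_service_label_k8s_app"
PORT = "__meta_kubernetes_endpoint_port_name"

# Hypothetical endpoints discovered in kube-system (all values are made up).
discovered = [
    {APP: "kubelet", PORT: "https-metrics", KIND: "Node", NAME: "virtual-node-a"},
    {APP: "kubelet", PORT: "http-metrics", KIND: "Node", NAME: "virtual-node-a"},
    {APP: "coredns", PORT: "metrics", KIND: "Pod", NAME: "coredns-0"},
]


def match_labels(labels, source_labels, regex, separator=";"):
    """Join the source label values and match them against an anchored regex."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, value)


kept = []
for labels in discovered:
    # action: keep, regex: kubelet
    if not match_labels(labels, [APP], "kubelet"):
        continue
    # action: keep, regex: https-metrics
    if not match_labels(labels, [PORT], "https-metrics"):
        continue
    # action: replace, regex: Node;(.*), target_label: node
    node_match = match_labels(labels, [KIND, NAME], "Node;(.*)")
    if node_match:
        labels = {**labels, "node": node_match.group(1)}
    kept.append(labels)

for labels in kept:
    print(labels["node"], labels[PORT])
```

Because the regex is anchored at both ends, the `http-metrics` port does not match `https-metrics`, so only the first hypothetical target survives and receives the `node` label.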