This topic describes the built-in alert monitoring rules used to monitor ACK clusters.

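After you enable the alerting feature in Container Service, Container Service stores the event logs of the ACK cluster in a Logstore named k8s-event under the target Log Service project. It also synchronizes its built-in alert monitoring rules to the Log Service alert center, where they monitor that Logstore.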
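
The built-in queries address event fields by dotted names such as eventId.reason and eventId.involvedObject.name. As a rough illustration only, the following Python sketch flattens a nested Kubernetes-style event into such dotted keys. The `flatten` helper, the sample event, and the idea that the Logstore exposes fields in exactly this shape are assumptions made for illustration, not details confirmed by this topic.

```python
from typing import Any, Dict


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical event record; the field names mirror those used by the built-in queries.
sample_event = {
    "level": "Warning",
    "hostname": "node-1",
    "eventId": {
        "reason": "AddNodeFailed",
        "message": "example message",
        "metadata": {"namespace": "default"},
        "involvedObject": {"kind": "Node", "name": "node-1"},
    },
}

flat_event = flatten(sample_event)
```

In this sketch, `flat_event["eventId.reason"]` holds the value that a search condition such as `eventId.reason : AddNodeFailed` would be compared against.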

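The following table lists the built-in ACK alert monitoring rules. If you need further customization, you can create custom alert monitoring rules. For details, see Create a log alert monitoring rule.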

Rule ID Rule name Description Query and analysis statement Trigger condition Group evaluation
sls_app_ack_ccm_at_add_node_fail Failed to add node Checks every 5 minutes; triggers when a node addition failure event exists (kubernetes add node failed). eventId.reason : AddNodeFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_ccm_at_create_route_fail Failed to create route Checks every 5 minutes; triggers when a route creation failure event exists (kubernetes create route failed). eventId.reason : CreateRouteFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_ccm_at_del_node_fail Failed to delete node Checks every 5 minutes; triggers when a node deletion failure event exists (kubernetes delete node failed). eventId.reason : DeleteNodeFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_ccm_at_del_slb_fail Failed to delete LoadBalancer Checks every 5 minutes; triggers when a LoadBalancer deletion failure event exists (kubernetes slb delete failed). eventId.reason : DeleteLoadBalancerFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_ccm_at_no_ava_slb No available LoadBalancer Checks every 5 minutes; triggers when a no-available-LoadBalancer event exists (kubernetes slb not available). eventId.reason : UnAvailableLoadBalancer | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_ccm_at_sync_route_fail Failed to synchronize route Checks every 5 minutes; triggers when a route synchronization failure event exists (kubernetes sync route failed). eventId.reason : SyncRouteFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_ccm_at_sync_slb_fail Failed to synchronize LoadBalancer Checks every 5 minutes; triggers when a LoadBalancer synchronization failure event exists (kubernetes slb sync failed). eventId.reason : SyncLoadBalancerFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_csi_at_device_busy Mount point in use by a process, unmount failed Checks every 5 minutes; triggers when an event exists in which unmounting failed because the mount point is in use by a process (kubernetes csi disk device busy). eventId.reason : DeviceBusy | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_csi_at_disk_iohang Cloud disk IO hang Checks every 5 minutes; triggers when a cloud disk IO hang event exists (kubernetes csi ioHang). eventId.reason : DeviceBusy | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_csi_at_disk_no_portable Container volumes do not currently support subscription cloud disks Checks every 5 minutes; triggers when an event exists indicating that container volumes do not currently support subscription cloud disks (kubernetes csi not portable). eventId.reason : ProvisioningFailed and eventId.message : DiskNotPortable | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_csi_at_invalid_disk_size Cloud disk size does not meet requirements (less than 20Gi) Checks every 5 minutes; triggers when an event exists indicating that the cloud disk size does not meet requirements because it is less than 20Gi (kubernetes csi invalid disk size). eventId.reason : ProvisioningFailed and eventId.message : InvalidDiskSize | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_csi_at_latency_too_high Slow IO on the PVC bound to a disk Checks every 5 minutes; triggers when a slow IO event exists on the PVC bound to a disk (kubernetes csi pvc latency load too high). eventId.reason : LatencyTooHigh | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_csi_at_no_ava_disk No available cloud disk Checks every 5 minutes; triggers when a no-available-cloud-disk event exists (kubernetes csi no available disk). eventId.reason : ResourceInvalid and eventId.message : "get disk" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_csi_at_no_enough_disk_space Disk usage exceeds the watermark threshold Checks every 5 minutes; triggers when an event exists indicating that disk usage exceeds the watermark threshold (kubernetes csi not enough disk space). eventId.reason : NotEnoughDiskSpace | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_common_at_docker_hung Docker process exception on a cluster node Checks every 5 minutes; triggers when a Docker process exception event exists on a cluster node (kubernetes node docker hang). eventId.reason:DockerHung or eventId.reason: "docker daemon is offline" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_common_err K8s generic Error event Checks every 5 minutes; triggers when a generic cluster Error event exists (kubernetes cluster error event). level : Error | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
sls_app_ack_common_at_eviction Cluster eviction event Checks every 5 minutes; triggers when a cluster eviction event exists (kubernetes eviction event). * | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log where "eventId.reason" like '%Evict%' GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_gpu_xid_error Cluster GPU XID error event Checks every 5 minutes; triggers when a cluster GPU XID error event exists (kubernetes gpu xid error event). eventId.reason : NodeHasNvidiaXidError | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_k8s_image_pull_fail Cluster image pull failure Checks every 5 minutes; triggers when a cluster image pull failure event exists (kubernetes image pull back off event). eventId.reason : Failed and eventId.message : ImagePullBackOff | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name Data matched, cnt > 0 Custom labels: namespace, pod_name, node_name
sls_app_ack_ingress_at_err_reload_nginx Ingress configuration reload failure Checks every 5 minutes; triggers when an Ingress configuration reload failure event exists (kubernetes ingress reload config error). eventId.reason : RELOAD and eventId.message : "Error reloading NGINX" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name Data matched, cnt > 0 Custom labels: namespace, pod_name, node_name
sls_app_ack_common_at_k8s_no_ip Insufficient IP resources on a cluster node Checks every 5 minutes; triggers when an event exists indicating insufficient IP resources on a cluster node (kubernetes ip not enough event). InvalidVSwitchId.IpNotEnough or IpNotEnough | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_destr_node_fail Managed node pool failed to destroy a node Checks every 5 minutes; triggers when an event exists in which the managed node pool failed to destroy a node (kubernetes node pool nlc destroy node failed). eventId.reason : "NLC.Task.DestroyNode.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_drain_node_fail Managed node pool failed to drain a node Checks every 5 minutes; triggers when an event exists in which the managed node pool failed to drain a node (kubernetes node pool nlc drain node failed). eventId.reason : "NLC.Task.DrainNode.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_emp_task_cmd Managed node pool task has no command specified Checks every 5 minutes; triggers when an event exists in which a managed node pool task has no command specified (kubernetes node pool nlc delete node failed: EmptyTaskCommand). eventId.reason : "NLC.Task.EmptyTaskCommand" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_op_not_found Unknown repair operation in managed node pool Checks every 5 minutes; triggers when an event exists in which the managed node pool encountered an unknown repair operation (kubernetes node pool nlc delete node failed: Task.Operation.NotFound). eventId.reason : "NLC.Task.Operation.NotFound" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_repair_fail Managed node pool self-healing task failed Checks every 5 minutes; triggers when an event exists in which a managed node pool self-healing task failed (kubernetes node pool nlc self repair failed). eventId.reason : "NLC.AutoRepairTask.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_reset_ecs_fail Managed node pool failed to reset ECS Checks every 5 minutes; triggers when an event exists in which the managed node pool failed to reset ECS (kubernetes node pool nlc reset ecs failed). eventId.reason : "NLC.Task.ResetECS.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_restart_ecs_fail Managed node pool failed to restart ECS Checks every 5 minutes; triggers when an event exists in which the managed node pool failed to restart ECS (kubernetes node pool nlc restart ecs failed). eventId.reason : "NLC.Task.RestartECS.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_restart_ecs_wait_fail Managed node pool ECS restart did not reach a final state Checks every 5 minutes; triggers when an event exists in which an ECS restart in the managed node pool did not reach a final state (kubernetes node pool nlc restart ecs wait timeout). eventId.reason : "NLC.Task.RestartECS.WaitNodeReady.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_runcommand_fail Managed node pool command execution failed Checks every 5 minutes; triggers when an event exists in which command execution failed in the managed node pool (kubernetes node pool nlc run command failed). eventId.reason : "NLC.Task.RunCommand.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_nlc_at_url_mode_unimpl Unimplemented task mode in managed node pool Checks every 5 minutes; triggers when an event exists in which the managed node pool encountered an unimplemented task mode (kubernetes node pool nlc delete node failed: Task.URL.Mode.Unimplemented). eventId.reason : "NLC.Task.URL.Mode.Unimplemented" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_k8s_no_disk Insufficient disk space on a cluster node Checks every 5 minutes; triggers when an event exists indicating insufficient disk space on a cluster node (kubernetes node disk pressure event). eventId.reason : NodeHasDiskPressure | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_node_down Cluster node offline Checks every 5 minutes; triggers when a cluster node offline event exists (kubernetes node down event). eventId.reason: NodeNotReady and eventId.message: "status is now: NodeNotReady" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_node_fd_pressure Too many file handles on a cluster node Checks every 5 minutes; triggers when an event exists indicating too many file handles on a cluster node (kubernetes node fd pressure event). eventId.reason : NodeHasFDPressure | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_node_pid_pressure Too many processes on a cluster node Checks every 5 minutes; triggers when an event exists indicating too many processes on a cluster node (kubernetes node pid pressure event). eventId.reason : PIDPressure or eventId.reason : NodeHasPIDPressure | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_k8s_pleg_warn PLEG exception on a cluster node Checks every 5 minutes; triggers when a PLEG exception event exists on a cluster node (kubernetes node pleg error event). eventId.message : "PLEG is not healthy" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_node_restart Cluster node restart Checks every 5 minutes; triggers when a cluster node restart event exists (kubernetes node restart event). eventId.reason : NodeRebooted or eventId.reason : Rebooted | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_k8s_time_sync_err Time service exception on a cluster node Checks every 5 minutes; triggers when a time service exception event exists on a cluster node (kubernetes node ntp down). eventId.reason : NTPIsDown | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_k8s_pod_start_fail Cluster pod startup failure Checks every 5 minutes; triggers when a cluster pod startup failure event exists (kubernetes pod start failed event). eventId.reason : Failed and eventId.involvedObject.kind : Pod not eventId.message : ImagePullBackOff | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name Data matched, cnt > 0 Custom labels: namespace, pod_name, node_name
sls_app_ack_common_at_k8s_pod_oom Cluster pod OOM Checks every 5 minutes; triggers when a cluster pod OOM event exists (kubernetes pod oom event). eventId.reason : PodOOMKilling | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name Data matched, cnt > 0 Custom labels: namespace, pod_name, node_name
sls_app_ack_common_at_k8s_ps_hung Process exception on a cluster node Checks every 5 minutes; triggers when a process exception event exists on a cluster node (kubernetes ps process hang event). eventId.reason : PSProcessIsHung | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_no_resource Insufficient schedulable resources on cluster nodes Checks every 5 minutes; triggers when an event exists indicating insufficient schedulable resources on cluster nodes (kubernetes node resource insufficient). eventId.reason : FailedScheduling and Insufficient | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name Data matched, cnt > 0 Custom labels: namespace, pod_name, node_name
sls_app_ack_si_at_conf_high_risk Security inspection found a high-risk configuration Checks every 5 minutes; triggers when an event exists in which a security inspection found a high-risk configuration (kubernetes high risks have been found after running config audit). eventId.reason : SecurityInspectorConfigAuditHighRiskFound | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name Data matched, cnt > 0 Custom labels: namespace, pod_name, node_name
sls_app_ack_terway_at_alloc_ip_fail Terway failed to allocate an IP address Checks every 5 minutes; triggers when an event exists in which Terway failed to allocate an IP address (kubernetes terway allocate ip error). eventId.reason : AllocIPFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_terway_at_allocate_fail Terway failed to allocate network resources Checks every 5 minutes; triggers when an event exists in which Terway failed to allocate network resources (kubernetes allocate resource error). eventId.reason : AllocResourceFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_terway_at_config_check Terway triggered a Pod IP configuration check Checks every 5 minutes; triggers when an event exists in which Terway triggered a Pod IP configuration check (kubernetes terway execute pod ip config check). eventId.reason : ConfigCheck | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as pod_name, hostname as node_name, COUNT(*) as cnt from log GROUP by namespace, pod_name, node_name Data matched, cnt > 0 Custom labels: namespace, pod_name, node_name
sls_app_ack_terway_at_dispose_fail Terway failed to reclaim network resources Checks every 5 minutes; triggers when an event exists in which Terway failed to reclaim network resources (kubernetes dispose resource error). eventId.reason : DisposeResourceFailed | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_terway_at_invalid_resource Terway resource invalid Checks every 5 minutes; triggers when a Terway invalid resource event exists (kubernetes terway have invalid resource). eventId.reason : ResourceInvalid | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_terway_at_parse_fail Failed to parse Ingress bandwidth configuration Checks every 5 minutes; triggers when an event exists in which parsing the Ingress bandwidth configuration failed (kubernetes terway parse k8s.aliyun.com/ingress-bandwidth annotation error). eventId.reason : ParseFailed and eventId.message : "Parse ingress bandwidth failed" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_terway_at_vir_mode_change Terway virtual mode changed Checks every 5 minutes; triggers when a Terway virtual mode change event exists (kubernetes virtual mode changed). eventId.reason : VirtualModeChanged | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.name" as node_name, COUNT(*) as cnt from log GROUP by namespace, node_name Data matched, cnt > 0 Custom labels: namespace, node_name
sls_app_ack_common_at_common_warn K8s generic Warn event Checks every 5 minutes; triggers when a generic K8s Warn event exists (kubernetes cluster warn event). level : Warning and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail" | SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt from log GROUP by namespace, kind, object_name Data matched, cnt > 0 Custom labels: namespace, kind, object_name
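
Every built-in rule follows the same pattern: a search condition selects events, the SELECT clause groups them by the label fields and counts them, and the rule fires for each group whose cnt is greater than 0. The following Python sketch mimics that evaluation locally on event records that are already flattened into dotted field names. It is an approximation for illustration only: `evaluate_rule`, the sample events, and the exact-match predicate are assumptions, and Log Service search matching may behave differently from a plain equality test.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]


def evaluate_rule(
    events: List[Event],
    matches: Callable[[Event], bool],
    label_fields: Dict[str, str],
) -> List[Dict[str, object]]:
    """Group matching events by label fields and return one alert per group.

    label_fields maps a label name (e.g. "node_name") to the event field it
    is read from (e.g. "eventId.involvedObject.name").
    """
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for event in events:
        if matches(event):
            key = tuple(event.get(field, "") for field in label_fields.values())
            groups[key].append(event.get("eventId.message", ""))

    alerts: List[Dict[str, object]] = []
    for key, messages in groups.items():
        cnt = len(messages)
        # Trigger condition from the table: data matched, cnt > 0.
        # Groups only exist for matched events, so this check mirrors the rule.
        if cnt > 0:
            alert: Dict[str, object] = dict(zip(label_fields.keys(), key))
            alert["message"] = messages  # analogous to ARRAY_AGG("eventId.message")
            alert["cnt"] = cnt
            alerts.append(alert)
    return alerts


# Hypothetical, already-flattened event records.
events = [
    {"eventId.reason": "AddNodeFailed", "eventId.message": "m1",
     "eventId.metadata.namespace": "default", "eventId.involvedObject.name": "node-a"},
    {"eventId.reason": "AddNodeFailed", "eventId.message": "m2",
     "eventId.metadata.namespace": "default", "eventId.involvedObject.name": "node-a"},
    {"eventId.reason": "NodeRebooted", "eventId.message": "m3",
     "eventId.metadata.namespace": "default", "eventId.involvedObject.name": "node-b"},
]

alerts = evaluate_rule(
    events,
    matches=lambda e: e.get("eventId.reason") == "AddNodeFailed",
    label_fields={
        "namespace": "eventId.metadata.namespace",
        "node_name": "eventId.involvedObject.name",
    },
)
```

With the sample data, only the two AddNodeFailed events for node-a match, so a single alert is produced for the label group (default, node-a). Group evaluation by custom labels means that each distinct label combination is assessed separately in this way.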