Integrating Log Monitoring Capabilities

Integrating with Loki

Declarative log collection

  • Specify the files to collect with the 'loki/logdir' annotation; glob patterns are supported

  • Use kubernetes_sd_config to turn the pod's metadata.labels into Loki labels

Limitations and drawbacks:

  • Loki labels are assigned at the pod level; individual containers cannot be labeled independently
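The label mapping described above can be sketched as a Promtail scrape job. This is a minimal fragment, not the platform's actual configuration: the job name is hypothetical, and the step that builds `__path__` from the `loki/logdir` annotation is platform-specific and left out. It also shows why labels end up pod-scoped: every value comes from pod metadata.

```yaml
# Sketch only; the job name is an assumption, not taken from the platform.
scrape_configs:
- job_name: kubernetes-pods-declarative   # hypothetical job name
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Copy every pod label (metadata.labels) into a Loki label of the same name.
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  # Keep namespace and pod name as labels so streams can be told apart.
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  # The annotation is exposed to relabeling as
  # __meta_kubernetes_pod_annotation_loki_logdir ("/" is sanitized to "_").
  # Deriving __path__ from it is not shown here.
```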

# Pod YAML
metadata:
  annotations:
    # loki/logdir (set the annotation once; the commented lines are alternatives)
    #  emptyDir volumes: ${emptyDir volume name}/${glob filter pattern}; separate multiple entries with ","; glob wildcards are supported, e.g.:
    # loki/logdir: app-log-dir/*.log,app-log-dir2/*
    #  CSI volumes: files under the CSI-mounted volume
    # loki/logdir: myapp/*/log/*.log
    #  CSI volume logs and emptyDir logs can all be configured in one `loki/logdir` annotation
    loki/logdir: myapp/*/log/*.log,app-log-dir/*.log,app-log-dir2/*
spec:
  containers:
  - ...
    volumeMounts:
    # the volume mount name must be `app-log-dir`; only *.log files in the directory are collected
    - name: app-log-dir
      mountPath: <my_app_log_file_dir>
    - name: app-log-dir2
      mountPath: <my_app_log_file_dir>
    # assumes the logs reside in /data/myapp/01/log/*.log and /data/myapp/02/log/*.log
    - name: csi-data-volume-1
      mountPath: /data
  
  volumes:
  # must be an emptyDir
  - name: app-log-dir
    emptyDir: {}
  - name: app-log-dir2
    emptyDir: {}
  - name: csi-data-volume-1
    persistentVolumeClaim:
      # assumes the following claim uses a CSI storage class
      claimName: mypod-volume-0

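Advantages:

Limitations and drawbacks:

  • Multiple replica pods cannot share the same node; if they do, their log files conflict

  • Log files on the hostPath must be deleted manually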

# Pod YAML
spec:
  containers:
  - ...
    volumeMounts:
    # the volume mount name must be `app-log-dir`
    - name: app-log-dir
      mountPath: <my_app_log_file_dir>

  volumes:
  - hostPath:
      # hostPath must be a subdirectory of /var/log/; the host's /var/log/ is mounted into the
      # promtail DaemonSet at /var/log/system-log/
      path: /var/log/<my_app>
      type: Directory
    name: app-log-dir

## <namespace>/configmap/loki-stack-promtail
...
data:
  promtail.yaml: |
    ...
    scrape_configs:
    ...
    - job_name: <my_app>
      pipeline_stages:
      static_configs:
      - labels:
          # Path of the log file(s); glob patterns are supported (e.g., /var/log/**/*.log).
          __path__: /var/log/system-log/<my_app>/<log_file/log_file_patterns>
          job: <my_app>
          # Additional labels to assign to the logs
          # [ <labelname>: <labelvalue> ... ]
          # e.g.:
          # mylabel: myvalue
        targets:
        - localhost

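Advantages:

  • Compatible with the ECP logpilot logging scheme and supports custom labels

  • A solution adapted for secure containers

  • The log volume does not have to be an emptyDir

Drawbacks:

  • One sidecar per pod raises resource consumption (0.01 vCPU / 64 MiB RAM per pod)

  • The Promtail DaemonSet pod enables the Push Server feature and acts as a proxy for the Loki server Push API, mainly so that the Loki server is not put under pressure by a large number of concurrent Promtail agent connections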
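The push-proxy arrangement can be sketched with Promtail's `loki_push_api` scrape target on the DaemonSet side and a `clients` entry on the sidecar side. The job name, ports, label, and address placeholder below are assumptions for illustration, not values taken from the platform.

```yaml
# DaemonSet Promtail (receiver): accept pushes from sidecars and forward them to Loki.
scrape_configs:
- job_name: push-proxy                  # hypothetical job name
  loki_push_api:
    server:
      http_listen_port: 3500            # assumed port
      grpc_listen_port: 3600            # assumed port
    labels:
      pushserver: promtail-daemonset    # hypothetical label added to proxied streams
    use_incoming_timestamp: true        # keep the timestamps set by the sidecar
---
# Sidecar Promtail (sender): push to the DaemonSet instead of the Loki server.
clients:
- url: http://<promtail-daemonset-address>:3500/loki/api/v1/push
```

With this layout the Loki server only sees one connection source per node rather than one per pod.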

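After a log file is entered on the container page during service publishing, the platform adds environment variables and an emptyDir volume to the Deployment/StatefulSet.

  • Environment variables and the emptyDir mount for the log directory; see the following example: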

# example pod
kind: Pod
spec:
  containers:
  - name: some-container
    env:
    # "aliyun_logs_${name}" pattern
    - name: aliyun_logs_myapp
      value: /logs/*.log
    # "aliyun_logs_${name}_tags" pattern
    - name: aliyun_logs_myapp_tags
      value: app=myapp
    volumeMounts:
    - name: logdir
      mountPath: /logs
  volumes:
  - name: logdir
    emptyDir: {}

Configuring alerts

The rule format is the same as Prometheus alerting rules, except that expressions are written in LogQL.

Reference:

Example:

## <namespace>/configmap/loki-stack-alerting-rules
data:
  groups:
    - name: should_fire
      rules:
        - alert: HighPercentageError
          expr: |
            sum(rate({app="foo", env="production"} |= "error" [5m])) by (job)
              /
            sum(rate({app="foo", env="production"}[5m])) by (job)
              > 0.05
          for: 10m
          labels:
            severity: page
          annotations:
            summary: High error percentage
    - name: credentials_leak
      rules: 
        - alert: http-credentials-leaked
          annotations: 
            message: "{{ $labels.job }} is leaking http basic auth credentials."
          expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
          for: 10m
          labels: 
            severity: critical
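For rules like these to be evaluated, the Loki ruler has to be enabled and pointed at the rule files. A minimal sketch of the relevant Loki configuration follows; the directories and the Alertmanager URL are assumptions, and the way the ConfigMap above is mounted into the Loki pod is not shown.

```yaml
ruler:
  storage:
    type: local
    local:
      # Rule files are expected under <directory>/<tenant>/, e.g.
      # /etc/loki/rules/fake/rules.yaml when multi-tenancy is disabled.
      directory: /etc/loki/rules             # assumed mount path
  rule_path: /tmp/loki/rules-temp            # scratch directory used by the ruler
  alertmanager_url: http://alertmanager:9093 # hypothetical Alertmanager address
  ring:
    kvstore:
      store: inmemory
  enable_api: true
```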

Promtail Admission Webhook

  • For pods that carry LogPilot-compatible environment variables, automatically converts those variables into annotations and labels on the pod resource, or injects a Promtail sidecar

## Enable the Promtail sidecar (default: false)
# ENABLE_SIDECAR: 'false'

# Filter sidecar injection by the pod's `spec.runtimeClassName` value, e.g., 'rund' for secure containers; with no filter, every pod that has LogPilot environment variables is adapted (default: "")
# RUNTIME_CLASS_NAME_FILTER:
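To make the conversion concrete, here is what the result might look like for the example pod with `aliyun_logs_myapp` and `aliyun_logs_myapp_tags` shown earlier. This is a guess derived from the `loki/logdir` format described on this page, not the webhook's documented output; the actual annotation and label values may differ.

```yaml
# Hypothetical webhook output for the example pod (sidecar injection disabled).
metadata:
  labels:
    app: myapp                   # from aliyun_logs_myapp_tags: app=myapp
  annotations:
    # emptyDir volume `logdir` is mounted at /logs, and aliyun_logs_myapp is /logs/*.log
    loki/logdir: logdir/*.log
```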

Loki log storage space requirements

References

Alternatives

The alternatives mainly rely on tools that convert logs into Prometheus metrics directly on the container and expose them.

grok-exporter

https://github.com/fstab/grok_exporter

https://help.aliyun.com/document_detail/129387.html#title-mvu-d5x-2pn

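Probably a better fit for Java applications: it ships with a number of built-in patterns, and logs that match them can be used directly.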
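As a rough idea of what using the built-in patterns looks like, the following grok_exporter configuration counts log lines by level. The log path, metric name, and assumed line format (`<ISO8601 timestamp> <LEVEL> <message>`) are illustrative assumptions; `TIMESTAMP_ISO8601`, `LOGLEVEL`, and `GREEDYDATA` come from the pattern files shipped with the tool.

```yaml
global:
  config_version: 3
input:
  type: file
  path: /logs/app.log            # hypothetical log file
  readall: false                 # start at the end of the file and read only new lines
imports:
- type: grok_patterns
  dir: ./patterns                # pattern directory shipped with grok_exporter
metrics:
- type: counter
  name: app_log_lines_total      # hypothetical metric name
  help: Number of application log lines, partitioned by log level.
  match: '%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}'
  labels:
    level: '{{.level}}'
server:
  port: 9144
```

Prometheus can then scrape the exporter's `/metrics` endpoint on port 9144.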

mtail

https://github.com/google/mtail

Very flexible, though configuring the rules may be slightly more cumbersome.
