Unified Kernel Fault Event Framework (UKFEF)

Alibaba Cloud Linux 3 (starting from kernel version 5.10.60-9.al8.x86_64) introduces the Unified Kernel Fault Event Framework (UKFEF). UKFEF collects system anomalies that may pose risks and generates event reports in a unified format. This topic describes the events collected by UKFEF, the format of the event reports, and the interfaces used to control UKFEF.

Background information

An operating system may show certain signs or messages before serious problems occur. During Operations and Maintenance (O&M), you can use this information to predict faults and perform preventive operations. However, this information is scattered across different system modules and is available in various formats. As a result, you may face the following issues when collecting system anomalies:

  • Parsing anomalies and their potential impacts requires specialized knowledge.

  • Anomalies are reported in different formats, which complicates automated O&M: collected information must first be matched against each format and then cleaned.

To solve these problems, Alibaba Cloud Linux 3 includes UKFEF at the kernel layer. UKFEF collects various system anomalies that may pose risks, automatically determines the event severity, and generates event reports in a unified format. The reports include the scenarios in which the problems occurred and the recommended risk levels. This simplifies the identification of system anomalies during O&M. UKFEF also classifies known anomalies and provides system risk reports that were not available in previous kernel versions.

UKFEF generates reports based on multiple dimensions, such as the type, impact, and statistics of anomalies. This helps you efficiently diagnose system anomalies during O&M. In addition, event reports are output through multiple channels to prevent data loss.

Event description

The following table describes the event types, event levels, and report formats that UKFEF uses.

Event information

Description

Event type

UKFEF collects the following common operating system kernel events:

  • soft lockup

  • Read-Copy Update (RCU) stall

  • hung task

  • global out-of-memory (OOM)

  • cgroup OOM

  • page allocation failure

  • list corruption

  • bad mm_struct

  • I/O error

  • EXT4-fs error

  • Machine Check Exception (MCE)

  • fatal signal

  • warning

  • panic

Event level

UKFEF classifies anomaly events into three levels:

  • Slight: Does not affect system operation, but services deployed on the system might experience jitter. You can continue to monitor the event.

  • Normal: The current application process might become abnormal. Take action on the affected application, such as killing, restarting, or migrating it.

  • Fatal: May have a fatal impact on the system. Migrate your services immediately.

Event report

UKFEF outputs event reports in the following ways:

  • Outputs the details of each event to the kernel log. Each message uses the following format:

    class Fault event[module:type]:messages. At cpu cpuid, task pid(cmdline). Total fault: cnt

    The following are the details:

    • class: The level of the anomaly event.

    • module: The module to which the anomaly event belongs. Modules include sched, mem, io, fs, net, and hardware. If an anomaly is caused by multiple modules, `general` is output.

    • type: The type of the anomaly event.

    • messages: The custom message of the event.

    • cpuid: The ID of the CPU where the anomaly event occurred.

    • pid(cmdline): The process ID (PID) and command line of the process that corresponds to the anomaly event.

      Note

      If the PID is -1, no corresponding process exists.

    • cnt: The total number of times this type of anomaly event has occurred since the system started.

  • Outputs the total count of each type of anomaly event to the /proc/fault_events file. The following sample output shows the file content:

    Total fault events: 0
    Slight: 0
    Normal: 0
    Fatal: 0
    soft lockup: 0
    rcu stall: 0
    hung task: 0
    global oom: 0
    cgroup oom: 0
    page allocation failure: 0
    list corruption: 0
    bad mm_struct: 0
    io error: 0
    ext4 fs error: 0
    mce: 0
    fatal signal: 0
    warning: 0
    panic: 0
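Both report formats are regular enough to be consumed by a script. The following is a minimal Python sketch, not an official tool: the regular expression is derived only from the message template shown above, so the exact spacing in real kernel logs is an assumption, and the function names `parse_fault_line` and `parse_fault_counters` are hypothetical.

```python
import re
from typing import Dict, Optional

# Pattern built from the documented template:
#   class Fault event[module:type]:messages. At cpu cpuid, task pid(cmdline). Total fault: cnt
# Assumption: real log lines follow this spacing; adjust the pattern if they do not.
FAULT_LINE = re.compile(
    r"(?P<level>Slight|Normal|Fatal) Fault event"
    r"\[(?P<module>[^:\]]+):(?P<type>[^\]]+)\]:\s*"
    r"(?P<message>.*?)\. At cpu (?P<cpu>\d+), "
    r"task (?P<pid>-?\d+)\((?P<cmdline>.*?)\)\. "
    r"Total fault: (?P<count>\d+)"
)


def parse_fault_line(line: str) -> Optional[Dict[str, object]]:
    """Return the fields of one UKFEF log line, or None if the line does not match."""
    match = FAULT_LINE.search(line)  # search() tolerates a leading log timestamp
    if match is None:
        return None
    fields: Dict[str, object] = dict(match.groupdict())
    for key in ("cpu", "pid", "count"):
        fields[key] = int(str(fields[key]))  # a pid of -1 means no process
    return fields


def parse_fault_counters(text: str) -> Dict[str, int]:
    """Parse 'name: value' lines such as the content of /proc/fault_events."""
    counters: Dict[str, int] = {}
    for raw in text.splitlines():
        name, sep, value = raw.rpartition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters
```

On a system that provides the file, `parse_fault_counters(open("/proc/fault_events").read())` would return a dictionary keyed by the names shown in the sample output, for example `"soft lockup"` or `"Fatal"`.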

Control interfaces

Interface

Description

/proc/sys/kernel/fault_event_enable

Controls whether to enable or disable UKFEF. Valid values:

  • 1: Enables UKFEF.

  • 0: Disables UKFEF.

/proc/sys/kernel/fault_event_print

Controls whether UKFEF outputs event reports. Valid values:

  • 1: Outputs event reports.

  • 0: Does not output event reports.

/proc/sys/kernel/panic_on_fatal_event

Controls whether a kernel panic is triggered when a Fatal event occurs. Valid values:

  • 1: Triggers a kernel panic.

  • 0: Does not trigger a kernel panic.
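The three interfaces are ordinary files that hold 0 or 1, so they can be read and written from a script. The following Python sketch illustrates one way to do this. The helper names `read_knob` and `write_knob` and the `base` parameter are hypothetical and exist only for this example; writing to the real files under /proc/sys/kernel requires root privileges.

```python
from pathlib import Path

# Directory that holds the control interfaces listed above.
SYSCTL_DIR = Path("/proc/sys/kernel")

# File names taken from the table above.
KNOBS = ("fault_event_enable", "fault_event_print", "panic_on_fatal_event")


def _check(name: str) -> None:
    """Reject names that are not one of the documented interfaces."""
    if name not in KNOBS:
        raise ValueError(f"unknown UKFEF interface: {name}")


def read_knob(name: str, base: Path = SYSCTL_DIR) -> int:
    """Return the current value (0 or 1) of a control interface."""
    _check(name)
    return int((base / name).read_text().strip())


def write_knob(name: str, enabled: bool, base: Path = SYSCTL_DIR) -> None:
    """Write 1 or 0 to a control interface (requires root on a real system)."""
    _check(name)
    (base / name).write_text("1\n" if enabled else "0\n")
```

For example, `write_knob("panic_on_fatal_event", False)` would write 0 to /proc/sys/kernel/panic_on_fatal_event. The `base` parameter makes it possible to try the helpers against a scratch directory before touching the live system.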