System crash on Alibaba Cloud Linux 2 ECS instances caused by a null pointer access when an interrupt handler frees memory pages

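Problem description

The system crashes at runtime on Alibaba Cloud Linux 2 instances that meet the following conditions.

  • Image: Alibaba Cloud Linux 2.1903 LTS 64-bit.

  • Kernel: 4.19.91-21.al7.x86_64 and earlier kernel versions.

The system crashes and the following call stack is displayed.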

[7674143.032169] general protection fault: 0000 [#1] SMP PTI
[7674143.033229] CPU: 2 PID: 23701 Comm: kube-state-metr Not tainted 4.19.91-19.1.al7.x86_64 #1
[7674143.034412] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[7674143.035489] RIP: 0010:free_one_page+0x2c4/0x440
[7674143.036259] Code: 89 e9 b8 ff ff ff ff 4d 89 70 08 d3 e0 31 f6 4c 89 ff 48 63 d0 e8 ac ff 01 00 eb b2 48 83 7a 08 00 0f 85 aa fd ff ff 48 8b 02 <4c> 3b 78 40 0f 85 9d fd ff ff 80 78 77 00 4c 0f 45 c2 e9 90 fd ff
[7674143.038795] RSP: 0000:ffff926a5fa83c10 EFLAGS: 00010046
[7674143.039723] RAX: fe9f1635d944f100 RBX: 0000000000000000 RCX: 0000000000000003
[7674143.040915] RDX: ffffa14a86fcfae0 RSI: ffffe2f184224a00 RDI: ffff926a7ffd7010
[7674143.042363] RBP: 0000000000000003 R08: 0000000000000000 R09: ffff92634892f80c
[7674143.043891] R10: 000000000000004e R11: 000000000000000c R12: 0000000000108928
[7674143.045435] R13: ffffe2f180000000 R14: ffffe2f184224a00 R15: ffff926a7ffd6b80
[7674143.046815] FS:  000000c0003a4090(0000) GS:ffff926a5fa80000(0000) knlGS:0000000000000000
[7674143.048132] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7674143.049236] CR2: 000000c005600000 CR3: 0000000449dd4006 CR4: 00000000003606e0
[7674143.050477] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7674143.051744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7674143.053011] Call Trace:
[7674143.053814]  <IRQ>
[7674143.054614]  __free_pages_ok+0x14e/0x2b0
[7674143.055653]  page_to_skb.isra.70+0x2b2/0x310
[7674143.056774]  receive_mergeable+0x3ba/0xb50
[7674143.057793]  ? tcp_v4_rcv+0xc3b/0xdc0
[7674143.058765]  receive_buf+0x2b9/0xa00
[7674143.059737]  ? ip_local_deliver_finish+0x9f/0x1f0
[7674143.060877]  ? detach_buf+0x68/0x110
[7674143.061877]  virtnet_poll+0x141/0x320
[7674143.062896]  net_rx_action+0x127/0x320
[7674143.063926]  __do_softirq+0xd1/0x28c
[7674143.064985]  irq_exit+0xd2/0xf0
[7674143.065975]  do_IRQ+0x54/0xe0
[7674143.066967]  common_interrupt+0xf/0xf
[7674143.068023]  </IRQ>
[7674143.068952] RIP: 0010:compact_zone_order+0x6e/0xd0
[7674143.070134] Code: 95 c0 85 db 89 74 24 70 48 89 e6 89 84 24 80 00 00 00 0f 94 c0 65 48 8b 1c 25 00 4d 01 00 48 89 a3 a8 08 00 00 4c 89 64 24 50 <89> 54 24 6c 44 89 44 24 78 44 89 4c 24 7c 88 84 24 84 00 00 00 88
[7674143.073642] RSP: 0000:ffffa14a86fcfae0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffd7
[7674143.075240] RAX: 0000000000000000 RBX: ffff926590c15f00 RCX: 0000000000000000
[7674143.076813] RDX: 00000000002342ca RSI: ffffa14a86fcfae0 RDI: ffffa14a86fcfaf0
[7674143.078413] RBP: ffffa14a86fcfb80 R08: 0000000000000040 R09: 0000000000000002
[7674143.080007] R10: 0000000000000009 R11: 0000000000000000 R12: ffff926a7ffd6000
[7674143.081689] R13: 0000000000000009 R14: ffffa14a86fcfcf8 R15: ffff926a7ffd7ce0
[7674143.083332]  try_to_compact_pages+0x19f/0x250
[7674143.084709]  __alloc_pages_direct_compact+0x83/0x160
[7674143.086153]  __alloc_pages_nodemask+0xdde/0xf60
[7674143.087579]  ? tcp_schedule_loss_probe+0xe3/0x150
[7674143.089240]  ? tcp_write_xmit+0x2c8/0xf40
[7674143.090679]  ? get_vma_policy+0xa/0x30
[7674143.092018]  ? alloc_pages_vma+0x122/0x190
[7674143.093419]  do_huge_pmd_anonymous_page+0x135/0x590
[7674143.094838]  ? do_anonymous_page+0x39f/0x540
[7674143.096167]  __handle_mm_fault+0x8fd/0xa20
[7674143.097505]  handle_mm_fault+0x122/0x210
[7674143.098786]  __do_page_fault+0x1b7/0x470
[7674143.100017]  do_page_fault+0x32/0x140
[7674143.101224]  ? async_page_fault+0x8/0x30
[7674143.102457]  async_page_fault+0x1e/0x30
[7674143.103754] RIP: 0033:0x45fe23
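
To compare a crash log from your own instance against the call stack above, a small check like the following may help. It is a minimal sketch: the function name `matches_signature` and the choice of three marker strings are assumptions made for illustration, and a match only suggests, rather than proves, that the crash is the one described here.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Alibaba Cloud tool).
# Succeeds if the given log file contains the three markers seen in the
# call stack above: the fault type, the faulting function in interrupt
# context, and the interrupted compaction function.
matches_signature() {
    log="$1"
    grep -q 'general protection fault' "$log" &&
    grep -q 'RIP: 0010:free_one_page' "$log" &&
    grep -q 'compact_zone_order' "$log"
}
```

For example, `matches_signature /path/to/saved-dmesg.txt && echo "looks like this issue"`, where the path is a placeholder for wherever you saved the kernel log.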

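Cause

The operating system is interrupted while it performs memory compaction (memory defragmentation), and the interrupt handler frees a memory page that compaction needs to operate on. Because the CPU reorders instructions during compaction, the Capture Control structure has not finished initializing and is still a null pointer. When the interrupt handler frees the page and accesses the Capture Control structure, the null pointer causes the system to crash. In the call stack above, this appears as the network receive path (virtnet_poll) reaching free_one_page inside the <IRQ> section while the interrupted task is in compact_zone_order.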

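Solution

Note
  • If you perform risky operations such as modifying or changing instances or data, pay attention to the disaster recovery and fault tolerance capabilities of the instances to ensure data security.

  • If you modify the configuration or data of instances (including but not limited to ECS and RDS instances), we recommend that you create snapshots or enable features such as RDS log backup in advance.

  • If you have granted permissions on, or submitted, security information such as logon accounts and passwords on the Alibaba Cloud platform, we recommend that you change it promptly.

If you encounter this issue, use the following procedure:

  1. Log on to the instance and run the following command to confirm that this solution applies to the kernel version.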

    uname -r

    The output is similar to the following.

    4.19.91-19.1.al7.x86_64
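  2. Choose the fix that matches the kernel version:

    • For versions earlier than 4.19.91-19.1.al7.x86_64 (exclusive):

      1. Run the following command to update the kernel to the latest version.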

        yum update kernel
      2. The updated kernel takes effect only after a restart. Run the following command to restart the server.

        reboot
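      3. If the issue persists on the latest kernel version, install the kernel hotfix as described in the next item.

    • For versions from 4.19.91-19.1.al7.x86_64 (inclusive) to 4.19.91-21.al7.x86_64 (inclusive), install the kernel hotfix by running the following command.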

      yum install -y kernel-hotfix-5000697-`uname -r | awk -F"-" '{print $NF}'`
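
The version ranges in step 2 can be sketched as a small shell helper. This is an illustration under stated assumptions, not an official tool: the function names `version_lt`, `classify_kernel`, and `hotfix_package` are hypothetical, and the comparison relies on `sort -V` (GNU version sort), which only approximates RPM version ordering and may disagree with it on unusual release strings. The package name is built with the same `awk` expression as the install command above.

```shell
#!/bin/sh
# Hypothetical helpers that mirror the version ranges in this article.

LOW="4.19.91-19.1.al7.x86_64"   # first version covered by the hotfix
HIGH="4.19.91-21.al7.x86_64"    # last version covered by the hotfix

# version_lt A B: succeeds if A sorts strictly before B under sort -V.
# Assumption: sort -V is close enough to RPM ordering for these strings.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# classify_kernel RELEASE: prints which path of step 2 applies.
#   update     -> earlier than LOW: update the kernel and reboot
#   hotfix     -> between LOW and HIGH (inclusive): install the hotfix
#   unaffected -> later than HIGH: outside the range this article covers
classify_kernel() {
    if version_lt "$1" "$LOW"; then
        echo update
    elif version_lt "$HIGH" "$1"; then
        echo unaffected
    else
        echo hotfix
    fi
}

# hotfix_package RELEASE: prints the hotfix package name for RELEASE,
# using the same awk field split as the yum command in this article.
hotfix_package() {
    echo "kernel-hotfix-5000697-$(echo "$1" | awk -F"-" '{print $NF}')"
}
```

For example, `classify_kernel "$(uname -r)"` prints the path for the running kernel, and `hotfix_package "$(uname -r)"` prints the package name that the install command would request. Neither function changes the system.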

Applies to

  • Elastic Compute Service (ECS)