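This topic contains important notices. Ignoring them may affect your business, so read the topic carefully.
This topic describes why an ECS instance running Alibaba Cloud Linux 2 can crash when an interrupt handler frees a memory page and dereferences a null pointer, and how to resolve the problem.
Problem description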
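An Alibaba Cloud Linux 2 instance that meets the following conditions crashes at run time and logs the call trace shown below.
Image: Alibaba Cloud Linux 2.1903 LTS 64-bit.
Kernel: 4.19.91-21.al7.x86_64 or earlier. You can run the uname -r command to check the kernel version.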
[7674143.032169] general protection fault: 0000 [#1] SMP PTI
[7674143.033229] CPU: 2 PID: 23701 Comm: kube-state-metr Not tainted 4.19.91-19.1.al7.x86_64 #1
[7674143.034412] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[7674143.035489] RIP: 0010:free_one_page+0x2c4/0x440
[7674143.036259] Code: 89 e9 b8 ff ff ff ff 4d 89 70 08 d3 e0 31 f6 4c 89 ff 48 63 d0 e8 ac ff 01 00 eb b2 48 83 7a 08 00 0f 85 aa fd ff ff 48 8b 02 <4c> 3b 78 40 0f 85 9d fd ff ff 80 78 77 00 4c 0f 45 c2 e9 90 fd ff
[7674143.038795] RSP: 0000:ffff926a5fa83c10 EFLAGS: 00010046
[7674143.039723] RAX: fe9f1635d944f100 RBX: 0000000000000000 RCX: 0000000000000003
[7674143.040915] RDX: ffffa14a86fcfae0 RSI: ffffe2f184224a00 RDI: ffff926a7ffd7010
[7674143.042363] RBP: 0000000000000003 R08: 0000000000000000 R09: ffff92634892f80c
[7674143.043891] R10: 000000000000004e R11: 000000000000000c R12: 0000000000108928
[7674143.045435] R13: ffffe2f180000000 R14: ffffe2f184224a00 R15: ffff926a7ffd6b80
[7674143.046815] FS: 000000c0003a4090(0000) GS:ffff926a5fa80000(0000) knlGS:0000000000000000
[7674143.048132] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7674143.049236] CR2: 000000c005600000 CR3: 0000000449dd4006 CR4: 00000000003606e0
[7674143.050477] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7674143.051744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7674143.053011] Call Trace:
[7674143.053814] <IRQ>
[7674143.054614] __free_pages_ok+0x14e/0x2b0
[7674143.055653] page_to_skb.isra.70+0x2b2/0x310
[7674143.056774] receive_mergeable+0x3ba/0xb50
[7674143.057793] ? tcp_v4_rcv+0xc3b/0xdc0
[7674143.058765] receive_buf+0x2b9/0xa00
[7674143.059737] ? ip_local_deliver_finish+0x9f/0x1f0
[7674143.060877] ? detach_buf+0x68/0x110
[7674143.061877] virtnet_poll+0x141/0x320
[7674143.062896] net_rx_action+0x127/0x320
[7674143.063926] __do_softirq+0xd1/0x28c
[7674143.064985] irq_exit+0xd2/0xf0
[7674143.065975] do_IRQ+0x54/0xe0
[7674143.066967] common_interrupt+0xf/0xf
[7674143.068023] </IRQ>
[7674143.068952] RIP: 0010:compact_zone_order+0x6e/0xd0
[7674143.070134] Code: 95 c0 85 db 89 74 24 70 48 89 e6 89 84 24 80 00 00 00 0f 94 c0 65 48 8b 1c 25 00 4d 01 00 48 89 a3 a8 08 00 00 4c 89 64 24 50 <89> 54 24 6c 44 89 44 24 78 44 89 4c 24 7c 88 84 24 84 00 00 00 88
[7674143.073642] RSP: 0000:ffffa14a86fcfae0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffd7
[7674143.075240] RAX: 0000000000000000 RBX: ffff926590c15f00 RCX: 0000000000000000
[7674143.076813] RDX: 00000000002342ca RSI: ffffa14a86fcfae0 RDI: ffffa14a86fcfaf0
[7674143.078413] RBP: ffffa14a86fcfb80 R08: 0000000000000040 R09: 0000000000000002
[7674143.080007] R10: 0000000000000009 R11: 0000000000000000 R12: ffff926a7ffd6000
[7674143.081689] R13: 0000000000000009 R14: ffffa14a86fcfcf8 R15: ffff926a7ffd7ce0
[7674143.083332] try_to_compact_pages+0x19f/0x250
[7674143.084709] __alloc_pages_direct_compact+0x83/0x160
[7674143.086153] __alloc_pages_nodemask+0xdde/0xf60
[7674143.087579] ? tcp_schedule_loss_probe+0xe3/0x150
[7674143.089240] ? tcp_write_xmit+0x2c8/0xf40
[7674143.090679] ? get_vma_policy+0xa/0x30
[7674143.092018] ? alloc_pages_vma+0x122/0x190
[7674143.093419] do_huge_pmd_anonymous_page+0x135/0x590
[7674143.094838] ? do_anonymous_page+0x39f/0x540
[7674143.096167] __handle_mm_fault+0x8fd/0xa20
[7674143.097505] handle_mm_fault+0x122/0x210
[7674143.098786] __do_page_fault+0x1b7/0x470
[7674143.100017] do_page_fault+0x32/0x140
[7674143.101224] ? async_page_fault+0x8/0x30
[7674143.102457] async_page_fault+0x1e/0x30
[7674143.103754] RIP: 0033:0x45fe23
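The kernel condition listed above can be checked with a short script. The following is a minimal sketch, not an official tool: the helper name is_affected is hypothetical, and it assumes GNU sort with the -V (version sort) option. Version sort may not order every sub-release (for example a 21.x build) the way the vendor numbers them, so treat the result as a hint and confirm against the image release notes.

```shell
#!/bin/bash
# Last kernel release named as affected in this article.
LAST_AFFECTED="4.19.91-21.al7.x86_64"

# Hypothetical helper: succeeds when the given release sorts at or below
# LAST_AFFECTED under GNU version sort.
is_affected() {
    local release="$1"
    local highest
    # Sort the two releases; if the highest one is LAST_AFFECTED,
    # the given release is not newer than it.
    highest=$(printf '%s\n%s\n' "$release" "$LAST_AFFECTED" | sort -V | tail -n 1)
    [ "$highest" = "$LAST_AFFECTED" ]
}

if is_affected "$(uname -r)"; then
    echo "kernel $(uname -r): within the affected range"
else
    echo "kernel $(uname -r): newer than the last affected release"
fi
```

The script only compares version strings. It does not check the image type, so it is meaningful only on an Alibaba Cloud Linux 2 instance.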
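Cause
While the operating system was performing memory compaction (defragmentation), an interrupt occurred, and the interrupt handler freed a memory page that compaction was operating on. Because the CPU reordered instructions during compaction, the Capture Control structure had not finished initializing and was still a null pointer. When the interrupt handler freed the page and accessed the Capture Control structure, the null pointer caused the system to crash.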
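The cause described here leaves two visible marks in the call trace: the faulting instruction is in free_one_page (the page-freeing path reached from the interrupt), and the interrupted context is in compact_zone_order (memory compaction). The following is a rough sketch for screening a saved kernel log for both marks. The helper name has_compaction_irq_crash is hypothetical, and a match is only a hint that the log resembles this issue, not proof of the same root cause.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when the log file contains both the faulting
# RIP in free_one_page and the interrupted RIP in compact_zone_order.
has_compaction_irq_crash() {
    local log_file="$1"
    grep -q 'RIP: 0010:free_one_page' "$log_file" &&
        grep -q 'RIP: 0010:compact_zone_order' "$log_file"
}

# Sample input: three lines copied from the call trace in this article.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[7674143.035489] RIP: 0010:free_one_page+0x2c4/0x440
[7674143.055653] page_to_skb.isra.70+0x2b2/0x310
[7674143.068952] RIP: 0010:compact_zone_order+0x6e/0xd0
EOF

if has_compaction_irq_crash "$sample"; then
    echo "log matches the pattern described in this article"
else
    echo "log does not match the pattern"
fi
rm -f "$sample"
```

On a real system, the argument would be a file that holds the saved crash log instead of the temporary sample.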
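Solution
Warning
Upgrading the kernel may cause compatibility and stability issues. Before you proceed, review the Alibaba Cloud Linux 2 image release notes to understand the features of each kernel version, and proceed with caution.
Restarting an instance stops it temporarily, which may interrupt your business and cause data loss. Back up critical data before you restart, and perform the operation during off-peak hours.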
For kernel version 4.19.91-21.al7.x86_64 and earlier, run the following command to upgrade the kernel to the latest version.
sudo yum update kernel
Run the following command to restart the instance so that the change takes effect.
sudo reboot
For kernel versions from 4.19.91-19.1.al7.x86_64 through 4.19.91-21.al7.x86_64 (both inclusive), run the following command to install the kernel hotfix.
sudo yum install -y kernel-hotfix-5000697-`uname -r | awk -F"-" '{print $NF}'`
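The backtick expression in the hotfix command builds the package name from the running kernel release. The sketch below shows that step in isolation, using the release string from the call trace in this article as sample input. The variable names are illustrative only.

```shell
#!/bin/bash
# Sample release string taken from the call trace in this article;
# on a real instance this would be "$(uname -r)".
release="4.19.91-19.1.al7.x86_64"

# awk splits the string on "-" and prints the last field,
# that is, everything after the final hyphen.
suffix=$(echo "$release" | awk -F"-" '{print $NF}')

# The hotfix package name is the fixed prefix plus that suffix.
pkg="kernel-hotfix-5000697-${suffix}"
echo "$pkg"
# prints: kernel-hotfix-5000697-19.1.al7.x86_64
```

Because the name is derived from the running kernel, the command resolves to a different package on each affected release. After installation, querying that name with rpm -q is one way to confirm that the package is present.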