This article contains important notices. Ignoring them may affect your business, so read it carefully.
This article explains why a Page Fault exception can crash an ECS instance that runs Alibaba Cloud Linux 2, and how to resolve the problem.
Problem description
On an Alibaba Cloud Linux 2 instance that meets the following conditions, the system crashes while running and prints the call stack shown below.
Image: Alibaba Cloud Linux 2.1903 LTS 64-bit.
Kernel: kernel-4.19.91-23.al7 and earlier versions. You can run the uname -r command to check the kernel version.
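The version condition above can be checked with a short script. The following is only a sketch: the helper name is_affected_kernel is hypothetical, it compares just the version and the first release number (for example 4.19.91-19), and it relies on the version sort provided by GNU sort -V. Point releases close to the boundary should still be confirmed against the image release notes.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given kernel release string is at or
# below 4.19.91-23, the last affected release named in this article.
is_affected_kernel() {
    # Keep only "version-release major", e.g. 4.19.91-19.1.al7.x86_64 -> 4.19.91-19
    key=$(printf '%s\n' "$1" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/')
    # sort -V orders version strings, so the newest of the two ends up last
    newest=$(printf '%s\n%s\n' "$key" "4.19.91-23" | sort -V | tail -n 1)
    [ "$newest" = "4.19.91-23" ]
}

if is_affected_kernel "$(uname -r)"; then
    echo "kernel $(uname -r) is within the affected range"
else
    echo "kernel $(uname -r) is newer than the affected range"
fi
```

The comparison deliberately ignores everything after the first release number, so a release such as 19.1 is treated the same as 19.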
[ 332.057218] watchdog: BUG: soft lockup - CPU#7 stuck for 11s! [split_v2:28356]
[ 332.057219] mousedev isst_if_common hid_generic usbhid
[ 332.057223] CPU: 3 PID: 28336 Comm: split_v2 Kdump: loaded Not tainted 4.19.91-19.1.al7.x86_64 #1
[ 332.057507] Kernel panic - not syncing: softlockup: hung tasks
[ 332.057508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 332.057510] CPU: 6 PID: 28355 Comm: split_v2 Kdump: loaded Tainted: G L 4.19.91-19.1.al7.x86_64 #1
[ 332.057513] cp_new_stat+0x13d/0x160
[ 332.057514] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000019
[ 332.057515] Call Trace:
[ 332.057516] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[ 332.057518] __se_sys_newfstat+0x2e/0x40
[ 332.057518] Call Trace:
[ 332.057519] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057521] RBP: 00007eff1201bf10 R08: 00007eff1201c700 R09: 00007eff1201c700
[ 332.057523] do_syscall_64+0x5b/0x1b0
[ 332.057524] <IRQ>
[ 332.057525] RSP: 0018:ffffa389886efde8 EFLAGS: 00050206
[ 332.057529] dump_stack+0x66/0x8b
[ 332.057531] R10: 00007eff1201c9d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057534] panic+0xd8/0x24c
[ 332.057535] RAX: 000000c000100090 RBX: ffffa389886efea8 RCX: 0000000000000090
[ 332.057536] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff1201c700
[ 332.057539] __do_page_fault+0x11d/0x470
[ 332.057540] ? 0xffffffffc0477000
[ 332.057541] RDX: 0000000000000090 RSI: ffffa389886efdf8 RDI: 000000c000100000
[ 332.057552] watchdog_timer_fn+0x253/0x260
[ 332.057555] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057556] ? softlockup_fn+0x40/0x40
[ 332.057557] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057559] __hrtimer_run_queues+0xeb/0x250
[ 332.057560] R10: ffff8bfb1690a310 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee04df00
[ 332.057562] hrtimer_interrupt+0x122/0x270
[ 332.057563] RIP: 0033:0x7eff1b11e3a4
[ 332.057564] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057566] smp_apic_timer_interrupt+0x6a/0x140
[ 332.057568] do_page_fault+0x32/0x140
[ 332.057570] apic_timer_interrupt+0xf/0x20
[ 332.057572] _copy_to_user+0x22/0x30
[ 332.057573] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057574] </IRQ>
[ 332.057575] RSP: 002b:00007eff1181aed8 EFLAGS: 00000246
[ 332.057578] RIP: 0010:__do_page_fault+0x227/0x470
[ 332.057579] ORIG_RAX: 0000000000000005
[ 332.057580] Code: 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 f6 85 91 00 00 00 02 41 bf 14 00 00 00 0f 84 c5 fe ff ff fb 66 0f 1f 44 00 00 <e9> b9 fe ff ff f6 85 88 00 00 00 03 75 0d f6 85 92 00 00 00 04 0f
[ 332.057582] cp_new_stat+0x13d/0x160
[ 332.057583] RSP: 0018:ffffa389886f7ca0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 332.057585] __se_sys_newfstat+0x2e/0x40
[ 332.057586] RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffffffff93a00ae0
[ 332.057587] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057588] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffa389886f7d38
[ 332.057589] do_syscall_64+0x5b/0x1b0
[ 332.057590] RBP: ffffa389886f7d38 R08: 0000000000000000 R09: 0000000000000000
[ 332.057591] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000009
[ 332.057592] R10: 0000000000000000 R11: 0000000000000000 R12: 000000c000100000
[ 332.057594] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057595] R13: ffff8bfb168bd940 R14: ffff8bfaee04af80 R15: 0000000000000014
[ 332.057597] RIP: 0033:0x7eff1b11e3a4
[ 332.057599] async_page_fault+0x1e/0x30
[ 332.057601] ? restore_regs_and_return_to_kernel+0x25/0x25
[ 332.057602] RBP: 00007eff1181af10 R08: 00007eff1181b700 R09: 00007eff1181b700
[ 332.057602] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057604] do_page_fault+0x32/0x140
[ 332.057606] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x20
[ 332.057607] R10: 00007eff1181b9d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057608] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057609] async_page_fault+0x1e/0x30
[ 332.057610] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff1181b700
[ 332.057612] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x20
[ 332.057613] RSP: 002b:00007eff08808ed8 EFLAGS: 00000246
[ 332.057614] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057615] ORIG_RAX: 0000000000000005
[ 332.057616] RSP: 0018:ffffa389886f7de8 EFLAGS: 00050206
[ 332.057617] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057618] RAX: 000000c000100090 RBX: ffffa389886f7ea8 RCX: 0000000000000090
[ 332.057619] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000024
[ 332.057620] RDX: 0000000000000090 RSI: ffffa389886f7df8 RDI: 000000c000100000
[ 332.057621] RSP: 0018:ffffa389886ffde8 EFLAGS: 00050206
[ 332.057623] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057624] RBP: 00007eff08808f10 R08: 00007eff08809700 R09: 00007eff08809700
[ 332.057625] R10: ffff8bfb1690b810 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee04af80
[ 332.057626] R10: 00007eff088099d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057627] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057628] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff08809700
[ 332.057630] _copy_to_user+0x22/0x30
[ 332.057631] RAX: 000000c000100090 RBX: ffffa389886ffea8 RCX: 0000000000000090
[ 332.057632] cp_new_stat+0x13d/0x160
[ 332.057633] RDX: 0000000000000090 RSI: ffffa389886ffdf8 RDI: 000000c000100000
[ 332.057634] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057635] __se_sys_newfstat+0x2e/0x40
[ 332.057636] R10: ffff8bfb1690ad10 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee048000
[ 332.057637] do_syscall_64+0x5b/0x1b0
[ 332.057638] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057640] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057642] _copy_to_user+0x22/0x30
[ 332.057643] RIP: 0033:0x7eff1b11e3a4
[ 332.057645] cp_new_stat+0x13d/0x160
[ 332.057646] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057647] __se_sys_newfstat+0x2e/0x40
[ 332.057648] RSP: 002b:00007eff08007ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000005
[ 332.057651] do_syscall_64+0x5b/0x1b0
[ 332.057652] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057654] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057655] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 000000000000002e
[ 332.057656] RIP: 0033:0x7eff1b11e3a4
[ 332.057657] RBP: 00007eff08007f10 R08: 00007eff08008700 R09: 00007eff08008700
[ 332.057658] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057659] R10: 00007eff080089d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057660] RSP: 002b:00007eff07806ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000005
[ 332.057662] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff08008700
[ 332.057663] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057663] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 000000000000001e
[ 332.057664] RBP: 00007eff07806f10 R08: 00007eff07807700 R09: 00007eff07807700
[ 332.057665] R10: 00007eff078079d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057665] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff07807700
Cause
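Alibaba Cloud Linux enables Transparent Huge Pages (THP) by default. When memory is reclaimed during garbage collection (GC), the system disables huge pages with MADV_NOHUGEPAGE and calls MADV_FREE to release some of the 4 KB pages. Splitting a THP huge page, however, can be delayed when other processes occupy the CPU and the splitting process is not scheduled in time. The process that hit the Page Fault exception and the process that is splitting the THP then wait on each other, which eventually triggers a soft lockup. If /proc/sys/kernel/softlockup_panic is set on the Alibaba Cloud Linux instance, the soft lockup causes a kernel panic.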
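To see whether an instance matches the conditions described above, you can inspect the two settings directly. The sketch below reads the standard Linux paths /sys/kernel/mm/transparent_hugepage/enabled and /proc/sys/kernel/softlockup_panic. The helper name active_thp_mode is hypothetical, and the script only reports the values without changing them.

```shell
#!/bin/sh
# Hypothetical helper: extract the active THP mode from a line such as
# "[always] madvise never" (the active mode is the one in square brackets).
active_thp_mode() {
    printf '%s\n' "$1" | sed -E 's/.*\[([a-z]+)\].*/\1/'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
panic_file=/proc/sys/kernel/softlockup_panic

# Report the current THP mode if the kernel exposes it
if [ -r "$thp_file" ]; then
    echo "THP mode: $(active_thp_mode "$(cat "$thp_file")")"
fi

# A value of 1 means the kernel panics when a soft lockup is detected
if [ -r "$panic_file" ]; then
    echo "softlockup_panic: $(cat "$panic_file")"
fi
```

A THP mode of always or madvise together with softlockup_panic set to 1 corresponds to the situation in which the soft lockup ends in a crash.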
Solution
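Upgrading the kernel may cause compatibility and stability issues. Review the Alibaba Cloud Linux 2 image release notes to understand the kernel features involved, and proceed with caution.
Restarting the instance temporarily stops it, which may interrupt your business and cause data loss. Back up critical data before you restart, and perform the restart during off-peak hours.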
For kernel version 4.19.91-23.al7.x86_64 and earlier, run the following command to upgrade the kernel to the latest version.
sudo yum update kernel
Run the following command to restart the instance so that the change takes effect.
sudo reboot
For kernel versions from 4.19.91-19.1.al7.x86_64 (inclusive) to 4.19.91-23.al7.x86_64 (inclusive), run the following command to install the kernel hotfix.
sudo yum install -y kernel-hotfix-5902278-`uname -r | awk -F"-" '{print $NF}'`
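The backquoted part of the hotfix command takes the output of uname -r, splits it on "-", and keeps the last field, so the package name ends with the release suffix of the running kernel. The sketch below reproduces that step with a fixed sample string so you can see which package name the command resolves to. The function name hotfix_package is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build the hotfix package name for a kernel release
# string, using the same awk expression as the install command.
hotfix_package() {
    # Split on "-" and keep the last field, e.g. 19.1.al7.x86_64
    suffix=$(printf '%s\n' "$1" | awk -F"-" '{print $NF}')
    printf 'kernel-hotfix-5902278-%s\n' "$suffix"
}

hotfix_package "4.19.91-19.1.al7.x86_64"
# prints: kernel-hotfix-5902278-19.1.al7.x86_64
```

Because the suffix comes from the running kernel, the install command selects the hotfix package built for that exact kernel release.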