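Problem description
On Alibaba Cloud Linux 2 instances that meet the following conditions, the system crashes at runtime.
Image: Alibaba Cloud Linux 2.1903 LTS 64-bit.
Kernel: kernel-4.19.91-23.al7 or earlier.
The system crashes and prints call trace information similar to the following.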
[ 332.057218] watchdog: BUG: soft lockup - CPU#7 stuck for 11s! [split_v2:28356]
[ 332.057219] mousedev isst_if_common hid_generic usbhid
[ 332.057223] CPU: 3 PID: 28336 Comm: split_v2 Kdump: loaded Not tainted 4.19.91-19.1.al7.x86_64 #1
[ 332.057507] Kernel panic - not syncing: softlockup: hung tasks
[ 332.057508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 332.057510] CPU: 6 PID: 28355 Comm: split_v2 Kdump: loaded Tainted: G L 4.19.91-19.1.al7.x86_64 #1
[ 332.057513] cp_new_stat+0x13d/0x160
[ 332.057514] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000019
[ 332.057515] Call Trace:
[ 332.057516] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[ 332.057518] __se_sys_newfstat+0x2e/0x40
[ 332.057518] Call Trace:
[ 332.057519] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057521] RBP: 00007eff1201bf10 R08: 00007eff1201c700 R09: 00007eff1201c700
[ 332.057523] do_syscall_64+0x5b/0x1b0
[ 332.057524] <IRQ>
[ 332.057525] RSP: 0018:ffffa389886efde8 EFLAGS: 00050206
[ 332.057529] dump_stack+0x66/0x8b
[ 332.057531] R10: 00007eff1201c9d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057534] panic+0xd8/0x24c
[ 332.057535] RAX: 000000c000100090 RBX: ffffa389886efea8 RCX: 0000000000000090
[ 332.057536] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff1201c700
[ 332.057539] __do_page_fault+0x11d/0x470
[ 332.057540] ? 0xffffffffc0477000
[ 332.057541] RDX: 0000000000000090 RSI: ffffa389886efdf8 RDI: 000000c000100000
[ 332.057552] watchdog_timer_fn+0x253/0x260
[ 332.057555] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057556] ? softlockup_fn+0x40/0x40
[ 332.057557] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057559] __hrtimer_run_queues+0xeb/0x250
[ 332.057560] R10: ffff8bfb1690a310 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee04df00
[ 332.057562] hrtimer_interrupt+0x122/0x270
[ 332.057563] RIP: 0033:0x7eff1b11e3a4
[ 332.057564] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057566] smp_apic_timer_interrupt+0x6a/0x140
[ 332.057568] do_page_fault+0x32/0x140
[ 332.057570] apic_timer_interrupt+0xf/0x20
[ 332.057572] _copy_to_user+0x22/0x30
[ 332.057573] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057574] </IRQ>
[ 332.057575] RSP: 002b:00007eff1181aed8 EFLAGS: 00000246
[ 332.057578] RIP: 0010:__do_page_fault+0x227/0x470
[ 332.057579] ORIG_RAX: 0000000000000005
[ 332.057580] Code: 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 f6 85 91 00 00 00 02 41 bf 14 00 00 00 0f 84 c5 fe ff ff fb 66 0f 1f 44 00 00 <e9> b9 fe ff ff f6 85 88 00 00 00 03 75 0d f6 85 92 00 00 00 04 0f
[ 332.057582] cp_new_stat+0x13d/0x160
[ 332.057583] RSP: 0018:ffffa389886f7ca0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 332.057585] __se_sys_newfstat+0x2e/0x40
[ 332.057586] RAX: 0000000000000000 RBX: 0000000000000002 RCX: ffffffff93a00ae0
[ 332.057587] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057588] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffa389886f7d38
[ 332.057589] do_syscall_64+0x5b/0x1b0
[ 332.057590] RBP: ffffa389886f7d38 R08: 0000000000000000 R09: 0000000000000000
[ 332.057591] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000009
[ 332.057592] R10: 0000000000000000 R11: 0000000000000000 R12: 000000c000100000
[ 332.057594] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057595] R13: ffff8bfb168bd940 R14: ffff8bfaee04af80 R15: 0000000000000014
[ 332.057597] RIP: 0033:0x7eff1b11e3a4
[ 332.057599] async_page_fault+0x1e/0x30
[ 332.057601] ? restore_regs_and_return_to_kernel+0x25/0x25
[ 332.057602] RBP: 00007eff1181af10 R08: 00007eff1181b700 R09: 00007eff1181b700
[ 332.057602] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057604] do_page_fault+0x32/0x140
[ 332.057606] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x20
[ 332.057607] R10: 00007eff1181b9d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057608] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057609] async_page_fault+0x1e/0x30
[ 332.057610] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff1181b700
[ 332.057612] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x20
[ 332.057613] RSP: 002b:00007eff08808ed8 EFLAGS: 00000246
[ 332.057614] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 0f 01 cb 83
[ 332.057615] ORIG_RAX: 0000000000000005
[ 332.057616] RSP: 0018:ffffa389886f7de8 EFLAGS: 00050206
[ 332.057617] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057618] RAX: 000000c000100090 RBX: ffffa389886f7ea8 RCX: 0000000000000090
[ 332.057619] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 0000000000000024
[ 332.057620] RDX: 0000000000000090 RSI: ffffa389886f7df8 RDI: 000000c000100000
[ 332.057621] RSP: 0018:ffffa389886ffde8 EFLAGS: 00050206
[ 332.057623] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057624] RBP: 00007eff08808f10 R08: 00007eff08809700 R09: 00007eff08809700
[ 332.057625] R10: ffff8bfb1690b810 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee04af80
[ 332.057626] R10: 00007eff088099d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057627] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057628] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff08809700
[ 332.057630] _copy_to_user+0x22/0x30
[ 332.057631] RAX: 000000c000100090 RBX: ffffa389886ffea8 RCX: 0000000000000090
[ 332.057632] cp_new_stat+0x13d/0x160
[ 332.057633] RDX: 0000000000000090 RSI: ffffa389886ffdf8 RDI: 000000c000100000
[ 332.057634] RBP: 000000c000100000 R08: 0000000000000000 R09: 0000000000000000
[ 332.057635] __se_sys_newfstat+0x2e/0x40
[ 332.057636] R10: ffff8bfb1690ad10 R11: ffff8bfb1f01a6c8 R12: ffff8bfaee048000
[ 332.057637] do_syscall_64+0x5b/0x1b0
[ 332.057638] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 332.057640] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057642] _copy_to_user+0x22/0x30
[ 332.057643] RIP: 0033:0x7eff1b11e3a4
[ 332.057645] cp_new_stat+0x13d/0x160
[ 332.057646] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057647] __se_sys_newfstat+0x2e/0x40
[ 332.057648] RSP: 002b:00007eff08007ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000005
[ 332.057651] do_syscall_64+0x5b/0x1b0
[ 332.057652] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057654] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 332.057655] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 000000000000002e
[ 332.057656] RIP: 0033:0x7eff1b11e3a4
[ 332.057657] RBP: 00007eff08007f10 R08: 00007eff08008700 R09: 00007eff08008700
[ 332.057658] Code: 00 f7 d8 64 89 02 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 83 ff 01 89 f0 77 19 48 63 f8 48 89 d6 b8 05 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 18 f3 c3 66 90 48 8b 05 99 7a 2d 00 64 c7 00
[ 332.057659] R10: 00007eff080089d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057660] RSP: 002b:00007eff07806ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000005
[ 332.057662] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff08008700
[ 332.057663] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff1b11e3a4
[ 332.057663] RDX: 000000c000100000 RSI: 000000c000100000 RDI: 000000000000001e
[ 332.057664] RBP: 00007eff07806f10 R08: 00007eff07807700 R09: 00007eff07807700
[ 332.057665] R10: 00007eff078079d0 R11: 0000000000000246 R12: 0000000000000000
[ 332.057665] R13: 0000000000801000 R14: 0000000000000000 R15: 00007eff07807700
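On an instance that is still running, the kernel log can be searched for the watchdog message shown in the trace above. The following is a minimal sketch; the helper name `has_soft_lockup` is hypothetical, and reading the kernel log with `dmesg` may require root privileges.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the text on stdin contains the
# soft lockup message printed by the kernel watchdog.
has_soft_lockup() {
    grep -q 'BUG: soft lockup'
}

# Search the current kernel ring buffer (may need root on some systems).
if dmesg 2>/dev/null | has_soft_lockup; then
    echo "soft lockup found in kernel log"
else
    echo "no soft lockup found in kernel log"
fi
```

After a panic and reboot the ring buffer starts empty, so this check is only useful before the instance goes down or against a saved copy of the log piped into the helper.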
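Cause
Alibaba Cloud Linux enables THP (Transparent Huge Pages) by default. During garbage collection (GC) memory reclamation, the process uses MADV_NOHUGEPAGE to disable huge pages and then MADV_FREE to release some 4 KB pages, which causes the operating system to split the THP huge page. When kernel page-fault handling for other processes occupies the CPUs, the process that is splitting the THP cannot be scheduled to completion. Because the split cannot complete, the page-faulting processes cannot finish either. Both sides keep waiting for each other, which eventually leads to a soft lockup. If /proc/sys/kernel/softlockup_panic is enabled on the Alibaba Cloud Linux instance, the soft lockup triggers a kernel panic.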
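The two settings mentioned above can be inspected on a running instance. This is a minimal sketch; the helper name `show_setting` is hypothetical, and the THP path `/sys/kernel/mm/transparent_hugepage/enabled` is the standard Linux location rather than one named in this article.

```shell
#!/bin/sh
# Hypothetical helper: print a kernel setting, or note that the file
# is not readable on this system.
show_setting() {
    if [ -r "$1" ]; then
        printf '%s: %s\n' "$1" "$(cat "$1")"
    else
        printf '%s: not available\n' "$1"
    fi
}

# THP mode; the active value is shown in square brackets.
show_setting /sys/kernel/mm/transparent_hugepage/enabled

# 1 means a soft lockup triggers a kernel panic, 0 means it only logs.
show_setting /proc/sys/kernel/softlockup_panic
```

The sketch only reads the values and does not change any setting.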
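Solution
If you make risky changes to an instance or its data, pay attention to the disaster recovery and fault tolerance capabilities of the instance to keep your data safe.
If you modify the configuration or data of an instance (including but not limited to ECS and RDS), we recommend that you create a snapshot or enable features such as RDS log backup in advance.
If you have authorized or submitted security information such as login accounts and passwords on the Alibaba Cloud platform, we recommend that you change them promptly.
When you encounter this problem, you can handle it as follows:
Log in to the ECS instance. For details, see "Connection methods overview".
Run the following command to confirm that the kernel version is covered by this solution.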
uname -r
The output is similar to the following.
4.19.91-19.1.al7.x86_64
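Choose the fix that matches your kernel version:
For versions earlier than 4.19.91-19.1.al7.x86_64 (exclusive):
Run the following command to update the kernel to the latest version.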
yum update kernel
The new kernel takes effect only after a restart. Run the following command to restart the server.
reboot
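If the problem persists on the latest kernel version, install the kernel hotfix as described in the next step.
For versions from 4.19.91-19.1.al7.x86_64 (inclusive) to 4.19.91-23.al7.x86_64 (inclusive), the problem can be fixed by installing a kernel hotfix with the following command.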
yum install -y kernel-hotfix-5902278-`uname -r | awk -F"-" '{print $NF}'`
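The version check and the package name used in the steps above can be previewed without installing anything. The following is a minimal sketch: the function names `ver_le`, `classify_kernel`, and `hotfix_pkg_name` are hypothetical, and the comparison assumes GNU `sort -V` orders these release strings as intended, so unusual release strings should still be checked by hand.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 sorts at or before
# version $2 under GNU sort -V.
ver_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Map a kernel release to the fix described in the steps above.
classify_kernel() {
    current="$1"
    if ! ver_le "4.19.91-19.1.al7.x86_64" "$current"; then
        echo "update-kernel"      # earlier than 4.19.91-19.1.al7
    elif ver_le "$current" "4.19.91-23.al7.x86_64"; then
        echo "install-hotfix"     # 19.1.al7 through 23.al7, inclusive
    else
        echo "not-covered"        # later than the range in this article
    fi
}

# Build the hotfix package name with the same awk expression as the
# install command: keep the text after the last "-" in the release.
hotfix_pkg_name() {
    suffix=$(printf '%s\n' "$1" | awk -F"-" '{print $NF}')
    printf 'kernel-hotfix-5902278-%s\n' "$suffix"
}

# Dry run against the running kernel; nothing is installed.
rel=$(uname -r)
echo "kernel: $rel"
echo "suggested action: $(classify_kernel "$rel")"
echo "hotfix package: $(hotfix_pkg_name "$rel")"
```

For the sample release `4.19.91-19.1.al7.x86_64`, the awk expression keeps `19.1.al7.x86_64`, so the package requested by the install command is `kernel-hotfix-5902278-19.1.al7.x86_64`.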
Applies to
Elastic Compute Service (ECS)