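Troubleshoot crashes on Linux ECS instances

If a Linux ECS instance encounters a kernel panic, an out-of-memory (OOM) condition, or a system freeze while running, or if you receive a system event reporting that the instance's operating system has crashed, the instance has gone down. You can use the self-service diagnostic tool or the kernel logs to locate and resolve the problem.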

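Identify the cause of the crash

You can identify the specific cause of a crash in any of the following ways.

Method 1 (recommended): Use the self-service diagnostic tool

  1. Log on to the ECS console. In the left-side navigation pane, click Self-service Troubleshooting.

  2. Click the Instance Troubleshooting tab.

  3. Select Instance Connection or Startup Failure > Instance Crashed, select the ID of the instance that crashed, and then click Start Troubleshooting.

    Use the returned diagnostic results and suggested fixes to locate and resolve the problem.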

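Method 2: Use system events

  1. Log on to the ECS console. In the left-side navigation pane, click Events.

  2. In the left-side navigation pane, click Unexpected Maintenance Events.

  3. Find the instance that has the crash maintenance event, and click Diagnose Root Cause of OS Error on the right to diagnose why the instance crashed.

    Use the returned diagnostic results and suggested fixes to locate and resolve the problem.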

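Method 3: Use kdump to inspect kernel logs

If kdump is installed and configured, a vmcore-dmesg.txt file is generated when the system crashes. This file contains the kernel log at the time of the crash. Use the call trace in it (the block that usually starts with "Call Trace:") to find where the problem occurred, analyze the cause, and then fix and debug the issue.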
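
As a rough illustration of the step above, the following shell sketch extracts the Call Trace blocks from a saved kernel log. The function name print_call_traces is hypothetical, and the path in the usage comment assumes the common kdump default directory /var/crash; adjust it to wherever your kdump configuration writes vmcore-dmesg.txt.

```shell
#!/usr/bin/env bash
# Print every "Call Trace:" block found in a kernel log file.
# A block starts at a line containing "Call Trace:" and continues while the
# lines look like stack frames ("symbol+0x.../0x...") or context markers
# such as <IRQ>. Any other line ends the block.
print_call_traces() {
    awk '
        /Call Trace:/                            { in_trace = 1; print; next }
        in_trace && /\+0x[0-9a-f]+\/0x[0-9a-f]+/ { print; next }
        in_trace && /<\/?[A-Z]+>/                { print; next }
        { in_trace = 0 }
    ' "$1"
}

# Example (path is an assumption; the timestamped directory name varies):
#   print_call_traces /var/crash/<dump-directory>/vmcore-dmesg.txt
```

The top frames of the trace, together with the RIP line when present, are usually the strings to match against the crash signatures listed later in this topic.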

Hands-on practice

To try out the content of this topic, click Verify the Guest OS panic diagnostic capability.

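Common crash causes and solutions

The instance crashes and logs "not syncing: Out of memory: system-wide panic_on_oom is enabled"

  • Problem description

    A Linux ECS instance crashes while running and logs "not syncing: Out of memory: system-wide panic_on_oom is enabled". The call stack is similar to the following: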

    [3624965.306801] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
    [3624965.307824] CPU: 5 PID: 8510 Comm: AliDetect Kdump: loaded Tainted: GOE  ------------ T 3.10.0-1127.10.1.el7.x86_64 #1
    [3624965.308923] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [3624965.309671] Call Trace:
    [3624965.309935]  [<ffffffff8f37ffa5>] dump_stack+0x19/0x1b
    [3624965.310444]  [<ffffffff8f379541>] panic+0xe8/0x21f
    [3624965.310913]  [<ffffffff8edc26b5>] check_panic_on_oom+0x55/0x60
    [3624965.311480]  [<ffffffff8edc2aab>] out_of_memory+0x23b/0x4f0
    [3624965.312027]  [<ffffffff8f37b3e0>] __alloc_pages_slowpath+0x5db/0x729
    [3624965.312628]  [<ffffffff8edc91a6>] __alloc_pages_nodemask+0x436/0x450
    [3624965.313233]  [<ffffffff8ee18e78>] alloc_pages_current+0x98/0x110
    [3624965.313808]  [<ffffffff8edbe3d7>] __page_cache_alloc+0x97/0xb0
    [3624965.314364]  [<ffffffff8edc0f90>] filemap_fault+0x270/0x420
    [3624965.314912]  [<ffffffffc04ea7d6>] ext4_filemap_fault+0x36/0x50 [ext4]
    [3624965.315530]  [<ffffffff8ededf4a>] __do_fault.isra.61+0x8a/0x100
    [3624965.316095]  [<ffffffff8edee4fc>] do_read_fault.isra.63+0x4c/0x1b0
    [3624965.316680]  [<ffffffff8edf5d60>] handle_mm_fault+0xa20/0xfb0
    [3624965.317231]  [<ffffffff8f38d653>] __do_page_fault+0x213/0x500
    [3624965.317775]  [<ffffffff8f38da26>] trace_do_page_fault+0x56/0x150
    [3624965.318378]  [<ffffffff8f38cfa2>] do_async_page_fault+0x22/0xf0
    [3624965.318954]  [<ffffffff8f3897a8>] async_page_fault+0x28/0x30
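  • Cause

    The instance ran out of memory (OOM) while the kernel parameter vm.panic_on_oom was set to 1 or 2.

    • 1: When memory runs out, the kernel may panic or may invoke the OOM killer. It invokes the OOM killer instead of panicking when the shortage is confined to nodes restricted by a cpuset or memory policy.

    • 2: When memory runs out, the kernel always panics.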
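
To check which of these behaviors applies on a running instance, you can read the current value with sysctl. The sketch below maps the value to the behaviors described above; the helper name describe_panic_on_oom is hypothetical.

```shell
#!/usr/bin/env bash
# Translate a vm.panic_on_oom value into the behavior described above.
describe_panic_on_oom() {
    case "$1" in
        0) echo "OOM killer runs; no panic" ;;
        1) echo "kernel may panic or may run the OOM killer" ;;
        2) echo "kernel always panics" ;;
        *) echo "unexpected value: $1" ;;
    esac
}

# Usage on a live instance:
#   describe_panic_on_oom "$(sysctl -n vm.panic_on_oom)"
```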

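  • Solution

    Solution 1: Set the kernel parameter vm.panic_on_oom to 0

    Setting vm.panic_on_oom to 0 makes the kernel invoke the OOM killer instead of panicking when memory runs out.

    Important

    With vm.panic_on_oom set to 0, the kernel invokes the OOM killer when memory runs out and terminates processes that use large amounts of memory. This can affect system stability and running applications. Before you make this change, make sure that you understand its impact and evaluate the system's memory management and the needs of your applications.

    1. Connect to the ECS instance.

    2. Run the following command to open the /etc/sysctl.conf file: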

      sudo vim /etc/sysctl.conf
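    3. Press the i key to enter edit mode, and set the parameter as follows:

      vm.panic_on_oom = 0

      This prevents the kernel from panicking when memory runs out.

    4. Press the Esc key, enter :wq, and then press Enter to save the file and exit the editor.

    5. Run the following command to load the changes in sysctl.conf: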

      sudo sysctl -p
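
After you reload the settings, it can be worth confirming that no other configuration file still sets a different value. The following sketch lists every active (uncommented) vm.panic_on_oom assignment in the files you pass to it. The function name find_panic_on_oom_settings is hypothetical, and the /etc/sysctl.d/ path in the usage comment is an assumption about where your distribution keeps drop-in files.

```shell
#!/usr/bin/env bash
# Print every uncommented vm.panic_on_oom assignment in the given files,
# formatted as "file:line:text". Missing files are silently skipped.
find_panic_on_oom_settings() {
    grep -HnE '^[[:space:]]*vm\.panic_on_oom[[:space:]]*=' "$@" 2>/dev/null
}

# Usage:
#   find_panic_on_oom_settings /etc/sysctl.conf /etc/sysctl.d/*.conf
#   sysctl -n vm.panic_on_oom    # value currently in effect
```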

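    Solution 2: Optimize memory usage

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.

    An OOM condition is usually caused by insufficient memory. Determine whether memory usage is reasonable for your workload, and consider the following ways to increase memory capacity or reduce memory usage:

    • Upgrade the instance type

      Upgrading the instance type gives you more memory. For more information, see Change the instance type.

    • Optimize the application

      Check the memory usage of your applications and optimize it, for example by fixing memory leaks or tuning algorithms or configurations.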

The instance crashes and logs "RIP: tcp_create_openreq_child"

  • Problem description

    A Linux ECS instance crashes while running and logs "RIP: tcp_create_openreq_child". The call stack is similar to the following:

    [8343753.027138] Oops: 0000 [#1] SMP PTI
    [8343753.027431] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE     5.4.0-122-generic #138-Ubuntu
    [8343753.028127] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [8343753.028728] RIP: 0010:tcp_create_openreq_child+0x2fd/0x410
    ...
    [8343753.036508] Call Trace:
    [8343753.036710]  <IRQ>
    [8343753.036886]  tcp_v4_syn_recv_sock+0x5a/0x400
    [8343753.037234]  tcp_get_cookie_sock+0x48/0x150
    [8343753.037564]  cookie_v4_check+0x581/0x6d0
    [8343753.037880]  tcp_v4_do_rcv+0x1a5/0x200
    [8343753.038184]  tcp_v4_rcv+0xc76/0xd10
    [8343753.038551]  ip_protocol_deliver_rcu+0x30/0x1b0
    [8343753.038980]  ip_local_deliver_finish+0x48/0x50
    [8343753.039335]  ip_local_deliver+0x73/0xf0
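  • Cause

    A bug in the operating system kernel causes a null pointer dereference, which triggers the kernel's protection mechanism and crashes the instance. For more information, see the bug details.

  • Solution

    Upgrade the operating system kernel to 5.4.0-123.139 or later. For more information, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.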
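
Several fixes in this topic amount to "upgrade the kernel to version X or later". A minimal sketch of that comparison follows, assuming a sort implementation that supports version ordering with -V (GNU coreutils does). The function name kernel_at_least is hypothetical. On Ubuntu, uname -r typically reports only the ABI part of the package version (for example 5.4.0-122-generic), so the example compares against 5.4.0-123 rather than the full package version. Other distributions use different suffixes, and the result should be treated as a hint, not a guarantee.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) if version string $1 sorts at or after $2 in version order.
kernel_at_least() {
    local current="$1" required="$2"
    # If the required version sorts first (or the two are equal),
    # the current version is at least the required one.
    [ "$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)" = "$required" ]
}

# Usage:
#   kernel_at_least "$(uname -r)" "5.4.0-123" && echo "running a fixed kernel or newer"
```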

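The instance crashes and logs "sysrq_handle_crash"

  • Problem description

    A Linux ECS instance crashes and restarts while running and logs "RIP: sysrq_handle_crash". The call stack is similar to the following: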

    [ 7262.769377] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_powerclamp iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper virtio_balloon shpchp cryptd parport_pc parport i2c_piix4 pcspkr ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk virtio_console cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel serio_raw drm ata_piix virtio_pci libata virtio_ring i2c_core virtio floppy
    [ 7262.774113] CPU: 1 PID: 3818 Comm: bash Not tainted 3.10.0-514.26.2.el7.x86_64 #1
    [ 7262.774699] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [ 7262.775317] task: ffff88040d3d5e20 ti: ffff8803cb7ac000 task.ti: ffff8803cb7ac000
    [ 7262.775904] RIP: 0010:[<ffffffff813ee1d6>]  [<ffffffff813ee1d6>] sysrq_handle_crash+0x16/0x20
    ...
    [ 7262.784790] Call Trace:
    [ 7262.784992]  [<ffffffff813ee9f7>] __handle_sysrq+0x107/0x170
    [ 7262.785450]  [<ffffffff813eee6f>] write_sysrq_trigger+0x2f/0x40
    [ 7262.785915]  [<ffffffff8126be0d>] proc_reg_write+0x3d/0x80
    [ 7262.786355]  [<ffffffff811fe9fd>] vfs_write+0xbd/0x1e0
    [ 7262.786759]  [<ffffffff811ff51f>] SyS_write+0x7f/0xe0
    [ 7262.787172]  [<ffffffff81697809>] system_call_fastpath+0x16/0x1b
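  • Cause

    A user deliberately triggered the crash from inside the instance by running the following command:

    echo c > /proc/sysrq-trigger
  • Solution

    Under normal circumstances, do not run echo c > /proc/sysrq-trigger.

    Important

    Running echo c > /proc/sysrq-trigger crashes the kernel immediately, after which the instance restarts. The command is typically used for testing, or to force a kernel crash when the system cannot be shut down normally.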

The instance crashes and logs "RIP:get_target_pstate_use_performance"

  • Problem description

    A Linux ECS instance crashes while running and logs "RIP:get_target_pstate_use_performance". The call stack is similar to the following:

    [    1.076899] divide error: 0000 [#1] SMP
    [    1.077669] Modules linked in:
    [    1.078302] CPU: 4 PID: 9 Comm: rcu_sched Not tainted 3.10.0-1127.19.1.el7.x86_64 #1
    [    1.079519] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
    [    1.080724] task: ffff91c8fa111070 ti: ffff91c8fa11c000 task.ti: ffff91c8fa11c000
    [    1.081919] RIP: 0010:[<ffffffff85dc3089>]  [<ffffffff85dc3089>] get_target_pstate_use_performance+0x29/0xc0
    [    1.083355] RSP: 0000:ffff91c8fa11fb40  EFLAGS: 00010006
    [    1.093192] Call Trace:
    [    1.093715]  [<ffffffff85dc4081>] intel_pstate_update_util+0x161/0x310
    [    1.094550]  [<ffffffff858e9523>] ? load_balance+0x1a3/0xa10
    [    1.095321]  [<ffffffff858e4e87>] update_curr+0x127/0x1e0
    [    1.096123]  [<ffffffff858e52a8>] dequeue_entity+0x28/0x5c0
    [    1.096894]  [<ffffffff8586d3be>] ? kvm_sched_clock_read+0x1e/0x30
    [    1.097702]  [<ffffffff858e5893>] dequeue_task_fair+0x53/0x660
    [    1.098490]  [<ffffffff858debe5>] ? sched_clock_cpu+0x85/0xc0
    [    1.099266]  [<ffffffff858d7a56>] deactivate_task+0x46/0xd0
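  • Cause

    The likely cause is that the current_pstate frequency value of the Intel pstate driver is initialized to 0 while the ECS instance boots. During task switches, the kernel calls the Intel pstate driver to adjust the performance state to the system load. If the driver uses a current_pstate value of 0 in that calculation, a divide-by-zero error can occur and crash the system.

  • Solution

    Upgrade the operating system kernel to 4.18 or later. For more information, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.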

The instance crashes and logs "not syncing: Out of memory and no killable processes"

  • Problem description

    A Linux ECS instance crashes while running and logs "not syncing: Out of memory and no killable processes". The call stack is similar to the following:

    [217894.026467] Out of memory: Kill process 17807 (php-fpm) score 4 or sacrifice child
    [217894.027560] Killed process 17807 (php-fpm) total-vm:386252kB, anon-rss:6972kB, file-rss:144kB, shmem-rss:9020kB
    [217894.910947] php-fpm invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
    [217894.912175] php-fpm cpuset=/ mems_allowed=0
    [217894.913100] CPU: 0 PID: 18534 Comm: php-fpm Tainted: GOE  ------------   3.10.0-957.21.3.el7.x86_64 #1
    [217894.914510] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [217894.915780] Call Trace:
    [217894.916607]  [<ffffffff8ff63107>] dump_stack+0x19/0x1b
    [217894.917775]  [<ffffffff8ff5db2a>] dump_header+0x90/0x229
    [217894.918914]  [<ffffffff8f901292>] ? ktime_get_ts64+0x52/0xf0
    [217894.919979]  [<ffffffff8f9584df>] ? delayacct_end+0x8f/0xb0
    [217894.921026]  [<ffffffff8f9ba834>] oom_kill_process+0x254/0x3d0
    [217894.922097]  [<ffffffff8f9ba2dd>] ? oom_unkillable_task+0xcd/0x120
    [217894.923248]  [<ffffffff8f9ba386>] ? find_lock_task_mm+0x56/0xc0
    [217894.924364]  [<ffffffff8f9bb076>] out_of_memory+0x4b6/0x4f0
    [217894.925513]  [<ffffffff8ff5e62e>] __alloc_pages_slowpath+0x5d6/0x724
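  • Cause

    The system ran out of memory and found no process that it could terminate to free memory, so it could no longer run normally.

  • Solution

    Determine whether memory usage is reasonable for your workload, and consider the following ways to increase memory capacity or reduce memory usage:

    • Upgrade the instance type

      Upgrading the instance type gives you more memory. For more information, see Change the instance type.

    • Optimize the application

      Check which processes on the ECS instance use the most memory, determine whether that usage is reasonable, and optimize it, for example by fixing memory leaks or tuning algorithms or configurations.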
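
One way to find the processes that use the most memory is to rank them by resident set size (RSS). The sketch below is a small filter for that; the function name top_rss is hypothetical, and the ps invocation in the usage comment assumes the procps version of ps that most Linux distributions ship.

```shell
#!/usr/bin/env bash
# Read lines of "PID RSS_KB COMMAND" from stdin and print the rows with the
# largest RSS first. $1 is the number of rows to keep (default: 10).
top_rss() {
    sort -k2,2nr | head -n "${1:-10}"
}

# Usage (the trailing "=" on each column suppresses the header line):
#   ps -eo pid=,rss=,comm= | top_rss 5
```

RSS counts shared pages once per process, so treat the ranking as a starting point for investigation, not an exact accounting.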

The instance crashes and logs "RIP:__list_del_entry_valid.cold"

  • Problem description

    A Linux ECS instance crashes while running and logs "list_del corruption, ffff91bc2ad47048->prev is LIST_POISON2 (dead000000000200)". The call stack is similar to the following:

    [1072741.548729] list_del corruption, ffff91bc2ad47048->prev is LIST_POISON2 (dead000000000200)
    [1072741.549507] ------------[ cut here ]------------
    [1072741.549886] kernel BUG at lib/list_debug.c:50!
    [1072741.550275] invalid opcode: 0000 [#1] SMP PTI
    [1072741.550646] CPU: 0 PID: 1583643 Comm: kworker/0:1 Tainted: G           OE    --------- -  - 4.18.0-305.3.1.el8.x86_64 #1
    [1072741.551468] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [1072741.552048] Workqueue: cgroup_destroy css_release_work_fn
    [1072741.552462] RIP: 0010:__list_del_entry_valid.cold.1+0x45/0x4c
    ...
    [1072741.560426] Call Trace:
    [1072741.560638]  css_release_work_fn+0x3f/0x240
    [1072741.560983]  process_one_work+0x1a7/0x360
    [1072741.561300]  worker_thread+0x30/0x390
    [1072741.561622]  ? create_worker+0x1a0/0x1a0
    [1072741.561933]  kthread+0x116/0x130
    [1072741.562195]  ? kthread_flush_work_fn+0x10/0x10
    [1072741.562557]  ret_from_fork+0x35/0x40
    [1072741.562843] Modules linked in: AliSecGuard(OE) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl joydev pcspkr virtio_balloon i2c_piix4 ip_tables xfs libcrc32c ata_generic cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ata_piix libata crc32c_intel virtio_net net_failover serio_raw failover virtio_console virtio_blk
    [1072741.566968] Features: eBPF/event
    [1072741.567302] ---[ end trace 8f40bd2bf2a072e5 ]---
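  • Cause

    A bug in the operating system kernel: a list_del error, LIST_POISON2 (dead000000000200), crashes the instance. For more information, see the bug details.

  • Solution

    Upgrade the operating system kernel to kernel-4.18.0-305.12.1.el8_4 or later. For more information, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.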

The instance crashes and logs "RIP:module_put"

  • Problem description

    A Linux ECS instance crashes while running and logs "RIP:module_put". The call stack is similar to the following:

    [86389.969666] CPU: 2 PID: 1426 Comm: Syn-1203-Tx Tainted: GOE  ------------   3.10.0-1160.53.1.el7.x86_64 #1
    [86389.970626] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [86389.971377] task: ffff983118bfc200 ti: ffff982defd58000 task.ti: ffff982defd58000
    [86389.972034] RIP: 0010:[<ffffffff8c91956d>]  [<ffffffff8c91956d>] module_put+0x1d/0x80
    ...
    [86389.979170] Call Trace:
    [86389.979378]  [<ffffffff8ca53b40>] cdev_put+0x20/0x30
    [86389.979768]  [<ffffffff8ca5098f>] __fput+0x1ef/0x230
    [86389.980151]  [<ffffffff8ca50abe>] ____fput+0xe/0x10
    [86389.980526]  [<ffffffff8c8c299b>] task_work_run+0xbb/0xe0
    [86389.980946]  [<ffffffff8c8a1954>] do_exit+0x2d4/0xa30
    [86389.981375]  [<ffffffff8c91358f>] ? futex_wait+0x11f/0x280
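  • Cause

    A system process used or accessed memory that had already been freed (a use-after-free bug). This triggered the operating system's protection mechanism or corrupted data, and the system crashed.

    Note

    Use-after-free is a common class of software bug in which a program mistakenly uses or accesses memory after it has been freed. It can lead to unpredictable behavior such as crashes, data corruption, data leaks, or execution of malicious code.

  • Solution

    Upgrade the operating system kernel to kernel-4.18.0-305.12.1.el8_4 or later. For more information, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.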

The instance crashes and logs "containerd: page allocation failure"

  • Problem description

    A Linux ECS instance crashes while running and logs "containerd: page allocation failure". The call stack is similar to the following:

    [1558839.130515] ------------[ cut here ]------------
    [1558839.131215] kernel BUG at lib/idr.c:1163!
    [1558839.131797] invalid opcode: 0000 [#1] SMP 
    [1558839.132411] Modules linked in: binfmt_misc AliSecGuard(OE) AliSecProcFilter64(OE) AliSecNetFlt64(OE) xt_CT xt_multiport ipt_rpfilter iptable_raw ip_set_hash_net ip_set_hash_ip ipip tunnel4 ip_tunnel veth ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables iptable_mangle nf_conntrack_netlink xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_mark xt_addrtype xt_set ip_set_bitmap_port ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set nfnetlink dummy xt_comment iptable_nat nf_nat_ipv4 nf_nat iptable_filter tcp_diag inet_diag overlay(T) sunrpc nfit ppdev libnvdimm iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr parport_pc parport i2c_piix4 nf_conntrack_ipv4 nf_defrag_ipv4 ip_vs_sh ip_vs_wrr
    [1558839.141715]  ip_vs_rr ip_vs nf_conntrack libcrc32c br_netfilter bridge stp llc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_console virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy virtio drm_panel_orientation_quirks
    [1558839.147553] CPU: 6 PID: 21465 Comm: kworker/6:0 Tainted: G           OE  ------------ T 3.10.0-957.21.3.el7.x86_64 #1
    [1558839.149181] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [1558839.150656] Workqueue: events free_work
    [1558839.151766] task: ffff8fbc4d6e9040 ti: ffff8fb8b898c000 task.ti: ffff8fb8b898c000
    [1558839.153196] RIP: 0010:[<ffffffff967774e1>]  [<ffffffff967774e1>] ida_simple_remove+0x41/0x50
    ...
    [1558839.171901] Call Trace:
    [1558839.173133]  [<ffffffff966306c4>] __mem_cgroup_free+0x234/0x250
    [1558839.174750]  [<ffffffff966306f5>] free_work+0x15/0x20
    [1558839.176259]  [<ffffffff964b9ebf>] process_one_work+0x17f/0x440
    [1558839.177872]  [<ffffffff964baf56>] worker_thread+0x126/0x3c0
    [1558839.179421]  [<ffffffff964bae30>] ? manage_workers.isra.25+0x2a0/0x2a0
    [1558839.181092]  [<ffffffff964c1da1>] kthread+0xd1/0xe0
    [1558839.182839]  [<ffffffff964c1cd0>] ? insert_kthread_work+0x40/0x40
    [1558839.184543]  [<ffffffff96b75c37>] ret_from_fork_nospec_begin+0x21/0x21
    [1558839.186238]  [<ffffffff964c1cd0>] ? insert_kthread_work+0x40/0x40
    ...
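  • Cause

    A bug in the operating system kernel: when memory control groups (memory cgroups) are enabled, the memcg_caches[] array grows for each registered kernel memory cache. If no memory is available, that is, when an out-of-memory event occurs, the system may crash.

  • Solution

    On CentOS 7.7, we recommend that you upgrade to kernel-3.10.0-1062.el7 or later. On CentOS 7.6, we recommend that you upgrade to kernel-3.10.0-957.27.2.el7 or later. For more information, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.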

The instance crashes and logs "RIP:blk_mq_rq_timed_out"

  • Problem description

    A Linux ECS instance crashes while running and logs "RIP:blk_mq_rq_timed_out". The call stack is similar to the following:

    [8837401.113325] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
    [8837401.114219] IP: [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0
    [8837401.114892] PGD 8000000885d08067 PUD e1beda067 PMD 0 
    [8837401.115471] Oops: 0000 [#1] SMP 
    [8837401.115855] Modules linked in: AliSecNetFlt64(OE) AliSecGuard(OE) AliSecProcFilter64(OE) xt_multiport veth ipt_rpfilter ip6t_rpfilter ip6t_MASQUERADE nf_nat_masquerade_ipv6 xt_set iptable_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_filter ip6table_raw ip6_tables ip_set_hash_ip ip_set_hash_net ip_set sch_htb xt_nat xt_statistic ipt_REJECT nf_reject_ipv4 nf_tables iptable_mangle xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat tcp_diag inet_diag nfsv3 nfs_acl nfs lockd grace fscache overlay(T) sunrpc nfit libnvdimm iosf_mbi crc32_pclmul ppdev virtio_balloon joydev ghash_clmulni_intel parport_pc aesni_intel parport lrw gf128mul glue_helper i2c_piix4 ablk_helper pcspkr cryptd ip_vs_rr ip_vs_sh ip_vs_wrr ip_vs nf_conntrack ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net net_failover virtio_console virtio_blk failover cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy drm_panel_orientation_quirks virtio libcrc32c br_netfilter bridge stp llc [last unloaded: AliSecNetFlt64]
    [8837401.130281] CPU: 0 PID: 163944 Comm: kworker/0:1H Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.80.1.el7.x86_64 #1
    [8837401.133029] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
    [8837401.134621] Workqueue: kblockd blk_mq_timeout_work
    [8837401.135916] task: ffff88258a0b6300 ti: ffff8820c2b9c000 task.ti: ffff8820c2b9c000
    [8837401.137422] RIP: 0010:[<ffffffffae575638>]  [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0
    [8837401.139091] RSP: 0018:ffff8820c2b9fd18  EFLAGS: 00010246
    [8837401.140371] RAX: 0000000000000000 RBX: ffff8819b6ad0000 RCX: 0000000000000000
    [8837401.141838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8819b6ad0000
    [8837401.143314] RBP: ffff8820c2b9fd20 R08: 000000030ec11230 R09: df98ad67960c8828
    [8837401.144732] R10: df98ad67960c8828 R11: ffff8822d9e17f00 R12: ffff8819b6863240
    [8837401.146161] R13: 0000000000000002 R14: 0000000000000020 R15: 0000000000000002
    [8837401.147605] FS:  0000000000000000(0000) GS:ffff8829bfc00000(0000) knlGS:0000000000000000
    [8837401.149177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [8837401.150426] CR2: 00000000000000d0 CR3: 00000003e570a000 CR4: 00000000003606f0
    [8837401.151844] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [8837401.153287] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [8837401.154667] Call Trace:
    [8837401.155579]  [<ffffffffae57572c>] blk_mq_check_expired+0x6c/0x80
    [8837401.157057]  [<ffffffffae578dac>] bt_iter+0x5c/0x70
    [8837401.158357]  [<ffffffffae57984b>] blk_mq_queue_tag_busy_iter+0x13b/0x320
    [8837401.159675]  [<ffffffffae2e84c9>] ? pick_next_entity+0xa9/0x190
    [8837401.160968]  [<ffffffffae5756c0>] ? blk_mq_rq_timed_out+0xa0/0xa0
    [8837401.162414]  [<ffffffffae5756c0>] ? blk_mq_rq_timed_out+0xa0/0xa0
    [8837401.163748]  [<ffffffffae57428b>] blk_mq_timeout_work+0x8b/0x180
    [8837401.165062]  [<ffffffffae2c319f>] process_one_work+0x17f/0x440
    [8837401.166329]  [<ffffffffae2c42e6>] worker_thread+0x126/0x3c0
    [8837401.167541]  [<ffffffffae2c41c0>] ? manage_workers.isra.26+0x2b0/0x2b0
    [8837401.169048]  [<ffffffffae2cb4d1>] kthread+0xd1/0xe0
    [8837401.170311]  [<ffffffffae2cb400>] ? insert_kthread_work+0x40/0x40
    [8837401.171514]  [<ffffffffae9c51f7>] ret_from_fork_nospec_begin+0x21/0x21
    [8837401.172861]  [<ffffffffae2cb400>] ? insert_kthread_work+0x40/0x40
    [8837401.174091] Code: 83 84 c6 80 00 00 00 01 e8 f6 fe ff ff 5d c3 cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 53 48 8b 57 58 48 8b 47 38 48 89 fb 83 e2 02 <48> 8b 80 d0 00 00 00 74 4c 48 83 78 10 00 74 50 48 ba 00 00 00 
    [8837401.178255] RIP  [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0
    [8837401.179436]  RSP <ffff8820c2b9fd18>
    [8837401.180300] CR2: 00000000000000d0
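  • Cause

    A bug in the operating system kernel: the code dereferences a null pointer, which triggers a memory access fault and crashes the instance. For more information, see the bug details.

  • Solution

    Upgrade the operating system kernel to kernel-3.10.0-1160.88.1.el7 or later. For more information, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.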

The instance crashes and logs "RIP:strnlen"

  • Problem description

    A Linux ECS instance crashes while running and logs "RIP:strnlen". The call stack is similar to the following:

    [86390.829326] BUG: unable to handle kernel paging request at 0000000100620100
    [86390.829510] IP: [<ffffffff9ed7f2ad>] strnlen+0xd/0x40
    [86390.829632] PGD 0 
    [86390.829685] Oops: 0000 [#1] SMP 
    [86390.829766] Modules linked in: AliSecGuard(OE) binfmt_misc xt_conntrack iptable_filter iptable_nat nf_nat_ipv4 arc4 emp(OE) nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat nf_conntrack eudp(E) libcrc32c ppdev intel_powerclamp crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc virtio_balloon parport i2c_piix4 pcspkr ip_tables ext4 mbcache jbd2 cirrus drm_kms_helper syscopyarea sysfillrect virtio_net virtio_console virtio_blk sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common drm crc32c_intel serio_raw floppy virtio_pci virtio_ring virtio drm_panel_orientation_quirks
    [86390.831199] CPU: 2 PID: 1311 Comm: KeepAlive Tainted: G           OE  ------------   3.10.0-957.el7.x86_64 #1
    [86390.831410] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 9e9f1cc 04/01/2014
    [86390.831580] task: ffff97c77add9040 ti: ffff97c77ade0000 task.ti: ffff97c77ade0000
    [86390.831742] RIP: 0010:[<ffffffff9ed7f2ad>]  [<ffffffff9ed7f2ad>] strnlen+0xd/0x40
    ......
    [86390.833643] Call Trace:
    [86390.833699]  [<ffffffff9ed8105b>] string.isra.7+0x3b/0xf0
    [86390.833805]  [<ffffffff9ed82771>] vsnprintf+0x201/0x6a0
    [86390.833908]  [<ffffffff9ed82c1d>] vscnprintf+0xd/0x30
    [86390.834011]  [<ffffffff9ea9a24b>] vprintk_emit+0x11b/0x510
    [86390.834143]  [<ffffffff9ea9a8a9>] ? vprintk_default+0x29/0x40
    [86390.834277]  [<ffffffff9ed77ef0>] ? kobject_put+0x50/0x60
    [86390.834407]  [<ffffffff9ea9a65f>] vprintk+0x1f/0x30
    [86390.834517]  [<ffffffff9ea975ef>] __warn+0x7f/0x100
    [86390.834618]  [<ffffffff9ea976cf>] warn_slowpath_fmt+0x5f/0x80
    [86390.834746]  [<ffffffffc02e2b64>] ? close_eudp_mmap_dev+0x1b4/0x200 [eudp]
    [86390.834896]  [<ffffffff9ed77ef0>] kobject_put+0x50/0x60
    [86390.835013]  [<ffffffff9ec466f8>] cdev_put+0x18/0x30
    [86390.835125]  [<ffffffff9ec4350a>] __fput+0x21a/0x260
    [86390.835232]  [<ffffffff9ec4363e>] ____fput+0xe/0x10
    [86390.835340]  [<ffffffff9eabe79b>] task_work_run+0xbb/0xe0
    [86390.835459]  [<ffffffff9ea9dc61>] do_exit+0x2d1/0xa40
    [86390.835568]  [<ffffffff9ea9e44f>] do_group_exit+0x3f/0xa0
    [86390.835695]  [<ffffffff9eaaf24e>] get_signal_to_deliver+0x1ce/0x5e0
    [86390.835830]  [<ffffffff9ea2b527>] do_signal+0x57/0x6f0
    [86390.835942]  [<ffffffff9eac57e0>] ? hrtimer_get_res+0x50/0x50
    [86390.836068]  [<ffffffff9ea2bc32>] do_notify_resume+0x72/0xc0
    [86390.836202]  [<ffffffff9f175124>] int_signal+0x12/0x17
    ...
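  • Cause

    The third-party kernel module eudp is installed on the system. A bug in that module (for example, passing an invalid argument to the strnlen function) crashes the instance.

  • Solution

    We recommend that you uninstall the third-party module eudp.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.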
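
When a crash log points at a third-party module, as the eudp frames in the call trace above do, the per-module taint flags in the "Modules linked in:" line can help narrow down the candidates. In that line, O marks an out-of-tree module and E marks an unsigned module. The following sketch lists every module that carries a flag; the function name list_flagged_modules is hypothetical. Vendor agents can also appear in the output, so a flag alone does not prove that a module is at fault.

```shell
#!/usr/bin/env bash
# Print each module in the "Modules linked in:" lines of a kernel log that
# carries taint flags in parentheses, for example "eudp(E)".
list_flagged_modules() {
    grep 'Modules linked in:' "$1" | grep -oE '[A-Za-z0-9_]+\([A-Z]+\)'
}

# Usage:
#   list_flagged_modules vmcore-dmesg.txt
#   lsmod | grep eudp     # check whether a suspect module is currently loaded
```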

The instance crashes and logs "RIP:filp_close"

  • Problem description

    A Linux ECS instance crashes while running and logs "RIP:filp_close". The call stack is similar to the following:

    [ 1891.552008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000036
    [ 1891.552149] IP: [<ffffffff8801c67e>] filp_close+0xe/0x90
    [ 1891.552239] PGD 40819b067 PUD 40819a067 PMD 0 
    [ 1891.552321] Oops: 0000 [#1] SMP 
    [ 1891.552380] Modules linked in: AliSecGuard(OE) AliSecNetFlt64(OE) tampercore(OE) tampercfg(OE) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_powerclamp crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc parport i2c_piix4 shpchp virtio_balloon pcspkr ip_tables ext4 mbcache jbd2 cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net virtio_console virtio_blk drm crct10dif_pclmul crct10dif_common virtio_pci crc32c_intel virtio_ring i2c_core serio_raw virtio floppy
    [ 1891.553945] CPU: 3 PID: 2778 Comm: AliHips Tainted: G           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
    [ 1891.554107] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 9e9f1cc 04/01/2014
    [ 1891.554228] task: ffff88d4cd7e4f10 ti: ffff88d4c5af8000 task.ti: ffff88d4c5af8000
    [ 1891.554346] RIP: 0010:[<ffffffff8801c67e>]  [<ffffffff8801c67e>] filp_close+0xe/0x90
    ......
    [ 1891.555727] Call Trace:
    [ 1891.555772]  [<ffffffffc08d0d7c>] is_pathsite+0x1ac/0x400 [tampercore]
    [ 1891.555878]  [<ffffffff88055e1a>] ? bh_lru_install+0x18a/0x1e0
    [ 1891.555974]  [<ffffffff880563fc>] ? __find_get_block+0xbc/0x120
    [ 1891.556069]  [<ffffffff8805648d>] ? __getblk+0x2d/0x300
    [ 1891.556160]  [<ffffffffc02d956b>] ? search_dir+0x8b/0x120 [ext4]
    [ 1891.556258]  [<ffffffff87ebeed5>] ? wake_up_bit+0x25/0x30
    [ 1891.556345]  [<ffffffff88055b2d>] ? __brelse+0x3d/0x50
    [ 1891.556432]  [<ffffffffc02d9a69>] ? ext4_find_entry+0x299/0x570 [ext4]
    [ 1891.556536]  [<ffffffff880380cd>] ? __d_instantiate+0x2d/0xe0
    [ 1891.556629]  [<ffffffff88037446>] ? _d_rehash+0x36/0x40
    [ 1891.556712]  [<ffffffff88037473>] ? d_rehash+0x23/0x40
    [ 1891.556795]  [<ffffffff8803866c>] ? d_splice_alias+0xdc/0x120
    [ 1891.556891]  [<ffffffffc02da368>] ? ext4_lookup+0x118/0x170 [ext4]
    [ 1891.556993]  [<ffffffff8802b2b3>] ? lookup_fast+0xb3/0x230
    [ 1891.557080]  [<ffffffff8802ca48>] ? link_path_walk+0x238/0x8b0
    [ 1891.558026]  [<ffffffff8809769b>] ? proc_pid_permission+0x9b/0xc0
    [ 1891.558976]  [<ffffffff8802dfea>] ? path_lookupat+0x7a/0x8b0
    [ 1891.559917]  [<ffffffffc08d20db>] tamperhack_mkdir.part.4+0x12b/0x190 [tampercore]
    [ 1891.560888]  [<ffffffffc08d2185>] tamperhack_mkdir+0x45/0x50 [tampercore]
    [ 1891.561828]  [<ffffffff8852579b>] system_call_fastpath+0x22/0x27
    [ 1891.562736] Code: ff 00 00 00 00 e9 d3 fe ff ff 0f 1f 00 b8 ea ff ff ff eb 9d e8 c4 7c e7 ff 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 <48> 8b 47 38 48 89 fb 48 85 c0 74 5b 48 8b 47 28 49 89 f4 48 85 
    [ 1891.564925] RIP  [<ffffffff8801c67e>] filp_close+0xe/0x90
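  • Cause

    The third-party kernel module Tampercore is installed on the system. A bug in that module causes an error when the filp_close function is called, which crashes the instance.

  • Solution

    We recommend that you uninstall or upgrade the third-party module Tampercore.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.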

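The instance crashes and logs "VFS: Unable to mount root fs on unknown-block"

  • Problem description

    A Linux ECS instance crashes repeatedly during startup and cannot boot into the system. It logs "VFS: Unable to mount root fs on unknown-block". The call stack is similar to the following: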

    [    1.573197] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
    [    1.574179] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 3.10.0-1160.6.1.el7.x86_64 #1
    [    1.575045] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
    [    1.575900] Call Trace:
    [    1.576246]  [<ffffffff8f381400>] dump_stack+0x19/0x1b
    [    1.576845]  [<ffffffff8f37a958>] panic+0xe8/0x21f
    [    1.577433]  [<ffffffff8f98b794>] mount_block_root+0x291/0x2a0
    [    1.578122]  [<ffffffff8f98b7f6>] mount_root+0x53/0x56
    [    1.578719]  [<ffffffff8f98b935>] prepare_namespace+0x13c/0x174
    [    1.579425]  [<ffffffff8f98b412>] kernel_init_freeable+0x222/0x249
    [    1.580150]  [<ffffffff8f98ab28>] ? initcall_blacklist+0xb0/0xb0
    [    1.580838]  [<ffffffff8f36fa90>] ? rest_init+0x80/0x80
    [    1.581462]  [<ffffffff8f36fa9e>] kernel_init+0xe/0x100
    [    1.582073]  [<ffffffff8f394df7>] ret_from_fork_nospec_begin+0x21/0x21
    [    1.582814]  [<ffffffff8f36fa90>] ? rest_init+0x80/0x80
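  • Cause

    A kernel upgrade was interrupted or failed and corrupted the root file system (rootfs). During startup, the ECS instance cannot find the file system of the root partition, and the instance crashes.

  • Solution

    We recommend that you replace the system disk of the ECS instance, or roll back the disk from an existing snapshot. For more information, see Replace the operating system (system disk) and Roll back a disk by using a snapshot.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up your data and prevent data loss caused by accidental operations. For more information, see Create a snapshot.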

The instance crashes and logs "RIP:virtio_check_driver_offered_feature"

  • Problem description

    A Linux ECS instance crashes while running and logs "RIP:virtio_check_driver_offered_feature". The call stack is similar to the following:

    [55686.388353] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
    [55686.389223] IP: [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
    [55686.390030] PGD 229af2067 PUD 21cbac067 PMD 0 
    [55686.390514] Oops: 0000 [#1] SMP 
    [55686.390867] Modules linked in: unix_diag AliSecGuard(OE) udp_diag tcp_diag inet_diag joydev binfmt_misc xfs libcrc32c dm_mod kvm_amd kvm irqbypass crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper parport_pc ablk_helper cryptd virtio_balloon pcspkr parport i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk virtio_console cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common ata_piix crc32c_intel virtio_pci libata serio_raw virtio_ring virtio drm_panel_orientation_quirks floppy
    [55686.396603] CPU: 0 PID: 19222 Comm: fdisk Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.1.2.el7.x86_64 #1
    [55686.397848] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
    [55686.398578] task: ffff964836e8e2a0 ti: ffff964860370000 task.ti: ffff964860370000
    [55686.399303] RIP: 0010:[<ffffffffc0047450>]  [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
    ....
    [55686.406216] Call Trace:
    [55686.406473]  [<ffffffffc0102b4c>] virtblk_ioctl+0x3c/0x70 [virtio_blk]
    [55686.407098]  [<ffffffff955608b5>] __blkdev_driver_ioctl+0x25/0x40
    [55686.407697]  [<ffffffffc03b5024>] dm_blk_ioctl+0x74/0xb0 [dm_mod]
    [55686.408289]  [<ffffffff955612fa>] blkdev_ioctl+0x28a/0xa20
    [55686.408817]  [<ffffffff95488771>] block_ioctl+0x41/0x50
    [55686.409319]  [<ffffffff9545d9e0>] do_vfs_ioctl+0x3a0/0x5a0
    [55686.409845]  [<ffffffff95305a82>] ? ktime_get+0x52/0xe0
    [55686.410345]  [<ffffffff955024ec>] ? security_file_ioctl+0x1c/0x20
    [55686.410930]  [<ffffffff9545dc81>] SyS_ioctl+0xa1/0xc0
    [55686.411429]  [<ffffffff9598cede>] system_call_fastpath+0x25/0x2a
    [55686.411999] Code: d5 89 de 48 c7 c7 e0 93 04 c0 e8 4c 98 53 d5 5b 5d c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 8f a0 00 00 00 48 89 e5 <8b> 91 90 00 00 00 85 d2 74 2c 48 8b 81 88 00 00 00 39 30 74 59 
    [55686.414738] RIP  [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
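  • Cause

    The instance uses Logical Volume Manager (LVM), and a logical volume (LV) is associated with a device (vdc in this example) that has since been deleted. Because LVM still retains the configuration for that device, running a command that touches it, such as blkid or fdisk, crashes the instance.

  • Solution

    • Solution 1: Use LVM commands to delete the configuration of the device that no longer exists, so that the LVM configuration matches the actual devices.

    • Solution 2: Upgrade the kernel to kernel-3.10.0-1160.6.1.el7 or later. For more information, see Upgrade the kernel of a Linux ECS instance.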

The instance crashes and logs "Out of memory and no killable processes"

  • Problem description

    A Linux ECS instance crashes while running and logs "Out of memory and no killable processes". The call stack is similar to the following:

    [28663.625353] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
    [28663.625363] [ 1799]     0  1799    26512      245      56       3        0         -1000 sshd
    [28663.625367] [29219]     0 29219    10832      126      26       3        0         -1000 systemd-udevd
    [28663.625375] Kernel panic - not syncing: Out of memory and no killable processes...
    [28663.634374] CPU: 1 PID: 3578 Comm: kworker/u176:4 Tainted: G           OE   3.10.0-1062.9.1.el7.x86_64 #1
    [28663.676873] Call Trace:
    [28663.679312]  [<ffffffff8139f342>] dump_stack+0x63/0x81
    [28663.684421]  [<ffffffff811b2245>] panic+0xf8/0x244
    [28663.689184]  [<ffffffff811b98db>] out_of_memory+0x2eb/0x550
    [28663.694726]  [<ffffffff811be254>] __alloc_pages_may_oom+0x114/0x1c0
    [28663.700959]  [<ffffffff811bedb3>] __alloc_pages_slowpath+0x7d3/0xa40
    [28663.707279]  [<ffffffff811bf229>] __alloc_pages_nodemask+0x209/0x260
    [28663.713599]  [<ffffffff81216535>] alloc_pages_current+0x95/0x140
    [28663.719573]  [<ffffffff811ba5ee>] __get_free_pages+0xe/0x40
    [28663.725113]  [<ffffffff81075dae>] pgd_alloc+0x1e/0x160
    [28663.730225]  [<ffffffff810875e4>] mm_init+0x184/0x240
    [28663.735249]  [<ffffffff81088102>] mm_alloc+0x52/0x60
    [28663.740186]  [<ffffffff81257640>] do_execveat_common.isra.37+0x250/0x780
    [28663.759839]  [<ffffffff81257b9c>] do_execve+0x2c/0x30
    [28663.764864]  [<ffffffff810a231b>] call_usermodehelper_exec_async+0xfb/0x150
    [28663.777246]  [<ffffffff81741dd9>] ret_from_fork+0x39/0x50
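  • Cause

    After a memory allocation fails, the kernel tries to free memory by killing a process. If no process can be killed, the kernel deliberately panics. Possible causes:

    • The kernel has a memory leak, which leaves the system short of available memory.

    • Processes whose oom_score_adj is -1000 consume too much memory. These processes cannot be terminated, which leaves the system short of available memory.

      Note

      oom_score_adj adjusts how likely the OOM (Out of Memory) killer is to terminate a process. The kernel computes an OOM score (oom_score) for each process and terminates the process with the highest score first. oom_score_adj ranges from -1000 to 1000 and is factored into that score: a lower value makes the process less likely to be terminated, and -1000 exempts the process from OOM termination entirely.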

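  • Solution

    1. Check whether the kernel has a memory leak.

      For more information, see How do I troubleshoot high slab_unreclaimable memory usage?

    2. Check whether the oom_score_adj setting of each process is appropriate.

      1. Run the following command to obtain the PID of the process. You can also use commands such as ps, top, or pgrep to find the PID.

        ps aux | grep <process name>

        Replace <process name> with the name of the process that you want to find.

      2. Run the following command to check the oom_score_adj setting.

        cat /proc/<PID>/oom_score_adj

        Replace <PID> with the PID that you obtained.

        Use the value of oom_score_adj to judge whether the OOM behavior of the process suits your environment and requirements. A value of -1000 means that the kernel never selects the process for OOM termination. If such a process consumes a large amount of memory, the system can run out of available memory.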
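Checking processes one by one can be slow, so the two steps above can be combined into a single pass over /proc. The following is a minimal sketch, assuming a Linux /proc layout with the oom_score_adj, comm, and status files. list_oom_exempt is a hypothetical helper name, and its optional argument exists only so that the function can be pointed at a different directory.

```shell
# Hypothetical helper: lists processes whose oom_score_adj is -1000,
# one per line as "<PID> <name> <resident memory in kB>" (tab-separated).
list_oom_exempt() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/oom_score_adj" ] || continue
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        [ "$adj" = "-1000" ] || continue
        name=$(cat "$dir/comm" 2>/dev/null)
        # VmRSS is absent for kernel threads; report 0 in that case.
        rss=$(awk '/^VmRSS:/ {print $2}' "$dir/status" 2>/dev/null)
        printf '%s\t%s\t%s kB\n' "${dir##*/}" "$name" "${rss:-0}"
    done
}

# Show the 20 largest exempt processes, biggest resident memory first.
list_oom_exempt | sort -t "$(printf '\t')" -k3,3nr | head -n 20
```

Processes near the top of this list with large resident memory are the ones worth reviewing first.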

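Instance crashes and generates the log “Objects remaining in kmalloc”

  • Problem description

    When you use the memory cgroup kmem feature in an ECS instance, the kernel logs a warning and the instance crashes. The log and call stack are similar to the following: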

    [80569.393775] BUG kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a0) (Tainted: P    B   W  OE  ------------ T):
    Objects remaining in kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a
    [80569.397756] -----------------------------------------------------------------------------
    [80569.397756]
    [80569.400724] INFO: Slab 0xffffea0001e94a00 objects=32 used=1 fp=0xffff88007a528000 flags=0x1fffff00004080
    [80569.402702] CPU: 21 PID: 26626 Comm: dockerd Tainted: P    B   W  OE  ------------ T 3.10.0-693.2.2.el7.x86_64 #1
    [80569.404898] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
    [80569.406747]  ffffea0001e94a00 000000004eb9a19f ffff883afee53aa0 ffffffff816a3db1
    [80569.408833]  ffff883afee53b78 ffffffff811dbf54 ffffffff00000020 ffff883afee53b88
    [80569.410731]  ffff883afee53b38 656a624f8190fff8 616d657220737463 6e6920676e696e69
    [80569.412630] Call Trace:
    [80569.414005]  [<ffffffff816a3db1>] dump_stack+0x19/0x1b
    [80569.415627]  [<ffffffff811dbf54>] slab_err+0xb4/0xe0
    [80569.417204]  [<ffffffff811e0623>] ? __kmalloc+0x1e3/0x230
    [80569.420419]  [<ffffffff811e1939>] kmem_cache_close+0x149/0x2e0
    [80569.422006]  [<ffffffff811e1ae4>] __kmem_cache_shutdown+0x14/0x80
    [80569.423606]  [<ffffffff811a6874>] kmem_cache_destroy+0x44/0xf0
    [80569.425149]  [<ffffffff811f6019>] kmem_cache_destroy_memcg_children+0x89/0xb0
    [80569.426800]  [<ffffffff811a6849>] kmem_cache_destroy+0x19/0xf0
    [80569.428309]  [<ffffffff8123b18e>] bioset_free+0xce/0x110
    [80569.431306]  [<ffffffffc06d0b43>] dm_destroy+0x13/0x20 [dm_mod]
    [80569.432803]  [<ffffffffc06d69be>] dev_remove+0x11e/0x180 [dm_mod]
    [80569.435851]  [<ffffffffc06d7015>] ctl_ioctl+0x1e5/0x500 [dm_mod]
    [80569.437363]  [<ffffffffc06d7343>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
    [80569.438882]  [<ffffffff8121524d>] do_vfs_ioctl+0x33d/0x540
    [80569.443291]  [<ffffffff812154f1>] SyS_ioctl+0xa1/0xc0
    [80569.446228]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
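  • Cause

    When the memory cgroup kmem feature is in use, kmem_cache_destroy first deletes the memcg cache and only then checks whether refcount is 0 while destroying a kmem_cache. If refcount is not 0, other legitimate tasks may still be trying to allocate slabs from the memcg cache of the current kmem_cache, which causes a race condition that triggers the crash.

  • Solution

    We recommend that you disable the memory cgroup kmem feature in the ECS instance. Perform the following steps:

    1. Run the following command to open the /etc/default/grub file.

      vim /etc/default/grub

    2. Press i to enter edit mode, and add the following setting to GRUB_CMDLINE_LINUX.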

      cgroup.memory=nokmem

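    3. Press Esc to exit edit mode, type :wq, and then press Enter to save and close the file.

    4. Run the following command to update GRUB.

      grub2-mkconfig -o /boot/grub2/grub.cfg

    5. Run the following command to restart the ECS instance.

      reboot

    If your operating system cannot disable memory cgroup kmem through the kernel command line (cmdline), make sure that no program in the ECS instance sets memory.kmem.limit_in_bytes. This keeps the memory cgroup kmem feature disabled.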
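After the instance restarts, you may want to confirm that the kernel actually booted with the new parameter. The following is a minimal sketch, assuming the running kernel exposes its command line in /proc/cmdline. nokmem_set is a hypothetical helper name, and its optional argument exists only so that the check can be run against a different file.

```shell
# Hypothetical helper: succeeds if the kernel command line in file $1
# (default: /proc/cmdline) sets cgroup.memory with the nokmem option.
nokmem_set() {
    cmdline_file="${1:-/proc/cmdline}"
    [ -r "$cmdline_file" ] || return 1
    # Split the command line into one parameter per line, then look for
    # nokmem anywhere in the comma-separated cgroup.memory value.
    tr ' ' '\n' < "$cmdline_file" | grep -Eq '^cgroup\.memory=([^,]+,)*nokmem(,.*)?$'
}

if nokmem_set; then
    echo "cgroup.memory=nokmem is on the kernel command line."
else
    echo "cgroup.memory=nokmem is not on the kernel command line."
fi
```

If the parameter is missing after the restart, check that GRUB was updated for the boot entry that the instance actually uses.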

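Instance crashes and generates the log “unable to handle kernel NULL pointer dereference”

  • Problem description

    A Linux ECS instance crashes while running and generates the log “unable to handle kernel NULL pointer dereference”. The call stack is similar to the following: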

    [8794845.086660] BUG: unable to handle kernel NULL pointer dereference at (null)
    [8794845.088500] IP: [<ffffffff8128f89c>] kref_get+0xc/0x30
    [8794845.089355] PGD 812ca2067 PUD 6dd707067 PMD 0 
    [8794845.090303] Oops: 0000 [#1] SMP 
    [8794845.091005] last sysfs file: /sys/devices/system/cpu/online
    [8794845.091861] CPU 3 
    [8794845.092212] Modules linked in: ysec_firewall_kmod(U) tcp_diag inet_diag nf_conntrack_netlink nfnetlink nf_conntrack_ipv6 nf_defrag_ipv6 ip6_tables xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ipv6 virtio_balloon virtio_net virtio_console i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ysec_firewall_kmod]
    [8794845.101913] 
    [8794845.102621] Pid: 21908, comm: ysec_hids_mod_l Tainted: G        W  ---------------    2.6.32-504.16.2.el6.x86_64 #1 Alibaba Cloud Alibaba Cloud ECS
    [8794845.105481] RIP: 0010:[<ffffffff8128f89c>]  [<ffffffff8128f89c>] kref_get+0xc/0x30
    [8794845.107400] RSP: 0018:ffff88045f5a3e38  EFLAGS: 00010292
    [8794845.108628] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000fffffff3
    [8794845.110501] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    [8794845.112371] RBP: ffff88045f5a3e48 R08: 0000000000000000 R09: ffff88050f507f00
    [8794845.114133] R10: 0000000000000003 R11: 0000000000000206 R12: ffffffff8161b040
    [8794845.115994] R13: 0000000000000040 R14: 00007f4b457f94d0 R15: 0000000000000000
    [8794845.117865] FS:  00007f4b457fb700(0000) GS:ffff880030380000(0000) knlGS:0000000000000000
    [8794845.119846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [8794845.121055] CR2: 0000000000000000 CR3: 00000006f6837000 CR4: 00000000001406e0
    [8794845.122807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [8794845.124685] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [8794845.126558] Process ysec_hids_mod_l (pid: 21908, threadinfo ffff88045f5a2000, task ffff8806d43acab0)
    [8794845.128689] Stack:
    [8794845.129414]  ffff88045f5a3e68 0000000000000000 ffff88045f5a3e68 ffffffff810d6ae6
    [8794845.131107] <d> ffffffff8161b040 ffff8806c03a3520 ffff88045f5a3ef8 ffffffff81203898
    [8794845.133479] <d> 00007f4b457f9510 0000000000000000 ffff88045f5a3eb8 ffffffff8128c635
    [8794845.136365] Call Trace:
    [8794845.137127]  [<ffffffff810d6ae6>] pidns_get+0x26/0x30
    [8794845.138367]  [<ffffffff81203898>] proc_ns_readlink+0xc8/0x180
    [8794845.139665]  [<ffffffff8128c635>] ? _atomic_dec_and_lock+0x55/0x80
    [8794845.141008]  [<ffffffff811ab151>] ? touch_atime+0x71/0x1a0
    [8794845.142268]  [<ffffffff81193b0e>] sys_readlinkat+0xfe/0x120
    [8794845.143536]  [<ffffffff81193b4b>] sys_readlink+0x1b/0x20
    [8794845.144695]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
  • Cause

    The kernel or a driver accessed invalid memory.
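To see which code performed the invalid access, it is usually enough to pull the fault line and the instruction pointer lines out of the kernel log. The following is a minimal sketch, assuming the log has been saved to a file such as vmcore-dmesg.txt in the current directory. fault_site is a hypothetical helper name.

```shell
# Hypothetical helper: prints the "BUG: unable to handle kernel ..." line and
# the IP/RIP lines from the kernel log in file $1, without timestamps.
fault_site() {
    grep -E 'BUG: unable to handle kernel|\] (IP|RIP): ' "$1" \
        | sed -E 's/^ *\[[ 0-9.]+\] //'
}

log="vmcore-dmesg.txt"   # assumed location of the saved kernel log
if [ -r "$log" ]; then
    fault_site "$log"
fi
```

For the log above, this prints the BUG line together with the IP and RIP lines, which point at kref_get+0xc/0x30.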

  • Solution

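Instance crashes and generates the log “unable to handle kernel paging request at”

  • Problem description

    A Linux ECS instance crashes while running and generates the log “unable to handle kernel paging request at”. The call stack is similar to the following: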

    [85899.344803] BUG: unable to handle kernel paging request at ffffffffc0b0ceef
    [85899.345643] IP: [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
    [85899.346119] PGD 24f212067 PUD 24f214067 PMD 24e421067 PTE 0
    [85899.346670] Oops: 0010 [#1] SMP 
    [85899.346982] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth rfkill ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink xt_addrtype br_netfilter tcp_diag inet_diag xt_set ip_set_hash_ip tampercfg(OE) overlay(T) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iosf_mbi ppdev virtio_balloon crc32_pclmul parport_pc ghash_clmulni_intel parport shpchp i2c_piix4 aesni_intel lrw gf128mul glue_helper joydev
    [85899.354796]  ablk_helper pcspkr cryptd ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_console virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel virtio_pci i2c_core serio_raw virtio_ring floppy virtio [last unloaded: tampercore]
    [85899.358255] CPU: 2 PID: 1 Comm: systemd Tainted: G           OE  ------------ T 3.10.0-862.14.4.el7.x86_64 #1
    [85899.359264] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [85899.360050] task: ffff9880fa2c0000 ti: ffff9880fa2c8000 task.ti: ffff9880fa2c8000
    [85899.360817] RIP: 0010:[<ffffffffc0b0ceef>]  [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
    [85899.361636] RSP: 0018:ffff9880fa2cbd30  EFLAGS: 00010246
    [85899.362181] RAX: 0000000000000000 RBX: 000055a50e52e3c0 RCX: 0000000000000000
    [85899.362913] RDX: 0000000180080006 RSI: fffff786c5c52800 RDI: 0000000040000000
    [85899.363645] RBP: ffff9880fa2cbf48 R08: ffff9880f14a0000 R09: 0000000180080005
    [85899.364372] R10: 00000000f14a3001 R11: fffff786c5c52800 R12: ffff9880fa2cbd30
    [85899.365107] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [85899.365840] FS:  00007fa181b3a940(0000) GS:ffff9883bfc80000(0000) knlGS:0000000000000000
    [85899.366669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [85899.367257] CR2: ffffffffc0b0ceef CR3: 000000024ed44000 CR4: 00000000003606e0
    [85899.367992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [85899.368728] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [85899.369453] Call Trace:
    [85899.369726]  [<ffffffffa392579b>] system_call_fastpath+0x22/0x27
    [85899.370339] Code:  Bad RIP value.
    [85899.370729] RIP  [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
    [85899.371292]  RSP <ffff9880fa2cbd30>
    [85899.373188] CR2: ffffffffc0b0ceef
  • Cause

    The kernel or a driver accessed invalid memory.
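When the instruction pointer has no symbol, as in the log above, the module list can sometimes offer a lead on which driver to examine. The following is a minimal sketch, assuming the log has been saved to a file such as vmcore-dmesg.txt in the current directory. module_hints is a hypothetical helper name. It searches the whole file, so unrelated text of the form name(FLAGS) would also be printed.

```shell
# Hypothetical helper: prints modules that carry flags in parentheses,
# followed by the "last unloaded" entry, from the kernel log in file $1.
module_hints() {
    # Flags such as (OE) or (U) typically mark out-of-tree, unsigned,
    # or otherwise unsupported modules. No match is not an error.
    grep -oE '[A-Za-z0-9_]+\([A-Z]+\)' "$1" || true
    grep -oE '\[last unloaded: [^]]+\]' "$1" || true
}

log="vmcore-dmesg.txt"   # assumed location of the saved kernel log
if [ -r "$log" ]; then
    module_hints "$log"
fi
```

The output is only a list of candidates; confirming the faulty module still requires analyzing the call trace or the vmcore.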

  • Solution