Troubleshoot crashes of Linux ECS instances

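If a Linux ECS instance hits a kernel panic, runs out of memory (OOM, Out Of Memory), or shows a blue screen or hang while running, or if you receive a system event reporting that the instance's operating system has crashed, the instance has crashed. You can use the self-service diagnostic tool or the kernel logs to locate and resolve the problem.

Locate the cause of the crash

You can use any of the following methods to locate the specific cause of the crash.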

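Method 1 (recommended): Use the self-service diagnostic tool

  1. Log on to the ECS console. In the left-side navigation pane, click Self-service Troubleshooting.

  2. Click the Instance Troubleshooting tab.

  3. Select Instance Connection Failure or Startup Exception > Instance Crash, select the ID of the instance that crashed, and then click Start Troubleshooting.

    Locate and resolve the problem based on the returned diagnostic results and suggested fixes.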

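Method 2: Use system events

  1. Log on to the ECS console. In the left-side navigation pane, click Events.

  2. In the left-side navigation pane, click Unexpected O&M Events.

  3. Find the instance that has a crash event and click Diagnose Root Cause of Operating System Error on its right to diagnose the cause of the crash.

    Locate and resolve the problem based on the returned diagnostic results and suggested fixes.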

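Method 3: Use kdump to view kernel logs

If kdump is installed and configured, a vmcore-dmesg.txt file is generated when the system crashes. This file contains the kernel log from the time of the crash. Use the call trace in it (usually introduced by "Call Trace:") to find where the problem occurred, analyze the cause, and then fix and debug it.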
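
As a minimal sketch of reading such a file, the helper below prints the first line that names the failure and the start of the call trace. The function name show_crash_summary, the list of headline patterns, and the 10-line window are illustrative assumptions, not part of kdump.

```shell
#!/bin/sh
# show_crash_summary is a hypothetical helper name used only in this sketch.
# Usage: show_crash_summary <path to vmcore-dmesg.txt>
show_crash_summary() {
    dmesg_file="$1"
    # First line that names the failure (panic, Oops, BUG, or divide error).
    grep -m 1 -E 'Kernel panic|Oops:|BUG:|kernel BUG at|divide error' "$dmesg_file"
    # The "Call Trace:" marker plus up to 10 following lines of the stack.
    grep -m 1 -A 10 'Call Trace:' "$dmesg_file"
}
```

On many distributions kdump writes its output under /var/crash, but the location depends on your kdump configuration, so treat that path as an assumption and pass the file you actually have.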

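Hands-on practice

To try out the procedures in this document, click Verify the guest OS panic diagnosis capability.

Common crash causes and solutions

Instance crashes with the log “not syncing: Out of memory: system-wide panic_on_oom is enabled”

  • Problem description

    A Linux ECS instance crashes while running and logs “not syncing: Out of memory: system-wide panic_on_oom is enabled”. The call stack is similar to the following: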

    [3624965.306801] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
    [3624965.307824] CPU: 5 PID: 8510 Comm: AliDetect Kdump: loaded Tainted: GOE  ------------ T 3.10.0-1127.10.1.el7.x86_64 #1
    [3624965.308923] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [3624965.309671] Call Trace:
    [3624965.309935]  [<ffffffff8f37ffa5>] dump_stack+0x19/0x1b
    [3624965.310444]  [<ffffffff8f379541>] panic+0xe8/0x21f
    [3624965.310913]  [<ffffffff8edc26b5>] check_panic_on_oom+0x55/0x60
    [3624965.311480]  [<ffffffff8edc2aab>] out_of_memory+0x23b/0x4f0
    [3624965.312027]  [<ffffffff8f37b3e0>] __alloc_pages_slowpath+0x5db/0x729
    [3624965.312628]  [<ffffffff8edc91a6>] __alloc_pages_nodemask+0x436/0x450
    [3624965.313233]  [<ffffffff8ee18e78>] alloc_pages_current+0x98/0x110
    [3624965.313808]  [<ffffffff8edbe3d7>] __page_cache_alloc+0x97/0xb0
    [3624965.314364]  [<ffffffff8edc0f90>] filemap_fault+0x270/0x420
    [3624965.314912]  [<ffffffffc04ea7d6>] ext4_filemap_fault+0x36/0x50 [ext4]
    [3624965.315530]  [<ffffffff8ededf4a>] __do_fault.isra.61+0x8a/0x100
    [3624965.316095]  [<ffffffff8edee4fc>] do_read_fault.isra.63+0x4c/0x1b0
    [3624965.316680]  [<ffffffff8edf5d60>] handle_mm_fault+0xa20/0xfb0
    [3624965.317231]  [<ffffffff8f38d653>] __do_page_fault+0x213/0x500
    [3624965.317775]  [<ffffffff8f38da26>] trace_do_page_fault+0x56/0x150
    [3624965.318378]  [<ffffffff8f38cfa2>] do_async_page_fault+0x22/0xf0
    [3624965.318954]  [<ffffffff8f3897a8>] async_page_fault+0x28/0x30
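  • Cause

    The instance ran out of memory and an OOM occurred while the kernel parameter vm.panic_on_oom was set to 1 or 2.

    • 1: When memory runs out, the kernel may panic, or it may start the OOM killer instead.

    • 2: When memory runs out, the kernel always panics.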
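
The values above can be checked on a running instance with a short sketch like the following. The function name describe_panic_on_oom is hypothetical. The descriptions for 1 and 2 come from the list above, and the description for 0 comes from the solution that follows.

```shell
#!/bin/sh
# describe_panic_on_oom is a hypothetical helper that maps a value of
# vm.panic_on_oom to the behavior described in this document.
describe_panic_on_oom() {
    case "$1" in
        0) echo "0: the OOM killer runs, no panic" ;;
        1) echo "1: the kernel may panic or may run the OOM killer" ;;
        2) echo "2: the kernel always panics on OOM" ;;
        *) echo "unexpected value: $1" ;;
    esac
}

# Report the live setting when the file is readable on this system.
if [ -r /proc/sys/vm/panic_on_oom ]; then
    describe_panic_on_oom "$(cat /proc/sys/vm/panic_on_oom)"
fi
```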

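  • Solution

    Solution 1: Set the kernel parameter vm.panic_on_oom to 0

    Set vm.panic_on_oom to 0 so that the kernel starts the OOM killer instead of panicking when memory runs out.

    Important

    With vm.panic_on_oom set to 0, the system starts the OOM killer when memory runs out and terminates processes that use large amounts of memory. This can affect system stability and running applications. Before you make this change, make sure that you understand its impact and evaluate the system's memory management and your applications' requirements.

    1. Connect to the ECS instance remotely.

    2. Run the following command to open the /etc/sysctl.conf file: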

      sudo vim /etc/sysctl.conf
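    3. Press the i key to enter edit mode, and then change the setting to the following:

      vm.panic_on_oom = 0

      This stops the system from panicking when memory runs out.

    4. Press the Esc key, enter :wq, and then press Enter to save the file and exit the editor.

    5. Run the following command to load the changes from sysctl.conf: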

      sudo sysctl -p
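
To confirm that the change took effect, one option is to compare the value in the file with the running value. The sketch below assumes that the setting lives in /etc/sysctl.conf. The function name configured_panic_on_oom is hypothetical, and files under /etc/sysctl.d (not checked here) may also set the parameter.

```shell
#!/bin/sh
# configured_panic_on_oom is a hypothetical helper. It prints the last
# uncommented vm.panic_on_oom assignment in a sysctl-style file, because
# later lines are applied after earlier ones.
configured_panic_on_oom() {
    sed -n 's/^[[:space:]]*vm\.panic_on_oom[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

# Compare the file with the running kernel when both are available.
if [ -r /etc/sysctl.conf ] && command -v sysctl >/dev/null 2>&1; then
    echo "configured: $(configured_panic_on_oom /etc/sysctl.conf)"
    echo "running:    $(sysctl -n vm.panic_on_oom 2>/dev/null)"
fi
```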

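    Solution 2: Optimize memory usage

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.

    An OOM is usually caused by insufficient memory. Judge whether the memory usage is reasonable for your workload, and then consider the following ways to increase memory capacity or reduce memory usage:

    • Upgrade the instance type

      Upgrading the instance type gives you more memory. For details, see Change the instance type.

    • Optimize applications

      Check the memory usage of your applications and optimize it, for example by fixing memory leaks or tuning algorithms and configurations.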

Instance crashes with the log “RIP: tcp_create_openreq_child”

  • Problem description

    A Linux ECS instance crashes while running and logs “RIP: tcp_create_openreq_child”. The call stack is similar to the following:

    [8343753.027138] Oops: 0000 [#1] SMP PTI
    [8343753.027431] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE     5.4.0-122-generic #138-Ubuntu
    [8343753.028127] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [8343753.028728] RIP: 0010:tcp_create_openreq_child+0x2fd/0x410
    ...
    [8343753.036508] Call Trace:
    [8343753.036710]  <IRQ>
    [8343753.036886]  tcp_v4_syn_recv_sock+0x5a/0x400
    [8343753.037234]  tcp_get_cookie_sock+0x48/0x150
    [8343753.037564]  cookie_v4_check+0x581/0x6d0
    [8343753.037880]  tcp_v4_do_rcv+0x1a5/0x200
    [8343753.038184]  tcp_v4_rcv+0xc76/0xd10
    [8343753.038551]  ip_protocol_deliver_rcu+0x30/0x1b0
    [8343753.038980]  ip_local_deliver_finish+0x48/0x50
    [8343753.039335]  ip_local_deliver+0x73/0xf0
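  • Cause

    A bug in the operating system kernel version causes a null pointer dereference, which triggers the system's protection mechanism and crashes the instance. See Bug details.

  • Solution

    Upgrade the operating system kernel to 5.4.0-123.139 or later. For details, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.

Instance crashes with the log “sysrq_handle_crash”

  • Problem description

    A Linux ECS instance crashes and restarts while running and logs “RIP: sysrq_handle_crash”. The call stack is similar to the following: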

    [ 7262.769377] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_powerclamp iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper virtio_balloon shpchp cryptd parport_pc parport i2c_piix4 pcspkr ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk virtio_console cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel serio_raw drm ata_piix virtio_pci libata virtio_ring i2c_core virtio floppy
    [ 7262.774113] CPU: 1 PID: 3818 Comm: bash Not tainted 3.10.0-514.26.2.el7.x86_64 #1
    [ 7262.774699] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [ 7262.775317] task: ffff88040d3d5e20 ti: ffff8803cb7ac000 task.ti: ffff8803cb7ac000
    [ 7262.775904] RIP: 0010:[<ffffffff813ee1d6>]  [<ffffffff813ee1d6>] sysrq_handle_crash+0x16/0x20
    ...
    [ 7262.784790] Call Trace:
    [ 7262.784992]  [<ffffffff813ee9f7>] __handle_sysrq+0x107/0x170
    [ 7262.785450]  [<ffffffff813eee6f>] write_sysrq_trigger+0x2f/0x40
    [ 7262.785915]  [<ffffffff8126be0d>] proc_reg_write+0x3d/0x80
    [ 7262.786355]  [<ffffffff811fe9fd>] vfs_write+0xbd/0x1e0
    [ 7262.786759]  [<ffffffff811ff51f>] SyS_write+0x7f/0xe0
    [ 7262.787172]  [<ffffffff81697809>] system_call_fastpath+0x16/0x1b
  • Cause

    A user deliberately triggered the crash from inside the instance by running the following command:

    echo c > /proc/sysrq-trigger
  • Solution

    Do not run the command echo c > /proc/sysrq-trigger, so that the system does not crash again.
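
If it is unclear what ran the command, a rough sketch like the one below can search scripts and scheduled jobs for it. The function name find_sysrq_writers and the scanned locations are assumptions for illustration. The pattern only matches the literal form shown above, so a quoted or otherwise rewritten command would be missed.

```shell
#!/bin/sh
# find_sysrq_writers is a hypothetical helper. It lists the files under the
# given paths that contain the literal crash trigger command.
find_sysrq_writers() {
    grep -rlE 'echo[[:space:]]+c[[:space:]]*>[[:space:]]*/proc/sysrq-trigger' "$@" 2>/dev/null
}

# Example scan of common cron locations (assumed paths, adjust as needed).
find_sysrq_writers /etc/crontab /etc/cron.d /var/spool/cron || true
```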

Instance crashes with the log “RIP:get_target_pstate_use_performance”

  • Problem description

    A Linux ECS instance crashes while running and logs “RIP:get_target_pstate_use_performance”. The call stack is similar to the following:

    [    1.076899] divide error: 0000 [#1] SMP
    [    1.077669] Modules linked in:
    [    1.078302] CPU: 4 PID: 9 Comm: rcu_sched Not tainted 3.10.0-1127.19.1.el7.x86_64 #1
    [    1.079519] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
    [    1.080724] task: ffff91c8fa111070 ti: ffff91c8fa11c000 task.ti: ffff91c8fa11c000
    [    1.081919] RIP: 0010:[<ffffffff85dc3089>]  [<ffffffff85dc3089>] get_target_pstate_use_performance+0x29/0xc0
    [    1.083355] RSP: 0000:ffff91c8fa11fb40  EFLAGS: 00010006
    [    1.093192] Call Trace:
    [    1.093715]  [<ffffffff85dc4081>] intel_pstate_update_util+0x161/0x310
    [    1.094550]  [<ffffffff858e9523>] ? load_balance+0x1a3/0xa10
    [    1.095321]  [<ffffffff858e4e87>] update_curr+0x127/0x1e0
    [    1.096123]  [<ffffffff858e52a8>] dequeue_entity+0x28/0x5c0
    [    1.096894]  [<ffffffff8586d3be>] ? kvm_sched_clock_read+0x1e/0x30
    [    1.097702]  [<ffffffff858e5893>] dequeue_task_fair+0x53/0x660
    [    1.098490]  [<ffffffff858debe5>] ? sched_clock_cpu+0x85/0xc0
    [    1.099266]  [<ffffffff858d7a56>] deactivate_task+0x46/0xd0
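  • Cause

    The problem can occur when the current_pstate frequency value of the Intel pstate driver is initialized to 0 during instance startup. On a process switch, the system calls Intel pstate to adjust the performance mode to the system load. If Intel pstate uses a current_pstate of 0, a divide-by-zero error can occur and crash the system.

  • Solution

    Upgrade the operating system kernel to 4.18 or later. For details, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.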
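
To check whether an instance already runs a kernel at or above the version named in the solution, a version comparison can be sketched as follows. The function name kernel_at_least is hypothetical, and the sketch assumes a sort that supports -V (version sort), as GNU coreutils does. It compares version strings only and does not account for fixes that a distribution backports to an older kernel.

```shell
#!/bin/sh
# kernel_at_least is a hypothetical helper. It succeeds when the first
# version string sorts at or above the second under version ordering.
kernel_at_least() {
    current="$1"
    required="$2"
    lowest=$(printf '%s\n%s\n' "$current" "$required" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

if kernel_at_least "$(uname -r)" 4.18; then
    echo "running kernel $(uname -r) is 4.18 or later"
else
    echo "running kernel $(uname -r) is older than 4.18"
fi
```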

Instance crashes with the log “not syncing: Out of memory and no killable processes”

  • Problem description

    A Linux ECS instance crashes while running and logs “not syncing: Out of memory and no killable processes”. The call stack is similar to the following:

    [217894.026467] Out of memory: Kill process 17807 (php-fpm) score 4 or sacrifice child
    [217894.027560] Killed process 17807 (php-fpm) total-vm:386252kB, anon-rss:6972kB, file-rss:144kB, shmem-rss:9020kB
    [217894.910947] php-fpm invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
    [217894.912175] php-fpm cpuset=/ mems_allowed=0
    [217894.913100] CPU: 0 PID: 18534 Comm: php-fpm Tainted: GOE  ------------   3.10.0-957.21.3.el7.x86_64 #1
    [217894.914510] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [217894.915780] Call Trace:
    [217894.916607]  [<ffffffff8ff63107>] dump_stack+0x19/0x1b
    [217894.917775]  [<ffffffff8ff5db2a>] dump_header+0x90/0x229
    [217894.918914]  [<ffffffff8f901292>] ? ktime_get_ts64+0x52/0xf0
    [217894.919979]  [<ffffffff8f9584df>] ? delayacct_end+0x8f/0xb0
    [217894.921026]  [<ffffffff8f9ba834>] oom_kill_process+0x254/0x3d0
    [217894.922097]  [<ffffffff8f9ba2dd>] ? oom_unkillable_task+0xcd/0x120
    [217894.923248]  [<ffffffff8f9ba386>] ? find_lock_task_mm+0x56/0xc0
    [217894.924364]  [<ffffffff8f9bb076>] out_of_memory+0x4b6/0x4f0
    [217894.925513]  [<ffffffff8ff5e62e>] __alloc_pages_slowpath+0x5d6/0x724
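  • Cause

    The system ran out of memory and found no process that it could terminate to free memory, so it could no longer run normally.

  • Solution

    Judge whether the memory usage is reasonable for your workload, and then consider the following ways to increase memory capacity or reduce memory usage:

    • Upgrade the instance type

      Upgrading the instance type gives you more memory. For details, see Change the instance type.

    • Optimize applications

      Check which processes on the ECS instance use the most memory, judge whether that usage is reasonable, and optimize it, for example by fixing memory leaks or tuning algorithms and configurations.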
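
One way to find the processes that use the most memory is to rank ps output by resident set size (RSS). The sketch below is a minimal example. The function name rank_by_rss is hypothetical, and it expects lines of "PID RSS_KB COMMAND" on standard input.

```shell
#!/bin/sh
# rank_by_rss is a hypothetical helper. It reads "PID RSS_KB COMMAND" lines
# and prints the N lines with the largest RSS (N defaults to 10).
rank_by_rss() {
    sort -k2,2nr | head -n "${1:-10}"
}

# Live view: the five processes with the largest resident memory.
ps -e -o pid= -o rss= -o comm= | rank_by_rss 5
```

RSS counts the memory currently resident for each process. It is a starting point for deciding which application to examine, not proof of a leak.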

Instance crashes with the log “RIP:__list_del_entry_valid.cold”

  • Problem description

    A Linux ECS instance crashes while running and logs “list_del corruption, ffff91bc2ad47048->prev is LIST_POISON2 (dead000000000200)”. The call stack is similar to the following:

    [1072741.548729] list_del corruption, ffff91bc2ad47048->prev is LIST_POISON2 (dead000000000200)
    [1072741.549507] ------------[ cut here ]------------
    [1072741.549886] kernel BUG at lib/list_debug.c:50!
    [1072741.550275] invalid opcode: 0000 [#1] SMP PTI
    [1072741.550646] CPU: 0 PID: 1583643 Comm: kworker/0:1 Tainted: G           OE    --------- -  - 4.18.0-305.3.1.el8.x86_64 #1
    [1072741.551468] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [1072741.552048] Workqueue: cgroup_destroy css_release_work_fn
    [1072741.552462] RIP: 0010:__list_del_entry_valid.cold.1+0x45/0x4c
    ...
    [1072741.560426] Call Trace:
    [1072741.560638]  css_release_work_fn+0x3f/0x240
    [1072741.560983]  process_one_work+0x1a7/0x360
    [1072741.561300]  worker_thread+0x30/0x390
    [1072741.561622]  ? create_worker+0x1a0/0x1a0
    [1072741.561933]  kthread+0x116/0x130
    [1072741.562195]  ? kthread_flush_work_fn+0x10/0x10
    [1072741.562557]  ret_from_fork+0x35/0x40
    [1072741.562843] Modules linked in: AliSecGuard(OE) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl joydev pcspkr virtio_balloon i2c_piix4 ip_tables xfs libcrc32c ata_generic cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ata_piix libata crc32c_intel virtio_net net_failover serio_raw failover virtio_console virtio_blk
    [1072741.566968] Features: eBPF/event
    [1072741.567302] ---[ end trace 8f40bd2bf2a072e5 ]---
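  • Cause

    A bug in the operating system kernel version: list_del hits the LIST_POISON2 (dead000000000200) error and the instance crashes. See Bug details.

  • Solution

    Upgrade the operating system kernel to kernel-4.18.0-305.12.1.el8_4 or later. For details, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.

Instance crashes with the log “RIP:module_put”

  • Problem description

    A Linux ECS instance crashes while running and logs “RIP:module_put”. The call stack is similar to the following: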

    [86389.969666] CPU: 2 PID: 1426 Comm: Syn-1203-Tx Tainted: GOE  ------------   3.10.0-1160.53.1.el7.x86_64 #1
    [86389.970626] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [86389.971377] task: ffff983118bfc200 ti: ffff982defd58000 task.ti: ffff982defd58000
    [86389.972034] RIP: 0010:[<ffffffff8c91956d>]  [<ffffffff8c91956d>] module_put+0x1d/0x80
    ...
    [86389.979170] Call Trace:
    [86389.979378]  [<ffffffff8ca53b40>] cdev_put+0x20/0x30
    [86389.979768]  [<ffffffff8ca5098f>] __fput+0x1ef/0x230
    [86389.980151]  [<ffffffff8ca50abe>] ____fput+0xe/0x10
    [86389.980526]  [<ffffffff8c8c299b>] task_work_run+0xbb/0xe0
    [86389.980946]  [<ffffffff8c8a1954>] do_exit+0x2d4/0xa30
    [86389.981375]  [<ffffffff8c91358f>] ? futex_wait+0x11f/0x280
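  • Cause

    A system process used or accessed memory that had already been freed, which is a use-after-free error. This triggers the operating system's protection mechanism or corrupts data, and the system crashes.

    Note

    Use-after-free is a common class of software defect that occurs when a program uses or accesses memory after it has been freed. It can lead to unpredictable behavior such as crashes, data corruption, data leaks, or execution of malicious code.

  • Solution

    Upgrade the operating system kernel to kernel-4.18.0-305.12.1.el8_4 or later. For details, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.

Instance crashes with the log “containerd: page allocation failure”

  • Problem description

    A Linux ECS instance crashes while running and logs “containerd: page allocation failure”. The call stack is similar to the following: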

    [1558839.130515] ------------[ cut here ]------------
    [1558839.131215] kernel BUG at lib/idr.c:1163!
    [1558839.131797] invalid opcode: 0000 [#1] SMP 
    [1558839.132411] Modules linked in: binfmt_misc AliSecGuard(OE) AliSecProcFilter64(OE) AliSecNetFlt64(OE) xt_CT xt_multiport ipt_rpfilter iptable_raw ip_set_hash_net ip_set_hash_ip ipip tunnel4 ip_tunnel veth ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables iptable_mangle nf_conntrack_netlink xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_mark xt_addrtype xt_set ip_set_bitmap_port ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set nfnetlink dummy xt_comment iptable_nat nf_nat_ipv4 nf_nat iptable_filter tcp_diag inet_diag overlay(T) sunrpc nfit ppdev libnvdimm iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr parport_pc parport i2c_piix4 nf_conntrack_ipv4 nf_defrag_ipv4 ip_vs_sh ip_vs_wrr
    [1558839.141715]  ip_vs_rr ip_vs nf_conntrack libcrc32c br_netfilter bridge stp llc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_console virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy virtio drm_panel_orientation_quirks
    [1558839.147553] CPU: 6 PID: 21465 Comm: kworker/6:0 Tainted: G           OE  ------------ T 3.10.0-957.21.3.el7.x86_64 #1
    [1558839.149181] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [1558839.150656] Workqueue: events free_work
    [1558839.151766] task: ffff8fbc4d6e9040 ti: ffff8fb8b898c000 task.ti: ffff8fb8b898c000
    [1558839.153196] RIP: 0010:[<ffffffff967774e1>]  [<ffffffff967774e1>] ida_simple_remove+0x41/0x50
    ...
    [1558839.171901] Call Trace:
    [1558839.173133]  [<ffffffff966306c4>] __mem_cgroup_free+0x234/0x250
    [1558839.174750]  [<ffffffff966306f5>] free_work+0x15/0x20
    [1558839.176259]  [<ffffffff964b9ebf>] process_one_work+0x17f/0x440
    [1558839.177872]  [<ffffffff964baf56>] worker_thread+0x126/0x3c0
    [1558839.179421]  [<ffffffff964bae30>] ? manage_workers.isra.25+0x2a0/0x2a0
    [1558839.181092]  [<ffffffff964c1da1>] kthread+0xd1/0xe0
    [1558839.182839]  [<ffffffff964c1cd0>] ? insert_kthread_work+0x40/0x40
    [1558839.184543]  [<ffffffff96b75c37>] ret_from_fork_nospec_begin+0x21/0x21
    [1558839.186238]  [<ffffffff964c1cd0>] ? insert_kthread_work+0x40/0x40
    ...
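  • Cause

    A bug in the operating system kernel version: when memory control groups are enabled, the memcg_caches[] array grows for each registered kernel memory cache. If no memory is available, that is, an out-of-memory event occurs, the system may crash.

  • Solution

    On CentOS 7.7, we recommend that you upgrade to kernel-3.10.0-1062.el7 or later. On CentOS 7.6, we recommend that you upgrade to kernel-3.10.0-957.27.2.el7 or later. For details, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.

Instance crashes with the log “RIP:blk_mq_rq_timed_out”

  • Problem description

    A Linux ECS instance crashes while running and logs “RIP:blk_mq_rq_timed_out”. The call stack is similar to the following: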

    [8837401.113325] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
    [8837401.114219] IP: [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0
    [8837401.114892] PGD 8000000885d08067 PUD e1beda067 PMD 0 
    [8837401.115471] Oops: 0000 [#1] SMP 
    [8837401.115855] Modules linked in: AliSecNetFlt64(OE) AliSecGuard(OE) AliSecProcFilter64(OE) xt_multiport veth ipt_rpfilter ip6t_rpfilter ip6t_MASQUERADE nf_nat_masquerade_ipv6 xt_set iptable_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_filter ip6table_raw ip6_tables ip_set_hash_ip ip_set_hash_net ip_set sch_htb xt_nat xt_statistic ipt_REJECT nf_reject_ipv4 nf_tables iptable_mangle xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat tcp_diag inet_diag nfsv3 nfs_acl nfs lockd grace fscache overlay(T) sunrpc nfit libnvdimm iosf_mbi crc32_pclmul ppdev virtio_balloon joydev ghash_clmulni_intel parport_pc aesni_intel parport lrw gf128mul glue_helper i2c_piix4 ablk_helper pcspkr cryptd ip_vs_rr ip_vs_sh ip_vs_wrr ip_vs nf_conntrack ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net net_failover virtio_console virtio_blk failover cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy drm_panel_orientation_quirks virtio libcrc32c br_netfilter bridge stp llc [last unloaded: AliSecNetFlt64]
    [8837401.130281] CPU: 0 PID: 163944 Comm: kworker/0:1H Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.80.1.el7.x86_64 #1
    [8837401.133029] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
    [8837401.134621] Workqueue: kblockd blk_mq_timeout_work
    [8837401.135916] task: ffff88258a0b6300 ti: ffff8820c2b9c000 task.ti: ffff8820c2b9c000
    [8837401.137422] RIP: 0010:[<ffffffffae575638>]  [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0
    [8837401.139091] RSP: 0018:ffff8820c2b9fd18  EFLAGS: 00010246
    [8837401.140371] RAX: 0000000000000000 RBX: ffff8819b6ad0000 RCX: 0000000000000000
    [8837401.141838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8819b6ad0000
    [8837401.143314] RBP: ffff8820c2b9fd20 R08: 000000030ec11230 R09: df98ad67960c8828
    [8837401.144732] R10: df98ad67960c8828 R11: ffff8822d9e17f00 R12: ffff8819b6863240
    [8837401.146161] R13: 0000000000000002 R14: 0000000000000020 R15: 0000000000000002
    [8837401.147605] FS:  0000000000000000(0000) GS:ffff8829bfc00000(0000) knlGS:0000000000000000
    [8837401.149177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [8837401.150426] CR2: 00000000000000d0 CR3: 00000003e570a000 CR4: 00000000003606f0
    [8837401.151844] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [8837401.153287] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [8837401.154667] Call Trace:
    [8837401.155579]  [<ffffffffae57572c>] blk_mq_check_expired+0x6c/0x80
    [8837401.157057]  [<ffffffffae578dac>] bt_iter+0x5c/0x70
    [8837401.158357]  [<ffffffffae57984b>] blk_mq_queue_tag_busy_iter+0x13b/0x320
    [8837401.159675]  [<ffffffffae2e84c9>] ? pick_next_entity+0xa9/0x190
    [8837401.160968]  [<ffffffffae5756c0>] ? blk_mq_rq_timed_out+0xa0/0xa0
    [8837401.162414]  [<ffffffffae5756c0>] ? blk_mq_rq_timed_out+0xa0/0xa0
    [8837401.163748]  [<ffffffffae57428b>] blk_mq_timeout_work+0x8b/0x180
    [8837401.165062]  [<ffffffffae2c319f>] process_one_work+0x17f/0x440
    [8837401.166329]  [<ffffffffae2c42e6>] worker_thread+0x126/0x3c0
    [8837401.167541]  [<ffffffffae2c41c0>] ? manage_workers.isra.26+0x2b0/0x2b0
    [8837401.169048]  [<ffffffffae2cb4d1>] kthread+0xd1/0xe0
    [8837401.170311]  [<ffffffffae2cb400>] ? insert_kthread_work+0x40/0x40
    [8837401.171514]  [<ffffffffae9c51f7>] ret_from_fork_nospec_begin+0x21/0x21
    [8837401.172861]  [<ffffffffae2cb400>] ? insert_kthread_work+0x40/0x40
    [8837401.174091] Code: 83 84 c6 80 00 00 00 01 e8 f6 fe ff ff 5d c3 cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 53 48 8b 57 58 48 8b 47 38 48 89 fb 83 e2 02 <48> 8b 80 d0 00 00 00 74 4c 48 83 78 10 00 74 50 48 ba 00 00 00 
    [8837401.178255] RIP  [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0
    [8837401.179436]  RSP <ffff8820c2b9fd18>
    [8837401.180300] CR2: 00000000000000d0
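  • Cause

    A bug in the operating system kernel version: the kernel dereferences a null pointer, which triggers a memory access error and crashes the instance. See Bug details.

  • Solution

    Upgrade the operating system kernel to kernel-3.10.0-1160.88.1.el7 or later. For details, see Upgrade the kernel of a Linux ECS instance.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.

Instance crashes with the log “RIP:strnlen”

  • Problem description

    A Linux ECS instance crashes while running and logs “RIP:strnlen”. The call stack is similar to the following: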

    [86390.829326] BUG: unable to handle kernel paging request at 0000000100620100
    [86390.829510] IP: [<ffffffff9ed7f2ad>] strnlen+0xd/0x40
    [86390.829632] PGD 0 
    [86390.829685] Oops: 0000 [#1] SMP 
    [86390.829766] Modules linked in: AliSecGuard(OE) binfmt_misc xt_conntrack iptable_filter iptable_nat nf_nat_ipv4 arc4 emp(OE) nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat nf_conntrack eudp(E) libcrc32c ppdev intel_powerclamp crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc virtio_balloon parport i2c_piix4 pcspkr ip_tables ext4 mbcache jbd2 cirrus drm_kms_helper syscopyarea sysfillrect virtio_net virtio_console virtio_blk sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common drm crc32c_intel serio_raw floppy virtio_pci virtio_ring virtio drm_panel_orientation_quirks
    [86390.831199] CPU: 2 PID: 1311 Comm: KeepAlive Tainted: G           OE  ------------   3.10.0-957.el7.x86_64 #1
    [86390.831410] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 9e9f1cc 04/01/2014
    [86390.831580] task: ffff97c77add9040 ti: ffff97c77ade0000 task.ti: ffff97c77ade0000
    [86390.831742] RIP: 0010:[<ffffffff9ed7f2ad>]  [<ffffffff9ed7f2ad>] strnlen+0xd/0x40
    ......
    [86390.833643] Call Trace:
    [86390.833699]  [<ffffffff9ed8105b>] string.isra.7+0x3b/0xf0
    [86390.833805]  [<ffffffff9ed82771>] vsnprintf+0x201/0x6a0
    [86390.833908]  [<ffffffff9ed82c1d>] vscnprintf+0xd/0x30
    [86390.834011]  [<ffffffff9ea9a24b>] vprintk_emit+0x11b/0x510
    [86390.834143]  [<ffffffff9ea9a8a9>] ? vprintk_default+0x29/0x40
    [86390.834277]  [<ffffffff9ed77ef0>] ? kobject_put+0x50/0x60
    [86390.834407]  [<ffffffff9ea9a65f>] vprintk+0x1f/0x30
    [86390.834517]  [<ffffffff9ea975ef>] __warn+0x7f/0x100
    [86390.834618]  [<ffffffff9ea976cf>] warn_slowpath_fmt+0x5f/0x80
    [86390.834746]  [<ffffffffc02e2b64>] ? close_eudp_mmap_dev+0x1b4/0x200 [eudp]
    [86390.834896]  [<ffffffff9ed77ef0>] kobject_put+0x50/0x60
    [86390.835013]  [<ffffffff9ec466f8>] cdev_put+0x18/0x30
    [86390.835125]  [<ffffffff9ec4350a>] __fput+0x21a/0x260
    [86390.835232]  [<ffffffff9ec4363e>] ____fput+0xe/0x10
    [86390.835340]  [<ffffffff9eabe79b>] task_work_run+0xbb/0xe0
    [86390.835459]  [<ffffffff9ea9dc61>] do_exit+0x2d1/0xa40
    [86390.835568]  [<ffffffff9ea9e44f>] do_group_exit+0x3f/0xa0
    [86390.835695]  [<ffffffff9eaaf24e>] get_signal_to_deliver+0x1ce/0x5e0
    [86390.835830]  [<ffffffff9ea2b527>] do_signal+0x57/0x6f0
    [86390.835942]  [<ffffffff9eac57e0>] ? hrtimer_get_res+0x50/0x50
    [86390.836068]  [<ffffffff9ea2bc32>] do_notify_resume+0x72/0xc0
    [86390.836202]  [<ffffffff9f175124>] int_signal+0x12/0x17
    ...
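  • Cause

    The third-party module eudp is installed on the system. A bug in this module (for example, an invalid argument passed to the strnlen function) crashes the instance.

  • Solution

    We recommend that you uninstall the third-party module eudp.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.

Instance crashes with the log “RIP:filp_close”

  • Problem description

    A Linux ECS instance crashes while running and logs “RIP:filp_close”. The call stack is similar to the following: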

    [ 1891.552008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000036
    [ 1891.552149] IP: [<ffffffff8801c67e>] filp_close+0xe/0x90
    [ 1891.552239] PGD 40819b067 PUD 40819a067 PMD 0 
    [ 1891.552321] Oops: 0000 [#1] SMP 
    [ 1891.552380] Modules linked in: AliSecGuard(OE) AliSecNetFlt64(OE) tampercore(OE) tampercfg(OE) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_powerclamp crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc parport i2c_piix4 shpchp virtio_balloon pcspkr ip_tables ext4 mbcache jbd2 cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net virtio_console virtio_blk drm crct10dif_pclmul crct10dif_common virtio_pci crc32c_intel virtio_ring i2c_core serio_raw virtio floppy
    [ 1891.553945] CPU: 3 PID: 2778 Comm: AliHips Tainted: G           OE  ------------   3.10.0-862.14.4.el7.x86_64 #1
    [ 1891.554107] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 9e9f1cc 04/01/2014
    [ 1891.554228] task: ffff88d4cd7e4f10 ti: ffff88d4c5af8000 task.ti: ffff88d4c5af8000
    [ 1891.554346] RIP: 0010:[<ffffffff8801c67e>]  [<ffffffff8801c67e>] filp_close+0xe/0x90
    ......
    [ 1891.555727] Call Trace:
    [ 1891.555772]  [<ffffffffc08d0d7c>] is_pathsite+0x1ac/0x400 [tampercore]
    [ 1891.555878]  [<ffffffff88055e1a>] ? bh_lru_install+0x18a/0x1e0
    [ 1891.555974]  [<ffffffff880563fc>] ? __find_get_block+0xbc/0x120
    [ 1891.556069]  [<ffffffff8805648d>] ? __getblk+0x2d/0x300
    [ 1891.556160]  [<ffffffffc02d956b>] ? search_dir+0x8b/0x120 [ext4]
    [ 1891.556258]  [<ffffffff87ebeed5>] ? wake_up_bit+0x25/0x30
    [ 1891.556345]  [<ffffffff88055b2d>] ? __brelse+0x3d/0x50
    [ 1891.556432]  [<ffffffffc02d9a69>] ? ext4_find_entry+0x299/0x570 [ext4]
    [ 1891.556536]  [<ffffffff880380cd>] ? __d_instantiate+0x2d/0xe0
    [ 1891.556629]  [<ffffffff88037446>] ? _d_rehash+0x36/0x40
    [ 1891.556712]  [<ffffffff88037473>] ? d_rehash+0x23/0x40
    [ 1891.556795]  [<ffffffff8803866c>] ? d_splice_alias+0xdc/0x120
    [ 1891.556891]  [<ffffffffc02da368>] ? ext4_lookup+0x118/0x170 [ext4]
    [ 1891.556993]  [<ffffffff8802b2b3>] ? lookup_fast+0xb3/0x230
    [ 1891.557080]  [<ffffffff8802ca48>] ? link_path_walk+0x238/0x8b0
    [ 1891.558026]  [<ffffffff8809769b>] ? proc_pid_permission+0x9b/0xc0
    [ 1891.558976]  [<ffffffff8802dfea>] ? path_lookupat+0x7a/0x8b0
    [ 1891.559917]  [<ffffffffc08d20db>] tamperhack_mkdir.part.4+0x12b/0x190 [tampercore]
    [ 1891.560888]  [<ffffffffc08d2185>] tamperhack_mkdir+0x45/0x50 [tampercore]
    [ 1891.561828]  [<ffffffff8852579b>] system_call_fastpath+0x22/0x27
    [ 1891.562736] Code: ff 00 00 00 00 e9 d3 fe ff ff 0f 1f 00 b8 ea ff ff ff eb 9d e8 c4 7c e7 ff 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 <48> 8b 47 38 48 89 fb 48 85 c0 74 5b 48 8b 47 28 49 89 f4 48 85 
    [ 1891.564925] RIP  [<ffffffff8801c67e>] filp_close+0xe/0x90
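  • Cause

    The third-party module Tampercore is installed on the system. A bug in this module causes an error in the call to the filp_close function, which crashes the instance.

  • Solution

    We recommend that you uninstall or upgrade the third-party module Tampercore.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.

Instance crashes with the log “VFS: Unable to mount root fs on unknown-block”

  • Problem description

    A Linux ECS instance crashes repeatedly during startup and cannot boot into the system. It logs “VFS: Unable to mount root fs on unknown-block”. The call stack is similar to the following: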

    [    1.573197] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
    [    1.574179] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 3.10.0-1160.6.1.el7.x86_64 #1
    [    1.575045] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
    [    1.575900] Call Trace:
    [    1.576246]  [<ffffffff8f381400>] dump_stack+0x19/0x1b
    [    1.576845]  [<ffffffff8f37a958>] panic+0xe8/0x21f
    [    1.577433]  [<ffffffff8f98b794>] mount_block_root+0x291/0x2a0
    [    1.578122]  [<ffffffff8f98b7f6>] mount_root+0x53/0x56
    [    1.578719]  [<ffffffff8f98b935>] prepare_namespace+0x13c/0x174
    [    1.579425]  [<ffffffff8f98b412>] kernel_init_freeable+0x222/0x249
    [    1.580150]  [<ffffffff8f98ab28>] ? initcall_blacklist+0xb0/0xb0
    [    1.580838]  [<ffffffff8f36fa90>] ? rest_init+0x80/0x80
    [    1.581462]  [<ffffffff8f36fa9e>] kernel_init+0xe/0x100
    [    1.582073]  [<ffffffff8f394df7>] ret_from_fork_nospec_begin+0x21/0x21
    [    1.582814]  [<ffffffff8f36fa90>] ? rest_init+0x80/0x80
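  • Cause

    A kernel upgrade was interrupted or failed, which corrupted the root file system (rootfs). During startup, the ECS instance cannot find the file system on the root partition and crashes.

  • Solution

    We recommend that you replace the system disk of the ECS instance or roll back the disk from an existing snapshot. For details, see Replace the operating system (system disk) and Roll back a disk by using a snapshot.

    Important

    Before you proceed, we recommend that you create a snapshot of the ECS instance to back up its data and avoid data loss caused by misoperation. For details, see Create a snapshot.

Instance crashes with the log “RIP:virtio_check_driver_offered_feature”

  • Problem description

    A Linux ECS instance crashes while running and logs “RIP:virtio_check_driver_offered_feature”. The call stack is similar to the following: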

    [55686.388353] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
    [55686.389223] IP: [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
    [55686.390030] PGD 229af2067 PUD 21cbac067 PMD 0 
    [55686.390514] Oops: 0000 [#1] SMP 
    [55686.390867] Modules linked in: unix_diag AliSecGuard(OE) udp_diag tcp_diag inet_diag joydev binfmt_misc xfs libcrc32c dm_mod kvm_amd kvm irqbypass crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper parport_pc ablk_helper cryptd virtio_balloon pcspkr parport i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk virtio_console cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common ata_piix crc32c_intel virtio_pci libata serio_raw virtio_ring virtio drm_panel_orientation_quirks floppy
    [55686.396603] CPU: 0 PID: 19222 Comm: fdisk Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.1.2.el7.x86_64 #1
    [55686.397848] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
    [55686.398578] task: ffff964836e8e2a0 ti: ffff964860370000 task.ti: ffff964860370000
    [55686.399303] RIP: 0010:[<ffffffffc0047450>]  [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
    ....
    [55686.406216] Call Trace:
    [55686.406473]  [<ffffffffc0102b4c>] virtblk_ioctl+0x3c/0x70 [virtio_blk]
    [55686.407098]  [<ffffffff955608b5>] __blkdev_driver_ioctl+0x25/0x40
    [55686.407697]  [<ffffffffc03b5024>] dm_blk_ioctl+0x74/0xb0 [dm_mod]
    [55686.408289]  [<ffffffff955612fa>] blkdev_ioctl+0x28a/0xa20
    [55686.408817]  [<ffffffff95488771>] block_ioctl+0x41/0x50
    [55686.409319]  [<ffffffff9545d9e0>] do_vfs_ioctl+0x3a0/0x5a0
    [55686.409845]  [<ffffffff95305a82>] ? ktime_get+0x52/0xe0
    [55686.410345]  [<ffffffff955024ec>] ? security_file_ioctl+0x1c/0x20
    [55686.410930]  [<ffffffff9545dc81>] SyS_ioctl+0xa1/0xc0
    [55686.411429]  [<ffffffff9598cede>] system_call_fastpath+0x25/0x2a
    [55686.411999] Code: d5 89 de 48 c7 c7 e0 93 04 c0 e8 4c 98 53 d5 5b 5d c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 8f a0 00 00 00 48 89 e5 <8b> 91 90 00 00 00 85 d2 74 2c 48 8b 81 88 00 00 00 39 30 74 59 
    [55686.414738] RIP  [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
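  • Cause

    The instance uses Logical Volume Manager (LVM), and a logical volume (LV) is associated with a device (vdc, for example) that has since been deleted. Because LVM still holds the configuration for that device, running a command that touches the device, such as blkid or fdisk, crashes the instance.

  • Solution

    • Solution 1: Use LVM commands to delete the configuration of the missing device so that the LVM configuration matches the actual devices.

    • Solution 2: Upgrade the kernel to kernel-3.10.0-1160.6.1.el7 or later. For details, see Upgrade the kernel of a Linux ECS instance.

Instance crashes with the log “Out of memory and no killable processes”

  • Problem description

    A Linux ECS instance crashes while running and logs “Out of memory and no killable processes”. The call stack is similar to the following: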

    [28663.625353] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
    [28663.625363] [ 1799]     0  1799    26512      245      56       3        0         -1000 sshd
    [28663.625367] [29219]     0 29219    10832      126      26       3        0         -1000 systemd-udevd
    [28663.625375] Kernel panic - not syncing: Out of memory and no killable processes...
    [28663.634374] CPU: 1 PID: 3578 Comm: kworker/u176:4 Tainted: G           OE   3.10.0-1062.9.1.el7.x86_64 #1
    [28663.676873] Call Trace:
    [28663.679312]  [<ffffffff8139f342>] dump_stack+0x63/0x81
    [28663.684421]  [<ffffffff811b2245>] panic+0xf8/0x244
    [28663.689184]  [<ffffffff811b98db>] out_of_memory+0x2eb/0x550
    [28663.694726]  [<ffffffff811be254>] __alloc_pages_may_oom+0x114/0x1c0
    [28663.700959]  [<ffffffff811bedb3>] __alloc_pages_slowpath+0x7d3/0xa40
    [28663.707279]  [<ffffffff811bf229>] __alloc_pages_nodemask+0x209/0x260
    [28663.713599]  [<ffffffff81216535>] alloc_pages_current+0x95/0x140
    [28663.719573]  [<ffffffff811ba5ee>] __get_free_pages+0xe/0x40
    [28663.725113]  [<ffffffff81075dae>] pgd_alloc+0x1e/0x160
    [28663.730225]  [<ffffffff810875e4>] mm_init+0x184/0x240
    [28663.735249]  [<ffffffff81088102>] mm_alloc+0x52/0x60
    [28663.740186]  [<ffffffff81257640>] do_execveat_common.isra.37+0x250/0x780
    [28663.759839]  [<ffffffff81257b9c>] do_execve+0x2c/0x30
    [28663.764864]  [<ffffffff810a231b>] call_usermodehelper_exec_async+0xfb/0x150
    [28663.777246]  [<ffffffff81741dd9>] ret_from_fork+0x39/0x50
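  • Cause

    After a memory allocation fails, the kernel tries to free memory by killing a process. If no process can be killed, the kernel deliberately panics. Possible causes:

    • The kernel has a memory leak, which leaves the system short of available memory.

    • Processes whose oom_score_adj is -1000 consume too much memory. These processes cannot be terminated, which leaves the system short of available memory.

      Note

      oom_score_adj is a per-process parameter that adjusts how likely the OOM (Out of Memory) killer is to terminate the process. The kernel picks the process to terminate based on each process's OOM score (oom_score): a higher oom_score means the process is more likely to be terminated, and a lower one means it is less likely. oom_score_adj ranges from -1000 to 1000, and a value of -1000 exempts the process from OOM killing entirely.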

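  • Solution

    1. Check whether the system kernel has a memory leak.

      For details, see How do I troubleshoot high slab_unreclaimable memory usage?

    2. Check whether the oom_score_adj setting of the process is reasonable.

      1. Run the following command to get the PID of the process. You can also use commands such as ps, top, or pgrep to find the PID.

        ps aux | grep <process name>

        Replace <process name> with the name of the process that you want to find.

      2. Run the following command to check the oom_score_adj setting.

        cat /proc/<PID>/oom_score_adj

        Replace <PID> with the PID that you obtained in the previous step.

        Use the oom_score_adj value, together with your environment and requirements, to judge whether the process's OOM behavior is reasonable. A value of -1000 means the kernel never selects the process for OOM termination. If such a process consumes a large amount of memory, the system can run out of available memory.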
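Checking processes one at a time can be slow. The sketch below scans every process instead and prints those whose oom_score_adj is -1000, along with their resident memory. The function name list_oom_exempt and its optional proc-root argument are hypothetical, added here so the scan can be tried against a copy of /proc.

```shell
#!/bin/sh
# Print PID, command name, and resident memory (VmRSS) of every process
# whose oom_score_adj is -1000.
# $1 is the proc root to scan; it defaults to /proc.
list_oom_exempt() {
    proc_root=${1:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        # Processes can exit while we scan, so skip unreadable entries.
        [ -r "$dir/oom_score_adj" ] || continue
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        [ "$adj" = "-1000" ] || continue
        name=$(cat "$dir/comm" 2>/dev/null)
        # Kernel threads have no VmRSS line; report 0 for them.
        rss=$(awk '/^VmRSS:/ {print $2}' "$dir/status" 2>/dev/null)
        printf '%s\t%s\t%s kB\n' "${dir##*/}" "$name" "${rss:-0}"
    done
}

list_oom_exempt
```

In the sample log above, sshd and systemd-udevd both show an oom_score_adj of -1000, so they would appear in this list. Each of them is small, which suggests the memory was consumed elsewhere.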

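Instance crashes with the log "Objects remaining in kmalloc"

  • Problem description

    When you use the memory cgroup kmem feature on an ECS instance, the kernel reports warnings like the ones below and the instance crashes. The call trace is similar to the following: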

    [80569.393775] BUG kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a0) (Tainted: P    B   W  OE  ------------ T):
    Objects remaining in kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a
    [80569.397756] -----------------------------------------------------------------------------
    [80569.397756]
    [80569.400724] INFO: Slab 0xffffea0001e94a00 objects=32 used=1 fp=0xffff88007a528000 flags=0x1fffff00004080
    [80569.402702] CPU: 21 PID: 26626 Comm: dockerd Tainted: P    B   W  OE  ------------ T 3.10.0-693.2.2.el7.x86_64 #1
    [80569.404898] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
    [80569.406747]  ffffea0001e94a00 000000004eb9a19f ffff883afee53aa0 ffffffff816a3db1
    [80569.408833]  ffff883afee53b78 ffffffff811dbf54 ffffffff00000020 ffff883afee53b88
    [80569.410731]  ffff883afee53b38 656a624f8190fff8 616d657220737463 6e6920676e696e69
    [80569.412630] Call Trace:
    [80569.414005]  [<ffffffff816a3db1>] dump_stack+0x19/0x1b
    [80569.415627]  [<ffffffff811dbf54>] slab_err+0xb4/0xe0
    [80569.417204]  [<ffffffff811e0623>] ? __kmalloc+0x1e3/0x230
    [80569.420419]  [<ffffffff811e1939>] kmem_cache_close+0x149/0x2e0
    [80569.422006]  [<ffffffff811e1ae4>] __kmem_cache_shutdown+0x14/0x80
    [80569.423606]  [<ffffffff811a6874>] kmem_cache_destroy+0x44/0xf0
    [80569.425149]  [<ffffffff811f6019>] kmem_cache_destroy_memcg_children+0x89/0xb0
    [80569.426800]  [<ffffffff811a6849>] kmem_cache_destroy+0x19/0xf0
    [80569.428309]  [<ffffffff8123b18e>] bioset_free+0xce/0x110
    [80569.431306]  [<ffffffffc06d0b43>] dm_destroy+0x13/0x20 [dm_mod]
    [80569.432803]  [<ffffffffc06d69be>] dev_remove+0x11e/0x180 [dm_mod]
    [80569.435851]  [<ffffffffc06d7015>] ctl_ioctl+0x1e5/0x500 [dm_mod]
    [80569.437363]  [<ffffffffc06d7343>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
    [80569.438882]  [<ffffffff8121524d>] do_vfs_ioctl+0x33d/0x540
    [80569.443291]  [<ffffffff812154f1>] SyS_ioctl+0xa1/0xc0
    [80569.446228]  [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
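  • Cause

    When the memory cgroup kmem feature is in use and kmem_cache_destroy destroys a kmem_cache, it deletes the memcg caches first and only then checks whether refcount is 0. Because refcount is not 0, other legitimate tasks may still be trying to allocate slabs from the memcg cache of the current kmem_cache, and the resulting race triggers the crash.

  • Solution

    We recommend that you disable the memory cgroup kmem feature on the ECS instance. Perform the following steps:

    1. Run the following command to open the /etc/default/grub file.

      vim /etc/default/grub
    2. Press i to enter edit mode, and add the following setting to GRUB_CMDLINE_LINUX.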

      cgroup.memory=nokmem


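    3. Press Esc to exit edit mode, type :wq, and press Enter to save and close the file.

    4. Run the following command to update GRUB.

      grub2-mkconfig -o /boot/grub2/grub.cfg
    5. Run the following command to restart the ECS instance.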

      reboot

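    If your operating system cannot disable memory cgroup kmem through the kernel command line (cmdline), make sure that no program on the ECS instance sets memory.kmem.limit_in_bytes. This keeps the memory cgroup kmem feature disabled.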
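After the restart, you may want to confirm that the option actually reached the kernel. The sketch below checks the kernel command line for it. The function name kmem_disabled and its optional file argument are hypothetical, and the pattern assumes the option is written either alone or in a comma-separated list such as cgroup.memory=nosocket,nokmem.

```shell
#!/bin/sh
# Return 0 if the kernel command line in file $1 (default: /proc/cmdline)
# carries the nokmem option for cgroup.memory, 1 if it does not,
# and 2 if the file cannot be read.
kmem_disabled() {
    cmdline_file=${1:-/proc/cmdline}
    [ -r "$cmdline_file" ] || return 2
    # Split the command line into one parameter per line, then match.
    tr ' ' '\n' < "$cmdline_file" \
        | grep -Eq '^cgroup\.memory=([a-z]+,)*nokmem(,[a-z]+)*$'
}

if kmem_disabled; then
    echo "cgroup.memory=nokmem is active"
else
    echo "cgroup.memory=nokmem not found on the kernel command line"
fi
```

If the option is missing after the restart, check that GRUB_CMDLINE_LINUX was saved correctly and that the GRUB configuration was regenerated at the path your system boots from.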

Instance crashes with the log "unable to handle kernel NULL pointer dereference"

  • Problem description

    A Linux ECS instance crashes while running and logs "unable to handle kernel NULL pointer dereference". The call trace is similar to the following:

    [8794845.086660] BUG: unable to handle kernel NULL pointer dereference at (null)
    [8794845.088500] IP: [<ffffffff8128f89c>] kref_get+0xc/0x30
    [8794845.089355] PGD 812ca2067 PUD 6dd707067 PMD 0 
    [8794845.090303] Oops: 0000 [#1] SMP 
    [8794845.091005] last sysfs file: /sys/devices/system/cpu/online
    [8794845.091861] CPU 3 
    [8794845.092212] Modules linked in: ysec_firewall_kmod(U) tcp_diag inet_diag nf_conntrack_netlink nfnetlink nf_conntrack_ipv6 nf_defrag_ipv6 ip6_tables xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ipv6 virtio_balloon virtio_net virtio_console i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ysec_firewall_kmod]
    [8794845.101913] 
    [8794845.102621] Pid: 21908, comm: ysec_hids_mod_l Tainted: G        W  ---------------    2.6.32-504.16.2.el6.x86_64 #1 Alibaba Cloud Alibaba Cloud ECS
    [8794845.105481] RIP: 0010:[<ffffffff8128f89c>]  [<ffffffff8128f89c>] kref_get+0xc/0x30
    [8794845.107400] RSP: 0018:ffff88045f5a3e38  EFLAGS: 00010292
    [8794845.108628] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000fffffff3
    [8794845.110501] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    [8794845.112371] RBP: ffff88045f5a3e48 R08: 0000000000000000 R09: ffff88050f507f00
    [8794845.114133] R10: 0000000000000003 R11: 0000000000000206 R12: ffffffff8161b040
    [8794845.115994] R13: 0000000000000040 R14: 00007f4b457f94d0 R15: 0000000000000000
    [8794845.117865] FS:  00007f4b457fb700(0000) GS:ffff880030380000(0000) knlGS:0000000000000000
    [8794845.119846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [8794845.121055] CR2: 0000000000000000 CR3: 00000006f6837000 CR4: 00000000001406e0
    [8794845.122807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [8794845.124685] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [8794845.126558] Process ysec_hids_mod_l (pid: 21908, threadinfo ffff88045f5a2000, task ffff8806d43acab0)
    [8794845.128689] Stack:
    [8794845.129414]  ffff88045f5a3e68 0000000000000000 ffff88045f5a3e68 ffffffff810d6ae6
    [8794845.131107] <d> ffffffff8161b040 ffff8806c03a3520 ffff88045f5a3ef8 ffffffff81203898
    [8794845.133479] <d> 00007f4b457f9510 0000000000000000 ffff88045f5a3eb8 ffffffff8128c635
    [8794845.136365] Call Trace:
    [8794845.137127]  [<ffffffff810d6ae6>] pidns_get+0x26/0x30
    [8794845.138367]  [<ffffffff81203898>] proc_ns_readlink+0xc8/0x180
    [8794845.139665]  [<ffffffff8128c635>] ? _atomic_dec_and_lock+0x55/0x80
    [8794845.141008]  [<ffffffff811ab151>] ? touch_atime+0x71/0x1a0
    [8794845.142268]  [<ffffffff81193b0e>] sys_readlinkat+0xfe/0x120
    [8794845.143536]  [<ffffffff81193b4b>] sys_readlink+0x1b/0x20
    [8794845.144695]  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
  • Cause

    The kernel or a driver accessed invalid memory.
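To begin narrowing down which code touched the invalid address, you can pull the faulting location out of the saved kernel log. The sketch below prints the symbol from the first "IP:" line. The function name fault_site is hypothetical, and the pattern assumes the el6/el7 log layout shown above; newer kernels format this line differently.

```shell
#!/bin/sh
# Read a kernel log on stdin and print the faulting location taken from
# the first "IP: [<address>] symbol+offset/size" line.
fault_site() {
    # Requiring whitespace before "IP:" skips the later "RIP:" lines.
    sed -n 's/^.*[[:space:]]IP: \[<[0-9a-f]*>] \(.*\)$/\1/p' | head -n 1
}

# Example (the path is an assumption about where kdump stored the log):
#   fault_site < /var/crash/<dump directory>/vmcore-dmesg.txt
```

For the sample log above, this prints kref_get+0xc/0x30, which matches the function at the top of the call trace that received the NULL pointer.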

  • Solution

Instance crashes with the log "unable to handle kernel paging request at"

  • Problem description

    A Linux ECS instance crashes while running and logs "unable to handle kernel paging request at". The call trace is similar to the following:

    [85899.344803] BUG: unable to handle kernel paging request at ffffffffc0b0ceef
    [85899.345643] IP: [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
    [85899.346119] PGD 24f212067 PUD 24f214067 PMD 24e421067 PTE 0
    [85899.346670] Oops: 0010 [#1] SMP 
    [85899.346982] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth rfkill ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink xt_addrtype br_netfilter tcp_diag inet_diag xt_set ip_set_hash_ip tampercfg(OE) overlay(T) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iosf_mbi ppdev virtio_balloon crc32_pclmul parport_pc ghash_clmulni_intel parport shpchp i2c_piix4 aesni_intel lrw gf128mul glue_helper joydev
    [85899.354796]  ablk_helper pcspkr cryptd ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_console virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel virtio_pci i2c_core serio_raw virtio_ring floppy virtio [last unloaded: tampercore]
    [85899.358255] CPU: 2 PID: 1 Comm: systemd Tainted: G           OE  ------------ T 3.10.0-862.14.4.el7.x86_64 #1
    [85899.359264] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
    [85899.360050] task: ffff9880fa2c0000 ti: ffff9880fa2c8000 task.ti: ffff9880fa2c8000
    [85899.360817] RIP: 0010:[<ffffffffc0b0ceef>]  [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
    [85899.361636] RSP: 0018:ffff9880fa2cbd30  EFLAGS: 00010246
    [85899.362181] RAX: 0000000000000000 RBX: 000055a50e52e3c0 RCX: 0000000000000000
    [85899.362913] RDX: 0000000180080006 RSI: fffff786c5c52800 RDI: 0000000040000000
    [85899.363645] RBP: ffff9880fa2cbf48 R08: ffff9880f14a0000 R09: 0000000180080005
    [85899.364372] R10: 00000000f14a3001 R11: fffff786c5c52800 R12: ffff9880fa2cbd30
    [85899.365107] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [85899.365840] FS:  00007fa181b3a940(0000) GS:ffff9883bfc80000(0000) knlGS:0000000000000000
    [85899.366669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [85899.367257] CR2: ffffffffc0b0ceef CR3: 000000024ed44000 CR4: 00000000003606e0
    [85899.367992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [85899.368728] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [85899.369453] Call Trace:
    [85899.369726]  [<ffffffffa392579b>] system_call_fastpath+0x22/0x27
    [85899.370339] Code:  Bad RIP value.
    [85899.370729] RIP  [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
    [85899.371292]  RSP <ffff9880fa2cbd30>
    [85899.373188] CR2: ffffffffc0b0ceef
  • Cause

    The kernel or a driver accessed invalid memory.
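When a driver is the suspect, it can help to list the loaded modules that carry a taint annotation in the "Modules linked in:" line, such as (OE), where O marks an out-of-tree module and E an unsigned one. The sketch below does that. The function name annotated_modules is hypothetical, and it reads only the first line of the module list, so modules on a continuation line are not reported.

```shell
#!/bin/sh
# Read a kernel log on stdin and print every module from the
# "Modules linked in:" line that has an annotation such as (OE) or (U).
annotated_modules() {
    grep 'Modules linked in:' \
        | sed 's/^.*Modules linked in://; s/\[last unloaded:.*$//' \
        | tr ' ' '\n' \
        | grep -E '^[A-Za-z0-9_]+\([A-Z]+\)$'
}

# Example (the path is an assumption about where kdump stored the log):
#   annotated_modules < /var/crash/<dump directory>/vmcore-dmesg.txt
```

For the sample log above, this prints tampercfg(OE) and overlay(T). An annotated module is only a lead to investigate, not proof that the module caused the crash.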

  • Solution
