故障排查经历:负载高CPU低

今天我们一台服务器出了问题,负载异常高。正常情况下它的负载应该是0.5 -1.5左右,但是今天突然变成了5-6。我上去看,cpu很闲,IO也很少啊。没看出任何异常。为啥呢。

后来用ps发现有几个进程始终处于uninterruptible sleep状态

$ ps aux|grep power
root 23840 0.0 0.0 0 0 ? D Feb06 0:01 [power_saving/0]
root 23841 0.0 0.0 0 0 ? D Feb06 0:01 [power_saving/1]
root 23844 0.0 0.0 0 0 ? D Feb06 0:00 [power_saving/2]
root 23845 0.0 0.0 0 0 ? D Feb06 0:00 [power_saving/3]

它们虽然不占CPU,但是会被记录到load中。恰好有4个进程,恰好比平时的负载大了4。

但是它们是从哪来的呢? 我向运维工程师申请了查看/var/log/message的权限,发现

Feb 6 01:48:03 xxx kernel: CPU5: Package power limit normal
Feb 6 01:48:03 xxx kernel: CPU17: Package power limit normal
Feb 6 01:48:03 xxx kernel: CPU23: Package power limit normal
Feb 6 01:50:25 xxx kernel: INFO: task kacpi_notify:179 blocked for more than 120 seconds.
Feb 6 01:50:25 xxx kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 6 01:50:25 xxx kernel: kacpi_notify D 0000000000000000 0 179 2 0x00000000
Feb 6 01:50:25 xxx kernel: ffff88032aa4db30 0000000000000046 0000000000000001 0000000000000001
Feb 6 01:50:25 xxx kernel: 0000000000000001 ffff8801e6187eb0 ffff88032aa4dab0 0000000000000000
Feb 6 01:50:25 xxx kernel: ffff88032aa44638 ffff88032aa4dfd8 000000000000fb88 ffff88032aa44638
Feb 6 01:50:25 xxx kernel: Call Trace:
Feb 6 01:50:25 xxx kernel: [] schedule_timeout+0x215/0x2e0
Feb 6 01:50:25 xxx kernel: [] ? acpi_os_release_object+0xe/0x12
Feb 6 01:50:25 xxx kernel: [] wait_for_common+0x123/0x180
Feb 6 01:50:25 xxx kernel: [] ? default_wake_function+0x0/0x20
Feb 6 01:50:25 xxx kernel: [] wait_for_completion+0x1d/0x20
Feb 6 01:50:25 xxx kernel: [] kthread_stop+0x4b/0xd0
Feb 6 01:50:25 xxx kernel: [] acpi_pad_idle_cpus+0xbc/0xd6 [acpi_pad]
Feb 6 01:50:25 xxx kernel: [] acpi_pad_handle_notify+0x96/0x196 [acpi_pad]
Feb 6 01:50:25 xxx kernel: [] ? dequeue_task_fair+0x12b/0x130
Feb 6 01:50:25 xxx kernel: [] ? __switch_to+0xd0/0x320
Feb 6 01:50:26 xxx kernel: [] ? thread_return+0x4e/0x76e
Feb 6 01:50:26 xxx kernel: [] ? acpi_hw_write_port+0x43/0x98
Feb 6 01:50:26 xxx kernel: [] ? acpi_os_execute_deferred+0x0/0x36
Feb 6 01:50:26 xxx kernel: [] acpi_pad_notify+0x1e/0x5b [acpi_pad]
Feb 6 01:50:26 xxx kernel: [] acpi_ev_notify_dispatch+0x64/0x71
Feb 6 01:50:26 xxx kernel: [] acpi_os_execute_deferred+0x29/0x36
Feb 6 01:50:26 xxx kernel: [] worker_thread+0x170/0x2a0
Feb 6 01:50:26 xxx kernel: [] ? autoremove_wake_function+0x0/0x40
Feb 6 01:50:26 xxx kernel: [] ? worker_thread+0x0/0x2a0
Feb 6 01:50:26 xxx kernel: [] kthread+0x96/0xa0
Feb 6 01:50:26 xxx kernel: [] child_rip+0xa/0x20
Feb 6 01:50:26 xxx kernel: [] ? kthread+0x0/0xa0
Feb 6 01:50:26 xxx kernel: [] ? child_rip+0x0/0x20

应该是个kernel的bug。不管了,重启再说。

此博客中的热门博文

少写代码,多读别人写的代码

在windows下使用llvm+clang

tensorflow distributed runtime初窥