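An Intel CPU bug that makes reboot fail

I ran into this bug last November and should have written it up long ago, because its consequences turned out to be far more disastrous than I had imagined.

The symptom was that some machines failed to come back up after a reboot, and /var/log/messages contained entries like this: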

Nov 15 03:46:09 kernel: INFO: task sh:7684 blocked for more than 120 seconds.
Nov 15 03:46:09 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 03:46:11 kernel: Call Trace:
Nov 15 03:46:11 kernel: [] ? ext4_file_open+0x0/0x130 [ext4]
Nov 15 03:46:11 kernel: [] schedule_timeout+0x215/0x2e0
Nov 15 03:46:12 kernel: [] ? nameidata_to_filp+0x54/0x70
Nov 15 03:46:12 kernel: [] ? cpumask_next_and+0x29/0x50
Nov 15 03:46:12 kernel: [] wait_for_common+0x123/0x180
Nov 15 03:46:12 kernel: [] ? default_wake_function+0x0/0x20
Nov 15 03:46:13 kernel: [] wait_for_completion+0x1d/0x20
Nov 15 03:46:13 kernel: [] sched_exec+0xdc/0xe0
Nov 15 03:46:13 kernel: [] do_execve+0xe0/0x2c0
Nov 15 03:46:13 kernel: [] sys_execve+0x4a/0x80
Nov 15 03:46:13 kernel: [] stub_execve+0x6a/0xc0

A quick search online showed that this is a known bug. See erratum BT81 in http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf, which I quote here:

BT81. TSC is Not Affected by Warm Reset

Problem: The TSC (Time Stamp Counter MSR 10H) should be cleared on reset. Due to this erratum the TSC is not affected by warm reset.

Implication: The TSC is not cleared by a warm reset. The TSC is cleared by power-on reset as expected. Intel has not observed any functional failures due to this erratum.

Intel states this quite confidently, as if nothing were wrong.

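In practice, as long as the following three conditions are all met:

  1. The operating system is Red Hat Enterprise Linux 6.1 - 6.4. (6.5 and later are fine.)

  2. The CPU belongs to the Intel® Xeon® E5, Intel® Xeon® E5 v2, or Intel® Xeon® E7 v2 family.

  3. The machine has not been power-cycled for roughly 200 days or more. (This means a hard reset. Typing reboot remotely from within Linux does not count.)

the operating system will fail to reboot. The temporary workaround is to send someone to the data center, cut the power, and power the machine back on.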
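To get a rough idea of whether a given machine deserves a closer look, the conditions above can be sketched as a small Python script. This is only a heuristic under stated assumptions: the helper names (`parse_uptime_days`, `is_el6_kernel`, `at_risk`) are made up for illustration, the 200-day threshold is the approximate figure quoted above, and the kernel test only looks for ".el6" in the release string, so it cannot tell a fixed kernel from an affected one. It also does not check the CPU model. Note that /proc/uptime counts from the last boot of any kind, so it is only a lower bound on the time since the last power cycle: a machine that was warm-rebooted recently can still be affected.

```python
import platform

# Approximate threshold taken from the text above (an assumption, not an exact value).
UPTIME_THRESHOLD_DAYS = 200


def parse_uptime_days(proc_uptime_text):
    """Convert the contents of /proc/uptime into days since the last boot."""
    # The first field of /proc/uptime is the uptime in seconds.
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400.0


def is_el6_kernel(release):
    """Heuristic: RHEL 6 kernel release strings contain '.el6'."""
    return ".el6" in release


def at_risk(uptime_days, kernel_release, threshold_days=UPTIME_THRESHOLD_DAYS):
    """Return True if the machine looks like a candidate for this bug."""
    return is_el6_kernel(kernel_release) and uptime_days >= threshold_days


if __name__ == "__main__":
    try:
        with open("/proc/uptime") as f:
            days = parse_uptime_days(f.read())
    except OSError:
        print("cannot read /proc/uptime (not a Linux system?)")
    else:
        release = platform.release()
        print("uptime: %.1f days, kernel: %s" % (days, release))
        if at_risk(days, release):
            print("candidate: check CPU model and kernel version against the vendor notes")
```

A negative result here proves nothing, since the uptime resets on a warm reboot while the TSC does not. The time since the last real power cycle has to come from somewhere else, for example data center records.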

For details, see Red Hat's statement: https://access.redhat.com/solutions/433883

If you check the conditions above and find that you are affected, upgrade your kernel as soon as possible.
