Linux: How to find the faulty memory module from an MCE message?
Apr 13 22:39:22 m@L_450_0@ kernel: [36247975.116860] sbridge: HANDLING MCE MEMORY ERROR
Apr 13 22:39:22 m@L_450_0@ kernel: [36247975.116867] CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010090
Apr 13 22:39:22 m@L_450_0@ kernel: [36247975.116869] TSC 0 ADDR 4a0d75900 MISC 21405cdc86 PROCESSOR 0:206d7 TIME 1428957562 SOCKET 0 APIC 0
Apr 13 22:39:22 m@L_450_0@ kernel: [36247975.951013] EDAC MC0: 1 CE memory read error
I suspect a bad memory module. The server is a 2x Xeon E5-2650 with 8x 8 GB memory modules (each CPU has 8 memory slots).
Here is what lshw reports for the memory modules:
  *-memory:0
       description: System Memory
       physical id: 2d
       slot: System board or motherboard
     *-bank:0
          description: DIMM DDR3 1333 MHz (0.8 ns)
          product: 9965516-197.A
          vendor: Kingston
          physical id: 0
          serial: B83AE5C2
          slot: P1_DIMMA1
          size: 8GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
     *-bank:1
          description: DIMM Synchronous [empty]
          product: Dimm1_PartNum
          vendor: Dimm1_Manufacturer
          physical id: 1
          serial: Dimm1_SerNum
          slot: P1_DIMMA2
          width: 64 bits
     *-bank:2
          description: DIMM DDR3 1333 MHz (0.8 ns)
          product: 9965516-048.A
          vendor: Kingston
          physical id: 2
          serial: EC309238
          slot: P1_DIMMB1
          size: 8GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
     *-bank:3
          description: DIMM Synchronous [empty]
          product: Dimm4_PartNum
          vendor: Dimm4_Manufacturer
          physical id: 3
          serial: Dimm4_SerNum
          slot: P1_DIMMB2
          width: 64 bits
     *-bank:4
          description: DIMM DDR3 1333 MHz (0.8 ns)
          product: 9965516-048.A
          vendor: Kingston
          physical id: 4
          serial: E9305438
          slot: P1_DIMMC1
          size: 8GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
     *-bank:5
          description: DIMM Synchronous [empty]
          product: Dimm7_PartNum
          vendor: Dimm7_Manufacturer
          physical id: 5
          serial: Dimm7_SerNum
          slot: P1_DIMMC2
          width: 64 bits
     *-bank:6
          description: DIMM DDR3 1333 MHz (0.8 ns)
          product: 9965516-048.A
          vendor: Kingston
          physical id: 6
          serial: E7305738
          slot: P1_DIMMD1
          size: 8GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
     *-bank:7
          description: DIMM Synchronous [empty]
          product: Dimm10_PartNum
          vendor: Dimm10_Manufacturer
          physical id: 7
          serial: Dimm10_SerNum
          slot: P1_DIMMD2
          width: 64 bits
  *-memory:1
       description: System Memory
       physical id: 3f
       slot: System board or motherboard
     *-bank:0
          description: DIMM DDR3 1333 MHz (0.8 ns)
          product: 9965516-197.A
          vendor: Kingston
          physical id: 0
          serial: B63A08C3
          slot: P2_DIMME1
          size: 8GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
     *-bank:1
          description: DIMM Synchronous [empty]
          product: Dimm1_PartNum
          vendor: Dimm1_Manufacturer
          physical id: 1
          serial: Dimm1_SerNum
          slot: P2_DIMME2
          width: 64 bits
     *-bank:2
          description: DIMM DDR3 1333 MHz (0.8 ns)
          product: 9965516-048.A
          vendor: Kingston
          physical id: 2
          serial: EA309638
          slot: P2_DIMMF1
          size: 8GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
     *-bank:3
          description: DIMM Synchronous [empty]
          product: Dimm4_PartNum
          vendor: Dimm4_Manufacturer
          physical id: 3
          serial: Dimm4_SerNum
          slot: P2_DIMMF2
          width: 64 bits
     *-bank:4
          description: DIMM DDR3 1333 MHz (0.8 ns)
          product: 9965516-048.A
          vendor: Kingston
          physical id: 4
          serial: E7305938
          slot: P2_DIMMG1
          size: 8GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
     *-bank:5
          description: DIMM Synchronous [empty]
          product: Dimm7_PartNum
          vendor: Dimm7_Manufacturer
          physical id: 5
          serial: Dimm7_SerNum
          slot: P2_DIMMG2
          width: 64 bits
     *-bank:6
          description: DIMM DDR3 1333 MHz (0.8 ns)
          product: 9965516-048.A
          vendor: Kingston
          physical id: 6
          serial: E7305B38
          slot: P2_DIMMH1
          size: 8GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
     *-bank:7
          description: DIMM Synchronous [empty]
          product: Dimm10_PartNum
          vendor: Dimm10_Manufacturer
          physical id: 7
          serial: Dimm10_SerNum
          slot: P2_DIMMH2
          width: 64 bits
  *-memory:2 UNCLAIMED
       physical id: 7
  *-memory:3 UNCLAIMED
       physical id: 9
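As you can see, there is no memory module in bank #5. So my questions are: do you agree that this message is about a memory failure? And if so, how can I find the module to replace?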
The event you received is a CE (correctable error) event. Events like this suggest that a DIMM is starting to fail.
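The CE classification can be cross-checked against the status word in the log (8c00004000010090). The sketch below decodes a few architectural fields of the IA32_MCi_STATUS register, following the bit layout in the Intel SDM. The function name `decode_mci_status` and the dictionary keys are made up for illustration and are not part of any existing tool. It assumes the corrected-error count field (bits 52:38) is implemented on this CPU. Note also that "Bank 5" in the MCE line is a machine-check register bank inside the CPU, not the `bank:5` slot listed by lshw, so an empty slot 5 does not contradict the message.

```python
# Request types for the "memory controller error" compound code
# (low 16 bits of the form 0000 0000 1MMM CCCC).
MEM_REQUEST = {
    0b000: "generic undefined request",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Decode selected architectural fields of an IA32_MCi_STATUS value."""
    mca_code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),        # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),     # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),  # UC: 0 means hardware corrected it
        "misc_valid": bool((status >> 59) & 1),   # MISCV: MISC register is meaningful
        "addr_valid": bool((status >> 58) & 1),   # ADDRV: ADDR register is meaningful
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": mca_code,
    }
    # Only interpret MMM/CCCC when the code matches the memory controller format.
    if mca_code & 0xFF80 == 0x0080:
        info["request"] = MEM_REQUEST.get((mca_code >> 4) & 0b111, "reserved")
        channel = mca_code & 0xF
        info["channel"] = None if channel == 0xF else channel  # 1111 = not specified
    return info


if __name__ == "__main__":
    for key, value in decode_mci_status(0x8C00004000010090).items():
        print(f"{key}: {value}")
```

For the value in the log, the UC bit is clear and the request type decodes to a memory read, which agrees with the "1 CE memory read error" line from EDAC. Treat the channel field as a hint only: how it maps to a physical DIMM slot depends on the platform and on which unit logged the error.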
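EDAC did not report anything specific about which memory row or channel the error refers to, so it is hard to tell which module to replace until it fails outright.
But take a look at /sys/devices/system/edac/mc/mc*; it may tell you more about which row/DIMM is the faulty one.
For example: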
ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count 0 csrow1 0 csrow4 0 csrow7 0 reset_counters 0 size_mb
0 ce_noinfo_count 0 csrow2 0 csrow5 0 device 0 sdram_scrub_rate 0 ue_count
0 csrow0 0 csrow3 0 csrow6 0 mc_name 0 seconds_since_reset 0 ue_noinfo_count
Look at the ce_count field.
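Checking every counter by hand gets tedious with eight csrow directories per controller. The following is a minimal sketch that walks the EDAC sysfs tree and collects the per-channel corrected-error counters. It assumes the csrow-based layout shown in the listing above, with `chN_ce_count` and `chN_dimm_label` files inside each csrow directory; the function name `read_ce_counts` is hypothetical, and the labels are only as good as what the driver or the administrator has set.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return (controller, csrow, channel, dimm_label, ce_count) tuples."""
    results = []
    root = Path(edac_root)
    for mc in sorted(root.glob("mc[0-9]*")):
        for csrow in sorted(mc.glob("csrow[0-9]*")):
            for count_file in sorted(csrow.glob("ch[0-9]*_ce_count")):
                channel = count_file.name.split("_")[0]  # e.g. "ch0"
                count = int(count_file.read_text().strip())
                label_file = csrow / f"{channel}_dimm_label"
                # The label file may be absent; fall back to an empty string.
                label = label_file.read_text().strip() if label_file.exists() else ""
                results.append((mc.name, csrow.name, channel, label, count))
    return results


if __name__ == "__main__":
    # Print only the locations that have logged at least one corrected error.
    for mc, csrow, channel, label, count in read_ce_counts():
        if count > 0:
            print(f"{mc}/{csrow}/{channel} label={label or '?'} ce_count={count}")
```

A csrow/channel pair whose count keeps growing over time is the most likely candidate; matching it to a physical slot such as P1_DIMMA1 may still require the motherboard manual if the labels are generic.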
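As a side note:
The system can keep running, but less safely. Preventive maintenance and proactive replacement of DIMMs that exhibit CEs reduce the likelihood of the dreaded UE (uncorrectable error) events and system panics.
More information about EDAC: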