Product information: independent of product model
Software version: storage-series software versions
Symptom: the messages log prints "machine check" a few times or frequently, or frequently prints "EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 ......." or "Unified Memory Controller Error: DRAM ECC error".
On the SP that prints "machine check" or "EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 .......", open the messages log
and search for the keywords "machine check", "memory scrubbing error", and "Unified Memory Controller Error: DRAM ECC error". Sometimes "machine check" is printed alone, and sometimes it is printed together with "EDAC MC0: 1 CE memory scrubbing error" or "Unified Memory Controller Error: DRAM ECC error". Details are as follows:
Case 1: only "machine check" is printed
Mar 18 06:14:50 00-b3-42-03-a4-d7 kernel: mce: [Hardware Error]: Machine check events logged
Mar 22 06:18:14 00-b3-42-03-a4-d7 kernel: mce: [Hardware Error]: Machine check events logged
Apr 10 06:34:24 00-b3-42-03-a4-d7 kernel: mce: [Hardware Error]: Machine check events logged
Jun 23 07:37:23 00-b3-42-03-a4-d7 kernel: mce: [Hardware Error]: Machine check events logged
Jun 27 07:40:48 00-b3-42-03-a4-d7 kernel: mce: [Hardware Error]: Machine check events logged
Jun 28 07:41:39 00-b3-42-03-a4-d7 kernel: mce: [Hardware Error]: Machine check events logged
Case 2: only "EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 ......." is printed
Jan 31 07:16:36 00-b3-42-03-3d-28 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x194a0d6 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:4)
Jan 31 09:39:36 00-b3-42-03-3d-28 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x194a0d6 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:4)
Jan 31 12:02:37 00-b3-42-03-3d-28 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x194a0d6 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:4)
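When comparing many EDAC lines like the ones above, it can help to pull out the location fields and check whether the corrected errors keep landing on the same DIMM and page (in the sample, every line reports page:0x194a0d6). The following is a minimal Python sketch, not a vendor tool. It assumes the line layout shown in the samples, and the function name `parse_edac_ce` and the `loc_*` keys are hypothetical names chosen for illustration.

```python
import re

# Location prefix exactly as it appears in the sample lines above.
LOCATION = re.compile(r"CPU_SrcID#(\d+)_Ha#(\d+)_Chan#(\d+)_DIMM#(\d+)")
# Everything between the parentheses at the end of the line.
DETAILS = re.compile(r"\((.*)\)")
# key:value pairs such as "page:0x194a0d6" or "err_code:0008:00c1".
FIELD = re.compile(r"(\w+):(\S+)")

def parse_edac_ce(line):
    """Return a dict of fields from one EDAC CE line, or None if it does not match.

    Assumption: the line follows the layout of the samples in this article.
    """
    loc = LOCATION.search(line)
    det = DETAILS.search(line)
    if not loc or not det:
        return None
    # Hypothetical key names for the location part of the message.
    keys = ("loc_src_id", "loc_ha", "loc_chan", "loc_dimm")
    record = dict(zip(keys, map(int, loc.groups())))
    # Detail values are kept as strings (hex pages, compound error codes).
    record.update(FIELD.findall(det.group(1)))
    return record
```

Grouping the returned records by `(loc_chan, loc_dimm, page)` then shows at a glance whether a single location is responsible for the repeated messages.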
Case 3: "machine check" is printed together with an accompanying "EDAC MC0: 1 CE memory scrubbing error" message:
Jun 28 13:51:38 00-b3-42-0f-15-56 kernel: mce: [Hardware Error]: Machine check events logged
Jun 28 13:51:38 00-b3-42-0f-15-56 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0xc1a98f offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:2 rank:5)
Jul 4 18:03:45 00-b3-42-0f-15-56 kernel: mce: [Hardware Error]: Machine check events logged
Jul 4 18:03:45 00-b3-42-0f-15-56 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0xc1a98f offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:2 rank:5)
Jul 23 11:55:03 00-b3-42-04-2e-ee kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 11:55:03 00-b3-42-04-2e-ee kernel: [Hardware Error]: Corrected error, no action required.
Jul 23 11:55:03 00-b3-42-04-2e-ee kernel: [Hardware Error]: CPU:6 (18:1:1) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b ----- machine check and Unified Memory Controller Error printed together
Jul 23 11:55:03 00-b3-42-04-2e-ee kernel: [Hardware Error]: Error Addr: 0x00000000d259fb40, Syndrome: 0x000001080a400603, IPID: 0x0000009600150f00
Jul 23 11:55:03 00-b3-42-04-2e-ee kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Jul 23 11:55:03 00-b3-42-04-2e-ee kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Jul 23 11:55:03 00-b3-42-04-2e-ee kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 6 20:58:48 00-b3-42-04-2e-ee kernel: [Hardware Error]: Corrected error, no action required.
Apr 6 20:58:48 00-b3-42-04-2e-ee kernel: [Hardware Error]: CPU:6 (18:1:1) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b ----- Unified Memory Controller Error printed alone
Apr 6 20:58:48 00-b3-42-04-2e-ee kernel: [Hardware Error]: Error Addr: 0x00000000cfe65380, Syndrome: 0x000001080a400603, IPID: 0x0000009600150f00
Apr 6 20:58:48 00-b3-42-04-2e-ee kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Apr 6 20:58:48 00-b3-42-04-2e-ee kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Apr 6 20:58:48 00-b3-42-04-2e-ee kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Case 5: the messages log or the crash file contains "Machine Check Exception: f Bank 1: bd80000000100134"
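To tell the cases above apart in a long messages file, the keyword search can be scripted. The sketch below is one possible approach in Python, not part of any product tooling. It tags each kernel line by keyword and groups the tags by timestamp and host, so that lines logged in the same second show up as printed together. The tag names and the function name `classify` are hypothetical, and the syslog prefix pattern assumes the format of the samples in this article.

```python
import re
from collections import defaultdict

# Keywords taken from the log samples above; the tag names are hypothetical.
PATTERNS = {
    "mce": re.compile(r"Machine check events logged"),
    "edac_ce": re.compile(r"EDAC MC\d+: \d+ CE memory scrubbing error"),
    "umc_ecc": re.compile(r"Unified Memory Controller Error: DRAM ECC error"),
    "mce_exception": re.compile(r"Machine Check Exception:"),
}

# Syslog prefix as in the samples: "Jun 28 13:51:38 <host> kernel: ..."
PREFIX = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+kernel:")

def classify(lines):
    """Map (timestamp, host) to the set of tags seen at that moment."""
    events = defaultdict(set)
    for line in lines:
        m = PREFIX.match(line)
        if not m:
            continue  # not a kernel line in the expected format
        for tag, pattern in PATTERNS.items():
            if pattern.search(line):
                events[(m.group(1), m.group(2))].add(tag)
    return dict(events)
```

A key whose set is `{"mce"}` corresponds to machine check printed alone, while `{"mce", "edac_ce"}` or `{"mce", "umc_ecc"}` corresponds to the accompanying-print cases.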
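For the four cases in 2.1, there is currently no definitive root-cause conclusion, and the specific cause cannot be confirmed. Such messages may be caused by faults in memory, the CPU, memory slots, PCIe, or other components. At most sites where these messages have been seen so far, they were only log output and did not affect device operation or functionality.
1. If either of the following applies, continued observation is recommended:
a. The messages are infrequent, for example any one of "machine check", "memory scrubbing error....." or "Unified Memory Controller Error: DRAM ECC error" is printed no more than 25 times in a week;
b. The messages are very frequent and have continued for more than 1 month without affecting device operation or functionality. In this case, reseat the controller if conditions allow and then continue to observe; if reseating is not possible, simply continue to observe.
2. If either of the following applies, replacing the controller together with its memory modules is recommended:
a. The controller printing the messages rebooted abnormally, and machine check related messages appear near the time of the reboot. Collect the mcelog information and back it up (HG products do not support collecting mcelog information).
b. The messages log or the crash file contains "Machine Check Exception: f Bank 1: bd80000000100134".
3. If the following applies, collect the mcelog information and contact R&D for evaluation:
The messages are still printed frequently after the controller has been rebooted or reseated.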
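The handling rules above can be written down as a small decision helper. This is a hedged Python sketch under stated assumptions, not an official procedure. The article does not say which rule wins when several apply, so the order used here (replacement first, then escalation, then observation) is an assumption. "No more than 25 times in a week" is interpreted as any rolling 7-day window, which is also an assumption. All names (`Observation`, `max_in_window`, `recommend`) are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

def max_in_window(times, window=timedelta(days=7)):
    """Largest number of events inside any rolling window shorter than `window`."""
    times = sorted(times)
    best = start = 0
    for end, t in enumerate(times):
        # Slide the left edge until the span is strictly under the window.
        while t - times[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best

@dataclass
class Observation:
    max_prints_per_week: int
    frequent_for_over_a_month: bool = False
    operation_affected: bool = False
    abnormal_reboot_near_mce: bool = False
    has_mce_exception_signature: bool = False
    still_frequent_after_reboot_or_reseat: bool = False

def recommend(o):
    # Rule 2: abnormal reboot near machine check prints, or the
    # "Machine Check Exception: f Bank 1: ..." signature.
    if o.abnormal_reboot_near_mce or o.has_mce_exception_signature:
        return "replace controller and memory modules"
    # Rule 3: still frequent after reboot or reseat.
    if o.still_frequent_after_reboot_or_reseat:
        return "collect mcelog and contact R&D"
    # Rule 1a: infrequent prints (threshold of 25 per week from the text).
    if o.max_prints_per_week <= 25:
        return "continue to observe"
    # Rule 1b: frequent for over a month with no impact.
    if o.frequent_for_over_a_month and not o.operation_affected:
        return "reseat controller if possible, then continue to observe"
    # Not covered by the article; left undetermined rather than guessed.
    return "undetermined"
```

For example, `recommend(Observation(max_prints_per_week=3))` returns "continue to observe", matching rule 1a. Timestamps for `max_in_window` need a year, which the syslog samples do not carry, so it has to be supplied by the caller.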
All storage product series.
None.
Modified on | Modified by | Remarks |
2023-06-27 15:45:56 [current version] | 莫发昌 | Other reasons... |
2023-05-11 15:21:07 | 莫发昌 | Added other types of messages |
2023-03-16 09:44:45 | 莫发昌 | Created |