Machine Check Exception
Andrew Mclaughlin
I am running Ubuntu Server on a Dell PowerEdge server. I found the following log entry in the server's dmesg output. Dell Pro Support asked me to run Dell's DSET diagnostics. DSET reported no hardware problems, and the support person said that this log message is a reporting problem in Ubuntu. Can this be a software bug in Ubuntu?
Thanks
Sami
[1457944.748752] sbridge: HANDLING MCE MEMORY ERROR
[1457944.748761] CPU 1: Machine Check Exception: 0 Bank 10: 8c000046000800c1
[1457944.748763] TSC 0 ADDR 2df41c3000 MISC 900080008000c8c PROCESSOR 0:306e4 TIME 1395313612 SOCKET 1 APIC 20
[1457945.659958] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#1_DIMM#0 (channel:1 slot:0 page:0x2df41c3 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 channel_mask:1 rank:0)
3 Answers
I have an update on this issue. The problem was finally found: the cause was a faulty DIMM. Interestingly, none of Dell's diagnostic tests revealed it.
According to Dell, the EDAC software actually hides the error from Dell's own hardware tools. You have to blacklist the EDAC kernel module for the error to pass through to them.
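As a rough cross-check of the hardware diagnosis, the status word on the "Bank 10" line of the question's log can be decoded by hand. The sketch below assumes the architectural IA32_MCi_STATUS layout described in the Intel SDM (VAL in bit 63, OVER 62, UC 61, MISCV 59, ADDRV 58, PCC 57, corrected error count in bits 52:38 where the CPU supports it, model-specific code in bits 31:16, MCA error code in bits 15:0). `decode_mci_status` is a hypothetical helper name, not part of any existing tool.

```shell
#!/bin/sh
# Decode selected fields of a 64-bit MCi_STATUS value passed as exactly
# 16 hex digits (no 0x prefix). The value is split into two 32-bit halves
# so the arithmetic stays within what any POSIX shell can represent.
# Bit positions are an assumption taken from the Intel SDM layout.
decode_mci_status() {
    status="$1"
    hi=$((0x${status%????????}))    # bits 63..32 (first 8 hex digits)
    lo=$((0x${status#????????}))    # bits 31..0  (last 8 hex digits)

    echo "VAL=$(( (hi >> 31) & 1 ))"      # bit 63: register contents valid
    echo "OVER=$(( (hi >> 30) & 1 ))"     # bit 62: an earlier error was overwritten
    echo "UC=$(( (hi >> 29) & 1 ))"       # bit 61: 1 = uncorrected, 0 = corrected
    echo "MISCV=$(( (hi >> 27) & 1 ))"    # bit 59: MISC register valid
    echo "ADDRV=$(( (hi >> 26) & 1 ))"    # bit 58: ADDR register valid
    echo "PCC=$(( (hi >> 25) & 1 ))"      # bit 57: processor context corrupt
    echo "CE_COUNT=$(( (hi >> 6) & 0x7fff ))"           # bits 52..38
    printf 'MSCOD=0x%04x\n' $(( (lo >> 16) & 0xffff ))  # bits 31..16
    printf 'MCACOD=0x%04x\n' $(( lo & 0xffff ))         # bits 15..0
}

# Example: decode_mci_status 8c000046000800c1
```

For the value in the log, this reports VAL=1, UC=0 and CE_COUNT=1 with codes 0x0008 and 0x00c1, which agrees with the "1 CE" and "err_code:0008:00c1" in the EDAC line: a valid, corrected memory error rather than a reporting glitch.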
This is probably a hardware-related bug.
- Fedora Bugzilla. From the comments, a method for diagnosing it:
After a lot of diagnostics and working with vendor support, it appears this is almost certainly a hardware problem with some versions of X9DR3-LN4+ motherboards.
The problem boards report "REV:1.10" as their Version in 'dmidecode -t baseboard'.
At our site, older boards with a Version of "0123456789" have not produced the errors, and we are replacing the faulty boards with newer boards of the same model, Version "REV:1.20A".
On the faulty motherboards, the errors seem to manifest mostly with the higher speed 2.90 GHz E5-2690 processors and full (24 RDIMM) RAM configs, but we have been able to reproduce it with fewer RDIMMs.
FWIW, memtester did not generate the errors; the method I hit upon was just to exercise the buffer cache. So on a system with 384 GB of RAM, I'd put about 400 GB of data in a local file system mounted at /scratch, and do:
    while true ; do tar cf - /scratch | cat - >/dev/null ; done
(In my experiments, writing to /dev/null from tar would not work... the "cat - >/dev/null" was required.) While this is running, you can check the error counts with this:
    cat /sys/devices/system/edac/mc/mc?/ce*count
The observed error rate was usually at least one MCE error per hour.
- Some more possible checks you can perform: I'm getting MCE (Machine Check Exception) errors, what does this mean?
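The EDAC counter check quoted from the Fedora report can be wrapped in a small helper that labels each memory controller and adds up the totals, which makes it easier to see whether the count is climbing while the buffer cache is being exercised. This is only a sketch: `edac_ce_report` is a hypothetical name, it reads only the per-controller `ce_count` files (not `ce_noinfo_count`), and it assumes the sysfs path from the quoted command.

```shell
#!/bin/sh
# Print the corrected-error (CE) count of each EDAC memory controller
# under the given directory, followed by the sum over all controllers.
# The default path is the one used in the quoted 'cat' command.
edac_ce_report() {
    root="${1:-/sys/devices/system/edac/mc}"
    total=0
    for f in "$root"/mc*/ce_count; do
        [ -r "$f" ] || continue              # glob matched nothing, or unreadable
        count=$(cat "$f")
        mc=$(basename "$(dirname "$f")")     # e.g. mc0, mc1
        echo "$mc ce_count=$count"
        total=$((total + count))
    done
    echo "total ce_count=$total"
}

# Example (poll once a minute while the tar loop runs in another shell):
#   while sleep 60; do date; edac_ce_report; done
```

A total that keeps rising under load points toward the memory or board rather than a logging problem; a total that stays flat does not rule hardware out, since the errors described above only appeared under specific load.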