Velvet Star Monitor

Machine Check Exception

Writer Andrew Mclaughlin

I am running Ubuntu Server on a Dell PowerEdge server and found the following log entry in the server's dmesg output. Dell Pro Support asked me to run Dell's DSET diagnostics. DSET reported no hardware problems, and the support person said that this log message is a reporting problem in Ubuntu. Can this be a software bug in Ubuntu?

Thanks

Sami

[1457944.748752] sbridge: HANDLING MCE MEMORY ERROR
[1457944.748761] CPU 1: Machine Check Exception: 0 Bank 10: 8c000046000800c1
[1457944.748763] TSC 0 ADDR 2df41c3000 MISC 900080008000c8c PROCESSOR 0:306e4 TIME 1395313612 SOCKET 1 APIC 20
[1457945.659958] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#1_DIMM#0 (channel:1 slot:0 page:0x2df41c3 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 channel_mask:1 rank:0)

3 Answers

I have an update on this issue. The problem was finally found: the cause was a faulty DIMM module. Interestingly, none of Dell's diagnostic tests revealed it.
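For anyone trying to narrow down which DIMM is involved, here is a rough sketch that counts EDAC messages per DIMM label in the kernel log. The function name edac_dimm_counts is made up for illustration, and the pattern assumes labels shaped like the CPU_SrcID#1_Channel#1_DIMM#0 string in the question; other EDAC drivers may label DIMMs differently.

```shell
# Count EDAC messages per DIMM label.
# Reads kernel log text on stdin and prints "<label> <count>" per DIMM.
# Assumption: labels look like CPU_SrcID#N_Channel#N_DIMM#N, as in the
# log line quoted in the question.
edac_dimm_counts() {
    grep -o 'CPU_SrcID#[0-9]*_Channel#[0-9]*_DIMM#[0-9]*' \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}'
}

# Typical use (reading the kernel log may require root):
#   dmesg | edac_dimm_counts
```

A DIMM whose count keeps climbing is a reasonable candidate to swap or reseat first, although the count alone does not prove that the module is at fault.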

According to Dell, the EDAC software actually hides the error from Dell's own hardware tools. You have to blacklist the EDAC kernel module so that the errors pass through to those tools.

This is probably a hardware-related bug.

After a lot of diagnostics and working with vendor support, it appears this is almost certainly a hardware problem with some versions of X9DR3-LN4+ motherboards.

The problem boards report "REV:1.10" as their Version in 'dmidecode -t baseboard'.
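To check a batch of machines quickly, something like the following pulls just the Version field out of the dmidecode output. It is only a sketch: the helper name baseboard_version is invented here, and it assumes the usual "Version: ..." line layout that dmidecode prints.

```shell
# Print the first Version field found in `dmidecode -t baseboard`
# output supplied on stdin.
baseboard_version() {
    sed -n 's/^[[:space:]]*Version:[[:space:]]*//p' | head -n 1
}

# Typical use (dmidecode normally needs root):
#   sudo dmidecode -t baseboard | baseboard_version
```

If your dmidecode supports string keywords, 'dmidecode -s baseboard-version' should give the same value without any parsing.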

At our site, older boards with a Version of "0123456789" have not produced the errors, and we are replacing the faulty boards with newer boards of the same model, Version "REV:1.20A".

On the faulty motherboards, the errors seem to manifest mostly with the higher-speed 2.90 GHz E5-2690 processors and full (24 RDIMM) RAM configs, but we have been able to reproduce them with fewer RDIMMs.

FWIW, memtester did not generate the errors; the method I hit upon was just to exercise the buffer cache. So on a system with 384 GB of RAM, I'd put about 400 GB of data in a local file system mounted at /scratch, and do:

while true ; do tar cf - /scratch | cat - >/dev/null ; done

(In my experiments, having tar write directly to /dev/null did not work; the "cat - >/dev/null" was required. This is likely because GNU tar detects a /dev/null archive and skips reading the file data.) While this is running, you can check the error counts with this:

cat /sys/devices/system/edac/mc/mc?/ce*count
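If you want the counts labelled per memory controller, a small wrapper along these lines may be easier to read. This is a sketch, not anything official: the function name edac_ce_counts and its optional base-directory argument are my own additions, and it reads only ce_count, whereas the glob above also matches ce_noinfo_count.

```shell
# Print "<controller> <corrected error count>" for each EDAC memory
# controller. The optional first argument overrides the sysfs base
# directory (hypothetical parameter, handy for trying this out on a
# copy of the tree).
edac_ce_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    for f in "$base"/mc*/ce_count; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$f" ] || continue
        mc=${f%/ce_count}
        printf '%s %s\n' "${mc##*/}" "$(cat "$f")"
    done
}

# Typical use while the tar loop is running:
#   while sleep 60; do date; edac_ce_counts; done
```

Polling once a minute is plenty at the error rates described here.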

The observed error rate was usually at least one MCE per hour.
