Diagnosing memory errors with IPMI

SUMMARY

This article explains how to use ipmiutil to find memory errors recorded in the IPMI firmware system event log (SEL).

ISSUE

Newer Unitrends DPU platforms use IPMI firmware which can log memory errors. For example:

  • Recovery-712
  • Recovery-713
  • Recovery-813
  • Recovery-822
  • Recovery-823
  • Recovery-833-100
  • Recovery-833-200
  • Recovery-943

Use IPMI commands to see memory errors in the firmware log.

RESOLUTION

  1. Download an updated ipmiutil. Skip this step if ipmiutil-3.0.0 or later is already installed.
    • For CentOS 6:
      wget ftp://ftp.unitrends.com/support/Hotfixes/ipmiutil-3.0.0-1_el6.x86_64.rpm
    • For CentOS 5:
      wget ftp://ftp.unitrends.com/support/Hotfixes/ipmiutil-3.0.0-1_el5.x86_64.rpm
  2. Update the RPM package:
    rpm -U ipmiutil-3.0.0*.rpm
  3. Look for any recent memory events:
    ipmiutil sel -e
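
The check in step 1 (skip the download if ipmiutil 3.0.0 or later is installed) can be scripted. The following is a minimal sketch, not a Unitrends-supplied tool: needs_ipmiutil_update is a hypothetical helper name, and it assumes that comparing only the major version number is enough because the threshold is 3.0.0.

```shell
# Hypothetical helper: succeeds (exit 0) when the given ipmiutil version
# is missing, unparseable, or older than 3.0.0, so steps 1 and 2 are needed.
needs_ipmiutil_update() {
    ver="$1"
    major="${ver%%.*}"          # text before the first dot, e.g. "3" from "3.0.0"
    case "$major" in
        ''|*[!0-9]*) return 0 ;;  # empty or non-numeric: treat as needing the update
    esac
    [ "$major" -lt 3 ]
}

# Ask rpm for the installed version; fall back to empty if the query fails.
installed=$(rpm -q --qf '%{VERSION}' ipmiutil 2>/dev/null) || installed=""

if needs_ipmiutil_update "$installed"; then
    echo "ipmiutil is missing or older than 3.0.0; perform steps 1 and 2"
else
    echo "ipmiutil $installed is recent enough; skip to step 3"
fi
```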
    

Below is sample output of a CPLD error, which is usually caused by a memory fault.
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
000a 04/10/13 15:03:41 CRT BMC   #ff CPLD CATERR Asserted 6f [a0 1c ff]
 

Below is sample output of a memory ECC error. When this event appears, run an offline memory test until it completes at least four clean passes.

RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
7840 08/09/11 15:10:47 MIN BMC  Memory #08 Uncorrectable ECC, DIMM6/CPU1 6f [20 ff 10]
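
On a system with a long event log, it can help to show only the CPLD and Memory events. The following is a minimal sketch: filter_memory_events is a hypothetical helper name, and the match patterns are assumptions based only on the two sample lines above, so other firmware versions may word these events differently.

```shell
# Hypothetical helper: keep only SEL lines that mention CPLD or Memory
# as a separate word, matching the sample output in this article.
filter_memory_events() {
    grep -E ' (CPLD|Memory) '
}

# Usage on an appliance with ipmiutil installed:
#   ipmiutil sel -e | filter_memory_events
```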
 

With ipmiutil 3.0.0, the DIMM identifier reported in the event should be more accurate and easier to interpret, as shown below. This error is typically not a memory fault but rather bad data being passed to memory. Review the operating system logs (messages), dmesg output, and other application logs (/usr/bp/logs.dir) to determine the source of these errors.

ipmiutil ver 3.00
ievents version 3.00
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
7840 08/09/11 15:10:47 MIN BMC  Memory #08 Correctable ECC, P1_DIMMF1 6f [20 ff 50]
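
The log review described above can be started with a keyword search. The following is a rough sketch only: grep_memory_hints is a hypothetical helper name, the keyword list is an assumption rather than an official set of patterns, and /var/log/messages is assumed to be the location of the messages log (the CentOS default).

```shell
# Hypothetical helper: print log lines on stdin that contain a
# memory-related keyword (case-insensitive, whole words only).
grep_memory_hints() {
    grep -i -w -E 'ecc|edac|mce|machine check|memory error'
}

# Usage on the appliance:
#   dmesg | grep_memory_hints
#   grep_memory_hints < /var/log/messages
#   grep -r -i -w -E 'ecc|edac|mce|machine check|memory error' /usr/bp/logs.dir
```

A match is only a starting point; read the surrounding log entries to see which process or device was active when the event was logged.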
 

CPLD events are not DIMM-specific. If the event is an ECC error, however, it may identify the faulty DIMM; in that case, replace the specified DIMM.
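
To see which DIMM the ECC events point to, the DIMM label can be pulled out of each event line. The following is a minimal sketch: extract_ecc_dimm is a hypothetical helper name, and it assumes the label is the single field that follows "ECC," as in the two sample ECC lines in this article. CPLD lines produce no output because they carry no DIMM label.

```shell
# Hypothetical helper: print the field that follows "ECC, " on each
# matching SEL line; lines without an ECC event are skipped.
extract_ecc_dimm() {
    sed -n 's/.*ECC, \([^ ]*\) .*/\1/p'
}

# Usage on an appliance with ipmiutil installed, counting events per DIMM:
#   ipmiutil sel -e | extract_ecc_dimm | sort | uniq -c
```

For the sample lines above, this prints DIMM6/CPU1 for the older output and P1_DIMMF1 for the 3.0.0 output.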

CAUSE

The BIOS detects a memory error, either with ECC or with CPLD, and logs it to the IPMI firmware system event log (SEL). 

NOTES

See http://ipmiutil.sourceforge.net for a UserGuide and other files.
For more information, see "Using IPMI LAN for remote access".
