
Detect Critical Hardware Error

When monitoring key elements of your server infrastructure, a main goal is to be alerted to hardware errors as soon as they occur. E-mail alerts are a common way to do this.

There are many ways to implement alerts for hardware errors recorded in the Supermicro IPMI event log. You can script this at the OS level using a language of your choice, or you can create SMTP alerts within the IPMI layer itself. Either way, the result is an e-mail alert whenever a critical hardware error is detected.

We chose to go the CentOS and bash route. We are specifically targeting Processor IERR entries in the IPMI event log. These errors usually indicate an issue with the CPU or RAM, and we want to know about them as soon as possible since in most cases, they will cause the system to crash and render the OS and/or server unusable.

These are the contents of the bash script (IERR_alert.sh) with some basic logic to pull the specific IERR event log entry:

The body of the e-mail is in the IERR_email.txt file and contains a basic message notifying your mailbox that a processor IERR has been detected in the IPMI event log. If the string ‘IERR’ is detected from the ‘ipmitool sel list’ output, the script triggers the /bin/mail tool to send an e-mail message contained in the IERR_email.txt file to a mailbox of your choice.
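The logic described above can be sketched as follows. This is a minimal illustration, not the original script: the mailbox address, the subject line, the variable names (ALERT_TO, BODY_FILE, SEL_CMD, MAIL_CMD) and the check_ierr function are assumptions made for the example.

```shell
#!/bin/bash
# IERR_alert.sh: illustrative sketch, not the original script.

# Assumed settings; override them from the environment as needed.
ALERT_TO="${ALERT_TO:-admin@example.com}"   # placeholder mailbox
BODY_FILE="${BODY_FILE:-IERR_email.txt}"    # file holding the e-mail body
SEL_CMD="${SEL_CMD:-ipmitool sel list}"     # prints the IPMI event log
MAIL_CMD="${MAIL_CMD:-/bin/mail}"           # mail tool used to send the alert

check_ierr() {
    # Capture the event log; give up if it cannot be read at all.
    if ! sel_output=$($SEL_CMD); then
        return 1
    fi

    # Send the prepared message only if an IERR entry is present.
    if printf '%s\n' "$sel_output" | grep -q 'IERR'; then
        $MAIL_CMD -s "Processor IERR detected on $(uname -n)" "$ALERT_TO" < "$BODY_FILE" || return 1
        echo "IERR detected, alert sent"
    else
        echo "No errors detected"
    fi
}

# Run the check once; watch or cron supplies the repetition.
check_ierr || echo "IERR check could not be completed" >&2
```

Reading the log into a variable first lets the sketch tell "no IERR found" apart from "ipmitool failed", which a plain pipe into grep would report as the same thing.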

Next, we need to make the script executable:

chmod +x IERR_alert.sh

All that is left to do now is configure the script to run automatically at a fixed interval to look for the IERR. We run it every 30 minutes.

Just like the notification setup, there are various ways to have the script run automatically at a set interval. One way would be to create a cron job to run the script. Another method is to add some logic within the script so that you can execute it once and it will run in a loop while searching for the IERR.
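As a rough sketch of those two alternatives: the commented crontab line runs the script at minutes 0 and 30 of every hour, and the run_every function is one way to express the loop. The install path /usr/local/bin/IERR_alert.sh, the function name and its MAX_RUNS argument are hypothetical, introduced only for this example.

```shell
#!/bin/bash
# Option 1: a crontab entry (add it with `crontab -e`).
# The script path below is a placeholder.
#   */30 * * * * /usr/local/bin/IERR_alert.sh

# Option 2: a loop that reruns a command at a fixed interval.
# Usage: run_every INTERVAL_SECONDS MAX_RUNS COMMAND [ARGS...]
# MAX_RUNS=0 means "loop until interrupted".
run_every() {
    interval=$1
    max_runs=$2
    shift 2
    runs=0
    while [ "$max_runs" -eq 0 ] || [ "$runs" -lt "$max_runs" ]; do
        "$@"                    # run the command that was passed in
        runs=$((runs + 1))
        sleep "$interval"       # wait before the next check
    done
}

# Example: check every 30 minutes until interrupted.
#   run_every 1800 0 ./IERR_alert.sh
```

A cron job survives reboots and logouts, while a loop like this has to be kept alive by something else, such as a terminal session or a service manager.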

In this case, we went with the ‘watch’ command, which reruns the script at a fixed interval, here every 1800 seconds (30 minutes). Below, the script reports “No errors detected” because no IERR has been logged yet.

watch -n 1800 ./IERR_alert.sh

To reiterate, there are many ways to detect critical hardware errors. This method might not be ideal for everyone, but it did make for a fun blog topic.

 

*Note
In this example, the script is running on the OS of the server being monitored. This is not a good idea in production, since in many cases the OS will freeze before the script detects the IERR in the IPMI event log. In that scenario, you will not receive any alerts. If you use this method, it is best to run the script from a centralized monitoring server instead.
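A centralized check could look roughly like the sketch below, which reads each server's event log over the network with ipmitool's lanplus interface. The BMC addresses, the user name, the variable names and the check_remote_hosts function are placeholders chosen for this example; the password is assumed to be supplied through the IPMI_PASSWORD environment variable, which ipmitool reads when given -E.

```shell
#!/bin/bash
# Illustrative sketch: poll several BMCs from one monitoring server.

# Placeholder values; replace with your own BMC addresses and user.
BMC_HOSTS="${BMC_HOSTS:-192.0.2.11 192.0.2.12}"
BMC_USER="${BMC_USER:-ADMIN}"
IPMITOOL="${IPMITOOL:-ipmitool}"

check_remote_hosts() {
    for bmc in $BMC_HOSTS; do
        # -E makes ipmitool read the password from IPMI_PASSWORD.
        if ! sel_output=$($IPMITOOL -I lanplus -H "$bmc" -U "$BMC_USER" -E sel list); then
            echo "$bmc: could not read the IPMI event log"
        elif printf '%s\n' "$sel_output" | grep -q 'IERR'; then
            echo "$bmc: IERR detected"
        else
            echo "$bmc: No errors detected"
        fi
    done
}

# Example: export IPMI_PASSWORD first, then call check_remote_hosts
# and mail any "IERR detected" lines to your mailbox.
```

Because the BMC runs independently of the host OS, this check can still read the event log after the monitored server has frozen.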