A team of scientists from Sandia National Laboratories and Boston University has developed an experimental algorithm that could automatically diagnose problems in supercomputers.

A range of internal and external problems can arise with these powerful machines. For instance, physical parts can break, or previously run programs can leave behind “zombie processes” that prevent the computer from functioning properly.

Furthermore, repairing these machines can take a long time, which is a problem in itself because supercomputers perform critical tasks such as forecasting the weather and ensuring that the U.S. nuclear arsenal is safe and reliable without the need for underground testing.

To develop the algorithm, the team took a multi-step approach.

First, the engineers compiled a suite of issues they had become familiar with while working on various supercomputers. They then wrote code to re-create these anomalies.

Two supercomputers, one at Sandia and the other a public cloud system that Boston University helps operate, ran a variety of programs with and without the anomaly code. A large number of data points were collected in the process, including how much energy, processor power, and memory each node used.

Next, this trove of information was fed into several machine learning algorithms, which learned to detect anomalies by comparing data from normal program runs with data from runs that included anomalies.

In addition, these algorithms were given further training to determine which one was best at diagnosing these problems.

One technique that stood out is called Random Forest. It was adept at analyzing vast quantities of monitoring data, identifying which metrics were important, and then determining whether the supercomputer was being affected by an anomaly.
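
To give a feel for how a Random Forest votes on a diagnosis, here is a toy sketch in plain Python. It is not the team's implementation: each "tree" is only a single split (a depth-one stump), whereas real random forests grow much deeper trees. The function names, the three feature columns, and all training values are hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_ids):
    """Pick the single (feature, threshold) split with the lowest weighted Gini."""
    best = None
    for f in feature_ids:
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no usable split: always predict the majority class
        m = majority(labels)
        return (None, 0.0, m, m)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each seeing a random feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))  # features considered per stump
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feats = rng.sample(range(n_features), k)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Each stump votes; the most common vote is the diagnosis."""
    votes = [left if f is None or row[f] <= t else right
             for f, t, left, right in forest]
    return majority(votes)

# Hypothetical per-run features: [memory mean, CPU mean, noisiness]
train_rows = [
    [0.30, 0.40, 0.05], [0.28, 0.42, 0.06], [0.35, 0.38, 0.04], [0.32, 0.41, 0.05],
    [0.90, 0.80, 0.40], [0.85, 0.82, 0.38], [0.95, 0.78, 0.42], [0.88, 0.85, 0.41],
]
train_labels = ["healthy"] * 4 + ["anomalous"] * 4
forest = fit_forest(train_rows, train_labels)
print(predict(forest, [0.31, 0.40, 0.05]))
print(predict(forest, [0.92, 0.81, 0.39]))
```

Because every stump sees a different random slice of the data and the metrics, no single noisy measurement decides the outcome, which is part of why this family of methods copes well with large volumes of monitoring data.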

Ultimately, the analysis was further streamlined by calculating various statistics for each metric, including simple values such as the average, fifth percentile, and 95th percentile, along with more complex indicators such as noisiness, trends over time, and symmetry, all of which help reveal abnormal behavior.
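
The kind of per-metric summary described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how the team defined noisiness, trend, or symmetry, so the sketch assumes mean absolute change between consecutive samples, a least-squares slope, and skewness, respectively. The function name `summarize` and the sample values are hypothetical.

```python
import statistics

def summarize(samples):
    """Reduce one metric's time series (at least two samples) to summary features."""
    n = len(samples)
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)

    # 19 cut points at 5% steps: the first is the 5th percentile, the last the 95th.
    cuts = statistics.quantiles(samples, n=20, method="inclusive")
    p5, p95 = cuts[0], cuts[-1]

    # Symmetry (assumed here to mean skewness): 0 for a symmetric distribution.
    skew = 0.0 if std == 0 else sum(((x - mean) / std) ** 3 for x in samples) / n

    # Trend (assumed: least-squares slope of the value against the sample index).
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (x - mean) for t, x in enumerate(samples)) / denom

    # Noisiness (assumed: mean absolute change between consecutive samples).
    noise = statistics.fmean([abs(b - a) for a, b in zip(samples, samples[1:])])

    return {"mean": mean, "p5": p5, "p95": p95,
            "skew": skew, "trend": slope, "noise": noise}

# Hypothetical memory-usage readings that climb steadily, as a leak might.
features = summarize([float(v) for v in range(21)])
print(features)
```

Condensing each metric into a handful of numbers like these means the classifier works on a short feature vector per run instead of the raw monitoring stream.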

The end result was a trained machine learning program that could use less than one percent of the system’s processing power to analyze data and find these anomalies.

Future work on this prototype would involve further testing with artificial anomalies, as well as finding ways to validate the diagnostics and gauge how well they detect real anomalies during normal runs on these supercomputers.

The research team’s paper was published in the journal High Performance Computing.