The race for the best supercomputer is about more than just power and processing speed. Energy efficiency is also a critical element that should be a major consideration in high performance computing (HPC), said Natalie Bates, one of the leaders of the Energy Efficient High Performance Computing Working Group (EE HPC WG).

“Energy costs, operations and capital, are an increasing portion of HPC total cost of ownership,” said Bates in an interview with R&D Magazine. “That’s what drove the focus on Power Usage Effectiveness (PUE) in commercial data centers, and the same dynamics apply as HPC systems have gotten larger. Any HPC center with a PUE over 1.1 or 1.2 is leaving money on the table that could be put to use more productively.”
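For readers unfamiliar with the metric, PUE is the ratio of total facility energy to the energy delivered to the IT equipment, so a PUE of 1.0 would mean no overhead at all. The short Python sketch below illustrates how a PUE gap turns into money. The 10 MW IT load, the $0.07 per kWh price, and the function names are illustrative assumptions, not figures from Bates or the EE HPC WG.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def annual_overhead_cost(it_load_mw: float, pue_value: float, price_per_kwh: float) -> float:
    """Yearly cost of the non-IT overhead (cooling, power conversion, and so on)."""
    hours_per_year = 8760
    it_kwh = it_load_mw * 1000 * hours_per_year
    overhead_kwh = it_kwh * (pue_value - 1.0)
    return overhead_kwh * price_per_kwh


# Illustrative assumptions only: a 10 MW IT load and $0.07 per kWh.
IT_LOAD_MW = 10.0
PRICE_PER_KWH = 0.07

savings = (annual_overhead_cost(IT_LOAD_MW, 1.4, PRICE_PER_KWH)
           - annual_overhead_cost(IT_LOAD_MW, 1.1, PRICE_PER_KWH))
print(f"Yearly savings from moving PUE 1.4 -> 1.1: ${savings:,.0f}")
```

Under these assumed numbers the difference is on the order of millions of dollars per year, which is the kind of "money on the table" the quote refers to.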

The EE HPC WG was created in 2009 by a small group of U.S. Department of Energy national laboratories, and today includes 700 members from 25 countries. The group’s goal is to drive implementation of energy-efficient design in HPC by encouraging the HPC community to lead in energy efficiency as it does in computing performance, developing and disseminating best practices for maximizing energy efficiency in HPC facilities and systems, and serving as a forum for sharing information and taking collective action.

In an interview with R&D Magazine, Bates explained more about the importance of ‘green’ HPC and the steps that need to be taken to get there.

R&D Magazine: Why is the initiative to improve energy efficiency in HPC/supercomputing important?

Bates: Moore’s law predicts exponential increases in circuit density over time. This has two consequences: hardware costs less, so more circuits can be grouped to solve equations, but energy and waste heat densities grow exponentially too. Classical physical scaling effects and the impact of leakage current scaling are making the powering of these massively dense machines, and the extraction of their waste heat, dominant concerns for high performance computing (HPC) designers and operators.

As these machines have connected to the electrical grid, new problems have become obvious. These are unique loads: 10-17 megawatts (MW) today and up to 50 MW in the near future, with loads of 10 MW or more switching on and off in milliseconds. We have to design and operate these power behemoths to fit into an electrical grid that isn’t designed for them.

R&D Magazine: What are some best practices that can be taken to improve the energy footprint in HPC/supercomputing?

Bates: First, make energy efficiency a cross-organizational focus from strategic planning to operational execution.  Design teams have to include facilities engineers at early stages.

Second, focus on improving PUE with facility-based improvements. These can yield large results with established practices like reducing compressor-based cooling and minimizing power conversions.

Third, implement measurement and monitoring capabilities for power, energy and related data. These are absolutely necessary for continuous improvement of the energy footprint.

Fourth, leverage the vendor community. HPC system improvements have been very strong, as evidenced by improving FLOPS per watt performance on the Green500 List, but making energy efficiency a consideration during the procurement process is critical.

(Note: FLOPS is an acronym for floating point operations per second, which is a measure of computer performance.)

Beyond that, there are huge improvements to be realized in developing capabilities to optimize applications for both time and energy to solution.

R&D Magazine: What are some design changes that can be made to future HPCs or supercomputers to make them more environmentally friendly?

Bates: If you can't measure it, you can't improve it. We are at the point where measurement capabilities are allowing us to see a wealth of information and identify the tall poles in the tent. We have made major gains in improving the energy efficiency of the facility and computing hardware, but large gains can still be made with software, particularly application software.  

We are currently working on two major power management design initiatives: power capping facilities and the ability to manage energy on a node dynamically. 

Power capping is key for many HPC centers, particularly for those hosting the world’s largest supercomputers. By carefully managing power, sites can stay within the contractually agreed ranges with their electricity providers. They can also use power management to optimize both the capital and operational costs of cooling and electricity infrastructure.
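As a rough illustration of the idea, the Python sketch below scales per-node power requests so that their sum stays within a site-wide limit. The function name `allocate_power_caps`, the proportional scaling policy, and all wattages are hypothetical assumptions chosen for clarity; they do not describe any particular site's or vendor's power capping mechanism.

```python
def allocate_power_caps(demands_w: list[float], site_cap_w: float) -> list[float]:
    """Scale per-node power requests so their sum never exceeds the site cap."""
    if site_cap_w < 0 or any(d < 0 for d in demands_w):
        raise ValueError("power values must be non-negative")
    total = sum(demands_w)
    if total <= site_cap_w:
        return list(demands_w)  # everything fits, so no capping is needed
    scale = site_cap_w / total  # shrink every request by the same factor
    return [d * scale for d in demands_w]


# Hypothetical example: four nodes request 1,200 W in total under a 900 W cap.
caps = allocate_power_caps([400.0, 300.0, 300.0, 200.0], 900.0)
print([round(c, 1) for c in caps])
```

Proportional scaling is only one possible policy. A production system would more plausibly weigh job priority and per-node minimum power as well, which this sketch deliberately leaves out.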

Another key feature needed for improved efficiency is power/energy state management on individual nodes, which can be very useful for application-level tuning.  Having a baseline for design needs in supercomputers on this front is essential, so that energy-saving techniques that are developed can be applied to any system architecture.

R&D Magazine: What are the barriers to implementing these design changes/best practices?

Bates: The barriers to creating energy-efficient HPC are the scaling limitations of the silicon-based technologies that are used for fabricating the components of HPC systems. Quantum computing, optical computing and biomolecular computing are just some of the many alternative methods of computing that could be much more energy efficient than today’s silicon-based computing. Unfortunately, all of these methods are in the early stages of research, and although they are important to pursue, none of them can be counted on to help create more energy-efficient HPC in the near future.

A more immediate opportunity is what Barroso and Hölzle from Google call “energy-proportional computing”; that is, using resources in a balanced way so that all the energy consumed is optimally doing useful work. One simple and relevant way of describing this opportunity is to think about reducing idle power, which has been decreasing as a percentage of peak power, but is still quite high.
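To make the idle power point concrete, the Python sketch below uses a simple linear power model, in which a node draws its idle power plus a share of its remaining range in proportion to utilization. The linear model, the function names, and the wattages are assumptions made for illustration; real nodes do not follow a straight line exactly.

```python
def node_power_w(utilization: float, idle_w: float, peak_w: float) -> float:
    """Linear power model: idle power plus a share of the dynamic range."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_w + (peak_w - idle_w) * utilization


def relative_efficiency(utilization: float, idle_w: float, peak_w: float) -> float:
    """Work per watt relative to a perfectly energy-proportional node (1.0 = ideal)."""
    if utilization <= 0.0:
        raise ValueError("utilization must be greater than 0")
    ideal_w = peak_w * utilization  # what a perfectly proportional node would draw
    return ideal_w / node_power_w(utilization, idle_w, peak_w)


# Hypothetical node with a 500 W peak, compared at 30% utilization.
for idle_w in (250.0, 50.0):
    eff = relative_efficiency(0.3, idle_w, 500.0)
    print(f"idle = {idle_w:.0f} W -> relative efficiency at 30% load: {eff:.2f}")
```

In this toy model, cutting idle power from half of peak to a tenth of peak raises efficiency at partial load substantially, which is why idle power is a convenient proxy for energy proportionality.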

There is also the question of site location: can we leverage the climate advantages and renewable electricity sources of locations like Iceland when siting our systems?

Creating more energy efficient HPC is a continuous improvement process that requires the right tools for measuring, taking action, checking the results and iterating in a virtuous cycle.  We have made a lot of progress with the measurement tools, and now we need to learn to use them.

There is also a lack of standardization and metrics for energy efficiency. Depending on the target group, the expectations and goals differ. Should we care about FLOPS per watt, cost per watt, energy-delay product, not exceeding a power budget, science accomplished per watt, or making good use of allocated power?
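The Python sketch below shows why the choice of metric matters, using two of the measures named above plus energy to solution. The energy-delay product is taken here in its usual form, energy multiplied by runtime. The two runs and their power and runtime figures are invented for illustration.

```python
def flops_per_watt(sustained_flops: float, avg_power_w: float) -> float:
    """Sustained floating point operations per second, per watt drawn."""
    return sustained_flops / avg_power_w


def energy_to_solution_j(avg_power_w: float, runtime_s: float) -> float:
    """Total energy consumed by a job, in joules."""
    return avg_power_w * runtime_s


def energy_delay_product(avg_power_w: float, runtime_s: float) -> float:
    """Energy multiplied by runtime; penalizes slow runs more than energy alone."""
    return energy_to_solution_j(avg_power_w, runtime_s) * runtime_s


# Two hypothetical runs of the same job: (average power in watts, runtime in seconds).
runs = {"fast": (200_000.0, 1000.0), "frugal": (150_000.0, 1200.0)}

for name, (power_w, runtime_s) in runs.items():
    energy = energy_to_solution_j(power_w, runtime_s)
    edp = energy_delay_product(power_w, runtime_s)
    print(f"{name}: energy = {energy:.3g} J, EDP = {edp:.3g} J*s")
```

With these made-up numbers the "frugal" run uses less energy, while the "fast" run has the lower energy-delay product, so the two metrics rank the same pair of runs differently.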

R&D Magazine: As technology advances do you see the energy efficiency challenge with HPC/supercomputers getting easier? Harder?

Bates: We’ve been making general-purpose HPC system designs at high cost and development effort. John Hennessy, in his recent Turing Award lecture, pointed out that we need to get back to making ever more specialized accelerators, well beyond graphics processing units (GPUs), in order to continue to extract performance from computing hardware, and that we need to bring down their development cost. We also must develop programming environments that substantially reduce the programming barriers to using such specialized GPUs.

The next generation of systems being designed today will consume 30-50 MW (vs. the 5-15 MW systems we run today). Energy efficiency and power management will be key to reducing operating costs. A major focus will be selecting the right hardware architecture to provide the most energy-efficient computing for different applications.

R&D Magazine: Is more education needed on this topic?

Bates: Yes. HPC is one leg of the stool that might save the Earth for human habitation, or it might be the key to colonizing another planet. It has never been more central to our lives and future.

R&D Magazine: What are the goals of the Energy Efficient High Performance Computing Working Group?

Bates: It is a new way to multiply the work of technologists working separately, and is an accelerator for the traditional journal/conference model. I think it can be used in many other technological hotspots. The secret to success is the telephone-team that can work on finite tasks in real time.  It allows for very effective participation; there are countless hours of volunteer time devoted to the public good.  

This interview has been edited for length and clarity.