Performance improvements and energy efficiency for HPC data centers

Scientific Computing magazine’s latest Q&A article addresses a topic of central importance to anyone who owns or operates a data center: how to improve performance and cut energy costs.

The magazine asked seven leaders in the HPC industry for their opinions on the subject. Their answers are telling: they reveal the state of data center technology today and the directions it will likely take in the near future.

Reducing the cost of power consumption

Perhaps the greatest expenditure facing data center operators today is the electric bill they pay to keep their servers running. While HPC (high-performance computing) servers have become cheaper relative to their performance in recent years thanks to innovations such as GPUs (graphics processing units), the price of powering ever-expanding data centers has only grown.

There are several bottlenecks contributing to these higher costs. The problem most commonly mentioned by the experts in the Q&A article is the cost of cooling server rooms. As Bob Masson of Convey Computer noted, “Every watt required to power a server nominally requires a watt to cool it.”
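To put Masson’s rule of thumb in concrete terms, here is a quick back-of-the-envelope calculation in Python (all of the figures are my own illustrative assumptions, not numbers from the article):

```python
# Rough facility-power estimate based on Masson's rule of thumb:
# every watt of server power needs roughly a watt of cooling.
# All figures below are illustrative assumptions, not from the Q&A.

it_load_kw = 500.0             # assumed IT (server) load in kilowatts
cooling_kw = it_load_kw * 1.0  # ~1 W of cooling per 1 W of compute
total_kw = it_load_kw + cooling_kw

hours_per_year = 24 * 365
price_per_kwh = 0.10           # assumed electricity price in $/kWh

annual_cost = total_kw * hours_per_year * price_per_kwh
print(f"Total draw: {total_kw:.0f} kW")
print(f"Annual energy cost: ${annual_cost:,.0f}")
```

At these assumed rates, half of an $876,000 annual electric bill goes to cooling alone, which is why the experts keep returning to this particular bottleneck.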

For this reason, Andrew Jones, VP of HPC Consulting at NAG, recommends that data center operators perform a long-run analysis before they buy new equipment. The price of a server isn’t just what you pay on the day you buy it; it also includes what it will take to power and cool that server over its lifetime (and if you’re buying dozens or hundreds of servers, this adds up quickly). That is the true cost of a server, and from this perspective, power and cooling specs become very important.
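A minimal sketch of the kind of long-run analysis Jones has in mind might look like this (the prices, wattages, and lifetime are assumptions I’ve made up for illustration):

```python
# Minimal total-cost-of-ownership sketch along the lines Jones suggests.
# Every number here is an assumption for illustration.

def server_tco(purchase_price, watts, years=4, price_per_kwh=0.10,
               cooling_factor=1.0):
    """Purchase price plus energy (power + cooling) over the server's life."""
    kwh_per_year = watts / 1000.0 * 24 * 365
    energy_cost = kwh_per_year * years * price_per_kwh * (1 + cooling_factor)
    return purchase_price + energy_cost

# A cheaper but power-hungry server vs. a pricier, efficient one:
print(server_tco(purchase_price=5000, watts=600))   # ~$9,200 over 4 years
print(server_tco(purchase_price=6500, watts=350))   # ~$8,950 over 4 years
```

Under these assumptions, the server that costs $1,500 more up front is actually the cheaper machine over four years, which is exactly the kind of result a sticker-price comparison would miss.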

This kind of analysis is useful not only before buying a server but also for evaluating an existing data center. Old equipment, as Blake Gonzalez of Dell HPC describes, can be so expensive to run due to power inefficiencies that some companies can save money by replacing it with entirely new servers. But before any such decision can be made, Gonzalez notes, one has to be able to measure current power consumption accurately and estimate the long-run cost of replacement servers before buying them.
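Gonzalez’s replace-or-keep question reduces to a payback calculation once those measurements are in hand. Here is a hedged sketch, with hypothetical measured and quoted figures:

```python
# Replace-or-keep sketch in the spirit of Gonzalez's point: measure what the
# old gear actually draws, then see how fast new gear would pay for itself.
# The measured and quoted figures below are hypothetical.

old_watts = 900          # measured draw of an aging server (assumed)
new_watts = 400          # vendor-quoted draw of a replacement (assumed)
new_price = 7000.0       # replacement purchase price (assumed)
price_per_kwh = 0.10
cooling_factor = 1.0     # a watt of cooling per watt of compute

def annual_energy_cost(watts):
    return watts / 1000.0 * 24 * 365 * price_per_kwh * (1 + cooling_factor)

savings_per_year = annual_energy_cost(old_watts) - annual_energy_cost(new_watts)
payback_years = new_price / savings_per_year
print(f"Annual savings: ${savings_per_year:,.0f}")   # ~$876/yr
print(f"Payback period: {payback_years:.1f} years")  # ~8 years: maybe keep it
```

An eight-year payback would argue for keeping the old gear; with a bigger efficiency gap or pricier electricity, the same arithmetic can just as easily argue for replacement.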

But there are other issues as well. Steve Scott of Cray describes how converting AC power to the DC power that most server electronics use wastes energy in the process. By developing technology that makes these conversions more efficient, data centers will be able to save money on electricity.
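The scale of those conversion losses is easy to estimate. A rough illustration (the efficiency figures are assumptions on my part, not numbers from Scott):

```python
# Illustration of Scott's point about AC-to-DC conversion losses.
# The efficiency figures are assumed for the sake of the example.

it_load_kw = 500.0  # power the servers actually need, in kilowatts

for efficiency in (0.80, 0.90, 0.95):
    drawn_kw = it_load_kw / efficiency   # power pulled from the grid
    wasted_kw = drawn_kw - it_load_kw    # lost as heat in conversion
    print(f"{efficiency:.0%} efficient conversion wastes {wasted_kw:.0f} kW")
```

And every wasted kilowatt becomes heat that must itself be cooled, so by Masson’s rule of thumb the real penalty is roughly double.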

Processors have also become more efficient in the past few years through innovations such as combining several processing cores into one CPU (the Opteron 6100 series, for instance, features a 12-core processor). The current focus in processor design, though, is how to better divide up the tasks that processors are called on to perform.

GPUs (graphics processing units) are a great example of this. By offloading the myriad simple, repetitive computing tasks to the GPU, the CPU is freed to undertake the more complicated calculations. The resulting increase in performance can save data centers money by requiring fewer servers than before to accomplish the same tasks.
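For a feel of the pattern, here is a hedged sketch using CuPy as a stand-in GPU library (the article names no specific tools; this assumes an NVIDIA GPU with the cupy package installed):

```python
# Sketch of the CPU/GPU division of labor described above, using CuPy as a
# stand-in GPU library (my assumption; the article names no tools).
import numpy as np
import cupy as cp  # requires an NVIDIA GPU and the cupy package

data = np.random.rand(10_000_000)

# Offload the simple, repetitive arithmetic to the GPU...
gpu_data = cp.asarray(data)
gpu_result = cp.sqrt(gpu_data) * 2.0 + 1.0
partial = float(gpu_result.sum())  # bring a small result back to the host

# ...leaving the CPU free for the complicated, branch-heavy work.
def complicated_cpu_step(x):
    return x / len(data)

print(complicated_cpu_step(partial))
```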

Scott and Masson both discuss the need not only to develop more powerful CPUs, but also to use them in smarter ways. A server cluster, for instance, can contain different types of processors that specialize in certain types of tasks. This approach has been adopted by companies that have built systems with Intel Atom processors, which handle high volumes of simple tasks at a lower cost than higher-end Intel CPUs.

Diversifying the processor types in a cluster, and programming applications to divide tasks intelligently among them, can further increase data center performance in the future.
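A toy scheduler makes the idea concrete. Everything here, from the node names to the cost threshold, is hypothetical:

```python
# Toy sketch of routing work in a heterogeneous cluster: cheap, simple tasks
# go to low-power (Atom-class) nodes, heavy tasks to high-end CPU nodes.
# The node names and the cost threshold are hypothetical.

LOW_POWER_NODES = ["atom-01", "atom-02"]
HIGH_END_NODES = ["xeon-01"]

def route(task):
    """Pick a node pool based on an estimated cost for the task."""
    pool = LOW_POWER_NODES if task["estimated_flops"] < 1e9 else HIGH_END_NODES
    return pool[hash(task["name"]) % len(pool)]

tasks = [
    {"name": "log-scan", "estimated_flops": 1e6},
    {"name": "fluid-sim", "estimated_flops": 1e12},
]
for t in tasks:
    print(t["name"], "->", route(t))
```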

Using HPC applications to increase hardware efficiency

When the industry leaders were asked whether HPC applications need to be programmed with a focus on hardware efficiency, their opinions differed significantly. Michael Wolfe of The Portland Group argued that “decisions, such as what hardware components to power off are better made at the system level”. Applications, in other words, should not be responsible for how power is used by the server hardware.

Andrew Jones disagreed, saying that this type of programming is “a key opportunity” for reducing the cost of running a data center. An application can include several algorithms for the same job and decide which one is best to use based on the power available at different times of day. Bob Masson called such innovation in applications “mandatory” because hardware improvements, such as faster processors, “don’t address the fundamental issue – that is, for a given application, a given sequence of general-purpose instructions must be executed.” It’s more effective to fix how those instructions are carried out, Masson argues, than to change what hardware executes them.
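Jones’s idea of keeping several algorithms for one job might look like the following sketch; the function names, the power-budget API, and the threshold are all my inventions:

```python
# Sketch of Jones's idea: one job, several interchangeable algorithms,
# chosen by the power budget in effect when the job runs.
# All names and thresholds here are invented for illustration.

def fast_hungry_solver(values):
    """Stand-in for a fast but power-hungry implementation."""
    return sum(v * v for v in values)

def slow_frugal_solver(values):
    """Stand-in for a slower, thriftier implementation of the same job."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def solve(values, watts_available):
    """Pick the implementation that fits the current power budget."""
    if watts_available > 400:  # illustrative threshold
        return fast_hungry_solver(values)
    return slow_frugal_solver(values)

print(solve([1.0, 2.0, 3.0], watts_available=250))  # frugal path -> 14.0
```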

The experts’ opinions on this subject cleaved into two differing schools of thought, but it is clear that some kind of application-level optimization is necessary to lower power consumption, whether by directly regulating which hardware runs which tasks or by indirectly improving the efficiency of the application itself. Steve Scott offered the most concrete guidance on how programmers can take advantage of the latest innovations in parallel computing and heterogeneous-processor clusters:

The challenge for HPC application developers will be to map their applications onto the new generation of heterogeneous processors with exposed memory hierarchies. This must be done in a way that is portable across machine types and forward in time. We should not ask users to write their codes to map onto a particular machine. Rather, they should write their code such that parallelism is exposed to the system software at as high a level as possible. Within each node of a distributed application (e.g.: one using MPI), the code should be written so that loops at the highest level in the call tree can be safely performed in parallel. Programmers should also think about locality, meaning that computations can work on data “near” them, and re-use data multiple times once it is referenced from memory. Sophisticated system software can then map the problem onto the low-level hardware in a way to exploit maximal parallelism and reduce data movement.
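Translated into a small (and admittedly simplified) Python example, Scott’s advice amounts to parallelizing at the top of the call tree and keeping each worker’s data local and reused. The blocking scheme below is my own illustration, not code from the article:

```python
# A small rendition of Scott's advice: expose parallelism at the highest
# level of the call tree, and keep each worker's data local and reused.
from multiprocessing import Pool
import numpy as np

def process_block(block):
    # All work happens on one contiguous block: good locality, and the
    # block is reused for several operations once read from memory.
    return float(np.sum(block * block) + np.sum(np.sqrt(np.abs(block))))

if __name__ == "__main__":
    data = np.random.rand(8_000_000)
    blocks = np.array_split(data, 8)   # top-level, independent chunks
    with Pool() as pool:               # parallel loop at the highest level
        partials = pool.map(process_block, blocks)
    print(sum(partials))
```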

Green and efficient technology

When asked about emerging “green” technologies for data centers, the Q&A respondents summarized the major innovations that the industry can expect in the next few years.

They mentioned a few bottlenecks that are holding back the efficiency of today’s servers: cooling systems, processors, memory, and power delivery systems.

Steve Kinney of IBM notes that standard raised-floor server racks are being replaced in many data centers with open-air cooling systems, which use cooler outdoor air (these data centers are usually located in cooler climates) to regulate the temperature of the machines. Blake Gonzalez agreed that more and more data centers will be equipped with such cooling technologies.
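The control logic behind such “free cooling” can be surprisingly simple. A toy sketch, with threshold temperatures I’ve assumed for illustration:

```python
# Toy control rule for the open-air cooling Kinney describes: use outside
# air when it's cold enough, fall back to chillers otherwise.
# The temperature thresholds are assumptions, not from the article.

def cooling_mode(outdoor_temp_c, supply_setpoint_c=22.0, margin_c=4.0):
    """Return which cooling source a simple economizer controller picks."""
    if outdoor_temp_c <= supply_setpoint_c - margin_c:
        return "outside-air"   # free cooling: just move air, no chillers
    return "mechanical"        # too warm outside; run the chillers

for temp in (5.0, 15.0, 25.0):
    print(f"{temp:>4.1f} C outdoors -> {cooling_mode(temp)}")
```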

Memory and power delivery systems need to improve as well, as these industries appear (to me, at least) to innovate more slowly than, say, the processor industry. But reading the opinions of the experts in the Q&A, I get the sense that server hardware as it exists today is reaching a plateau in performance capability. Bob Masson says, “Some form of heterogeneous computing is inevitable. The laws of physics dictate that processor clock rates cannot substantially increase (because it is physically impossible to dissipate the heat generated beyond a certain point).” There is a limit, in other words, to how far we can push silicon to create ever more powerful machines. We have to innovate elsewhere.
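The physics Masson alludes to can be made concrete with the textbook CMOS dynamic-power approximation, P ≈ C·V²·f (my addition, not a formula from the article). Doubling the clock typically also requires raising the voltage, so power, and therefore heat, grows far faster than performance:

```python
# Classic CMOS dynamic-power approximation, P ~ C * V^2 * f, showing why
# clock rates hit a heat wall. The values are illustrative, not measured.

def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    return capacitance_f * voltage_v**2 * frequency_hz

base = dynamic_power(1e-9, 1.0, 3e9)     # a nominal 3 GHz part
doubled = dynamic_power(1e-9, 1.2, 6e9)  # 2x clock usually needs more voltage

print(f"Power grows {doubled / base:.1f}x for a 2x clock bump")  # ~2.9x
```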

As Masson suggests, the greatest breakthroughs in the next few years will come from improving how we use current hardware, without necessarily making major changes to the hardware itself. Heterogeneous computing, in which systems and clusters combine different kinds of computational hardware (different processors, GPUs, and so on), “is arguably the next green technology,” according to Masson. He goes on:

No other technique or technology exists that can provide hundreds or thousands of times improvement in performance/watt for HPC applications [. . .] Performance must be obtained by re-applying a given number of transistors (and hence a given number of watts) to be more efficient. What remains is to make heterogeneous computing usable and available to “the masses” without undue reprogramming or drastic changes to the application development ecosystem.

As new technologies emerge, it’s important that they integrate easily into existing data centers. They must also be affordable enough to be worth the performance boost. Many data centers, if they quantify their power expenditures over time, will find that even the more expensive innovations (like solid-state drives or new cooling systems) can pay for themselves in the long run. But the hardware vendors bringing out the cutting-edge technology described by the industry leaders in the Q&A article still need to make sure their wares are affordable enough for widespread adoption.
