Chaos that Brings Order
2014 R&D 100 Winner
Large computing systems, including supercomputers, cloud clusters and server farms, are built from thousands of hybrid nodes, each with multicore central processing units (CPUs) and, in many cases, graphics processing unit (GPU) accelerators. These systems experience frequent hardware component faults, and ensuring accurate, error-free computations on them has become an unexpectedly major challenge, particularly as progress is made toward exascale systems.
Oak Ridge National Laboratory’s DUCCS is ultra-efficient software that uses highly parallel chaotic map computations to detect, within a few minutes, component faults in the computing units, memory elements and interconnects of hybrid CPU-GPU computing systems. This transportable software is based on an original design that combines chaotic map theory from physics and mathematics with the advanced programming software of CPU-GPU systems. Information about detected faults can be used to work around or replace faulty parts, and to make applications resilient by supporting checkpoint recovery and migration to fault-free zones.
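The idea of using chaotic maps for fault detection can be illustrated with a small sketch. It is not the DUCCS implementation, which is not described here. It assumes the well-known logistic map as the chaotic map, simulates each computing unit as an ordinary function call, and injects a hypothetical one-time error of 1e-12 into one unit. Because chaotic maps amplify tiny differences exponentially, a single small error produces a final state that no longer matches the healthy units, so a simple comparison of final states is enough to flag the suspect unit. All function and parameter names below are illustrative.

```python
from collections import Counter


def logistic_trajectory(x0, steps, r=4.0, fault_at=None, fault_size=1e-12):
    """Iterate the logistic map x -> r*x*(1-x) and return the final state.

    fault_at and fault_size are hypothetical knobs that simulate a
    one-time hardware error at a given iteration.
    """
    x = x0
    for i in range(steps):
        x = r * x * (1.0 - x)
        if i == fault_at:
            x += fault_size  # simulated fault: a tiny perturbation
    return x


def find_faulty_units(results):
    """Return indices of units whose final state differs from the majority."""
    reference, _ = Counter(results).most_common(1)[0]
    return [i for i, value in enumerate(results) if value != reference]


def run_demo(num_units=4, faulty_unit=2, steps=200, x0=0.3):
    """Run the same computation on every simulated unit, one of them faulty."""
    results = [
        logistic_trajectory(x0, steps, fault_at=10 if u == faulty_unit else None)
        for u in range(num_units)
    ]
    return find_faulty_units(results)


if __name__ == "__main__":
    print("Suspect units:", run_demo())
```

In this simulation the healthy units produce bit-identical results because they run the same deterministic arithmetic, and the perturbed unit is expected to be the only one flagged. On real hardware the identical trajectories would run in parallel on separate cores, memory regions or links, and a mismatch would point to the component that handled the diverging copy.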
Oak Ridge National Laboratory
Oak Ridge National Laboratory's DUCCS development team: Nageswara S. Rao.
The DUCCS Development Team from Oak Ridge National Laboratory
Nageswara S. Rao, Principal Developer