Los Alamos National Laboratory recently formed the Efficient Mission Centric Computing Consortium (EMC3) to investigate ultra-scale computing architectures, systems and environments that can achieve higher efficiencies in extreme-scale mission-centric computing.
“We are excited about EMC3 and seek partnerships with high performance computing (HPC) technology providers and consumers that are interested in pushing the HPC efficiency and productivity envelope,” said Gary Grider, leader of the High Performance Computing division at Los Alamos.
EMC3 will focus on the most demanding multi-physics applications involving largely unstructured/sparse problems that require a balance of compute, memory size, memory bandwidth, network and I/O. This will help HPC consumer organizations and HPC researchers/system developers to collaborate in addressing the challenging problem of higher-efficiency, extreme-scale, mission-centric computing.
DataDirect Networks (DDN) is the first to join the Laboratory as a collaborator in EMC3. DDN will co-fund computer science and mathematics students who will investigate the challenges of dealing with storage device failures in massive-scale, massively parallel use cases.
“We are happy to have our inaugural HPC technology provider partner, DDN, in EMC3. The area of management of massive data storage farms to provide efficient and reliable access to immense data is a fundamental capability that must scale with the rest of our environment,” Grider said. “We and other large storage sites are experiencing seemingly correlated failure events where we lose tens to hundreds of drives in a short period of time and we must understand this phenomenon and find an efficient and economical way to survive and thrive in this environment.”
Typical storage device failure modeling considers only random, uncorrelated device failures, with failure rates guided by bathtub curves. However, massive storage deployments that use storage devices in massively parallel ways also experience both spatially correlated and non-spatially correlated device failures.
An arc flash in a rack that takes out many storage devices at once is an example of spatially correlated failure. Non-spatially correlated failure could occur if a set of disks was deployed in a particular manner.
Additionally, the approach to protection schemes like erasure coding differs radically for each of these failure scenarios. A model is desperately needed to understand and explore this rapidly approaching issue.
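To make the contrast between failure scenarios concrete, the following is a minimal sketch, not the model EMC3 intends to build. It assumes a hypothetical layout of 10 racks with 20 drives each and an 8+2 erasure code (stripes of 10 drives that tolerate 2 failures). All function names and parameters are illustrative assumptions. It compares two stripe placements against a spatially correlated event that takes out a whole rack, and also provides the independent random failure case that typical models assume.

```python
import random

def make_stripes(racks, drives_per_rack, width, spread):
    """Build stripes of `width` drives, identified as (rack, slot) pairs.

    spread=True  : each stripe takes one drive from each of the first `width` racks.
    spread=False : each stripe uses `width` consecutive drives inside a single rack.
    """
    stripes = []
    if spread:
        for slot in range(drives_per_rack):
            stripes.append([(r, slot) for r in range(width)])
    else:
        for r in range(racks):
            for start in range(0, drives_per_rack - width + 1, width):
                stripes.append([(r, start + i) for i in range(width)])
    return stripes

def stripes_lost(failed, stripes, parity):
    """Count stripes with more failed drives than the code can tolerate."""
    return sum(1 for s in stripes if sum(d in failed for d in s) > parity)

def independent_failures(racks, drives_per_rack, p, rng):
    """Uncorrelated model: every drive fails independently with probability p."""
    return {(r, d) for r in range(racks) for d in range(drives_per_rack)
            if rng.random() < p}

def rack_event(rack, drives_per_rack):
    """Spatially correlated model: every drive in one rack fails together."""
    return {(rack, d) for d in range(drives_per_rack)}

# Hypothetical configuration: 10 racks x 20 drives, 8 data + 2 parity per stripe.
RACKS, DRIVES, WIDTH, PARITY = 10, 20, 10, 2
spread = make_stripes(RACKS, DRIVES, WIDTH, spread=True)
in_rack = make_stripes(RACKS, DRIVES, WIDTH, spread=False)

event = rack_event(0, DRIVES)
print("rack event, spread placement :", stripes_lost(event, spread, PARITY))   # 0
print("rack event, in-rack placement:", stripes_lost(event, in_rack, PARITY))  # 2

# Independent failures at an assumed 1% rate; the result depends on the seed.
rng = random.Random(0)
random_failed = independent_failures(RACKS, DRIVES, 0.01, rng)
print("random failures, in-rack placement:", stripes_lost(random_failed, in_rack, PARITY))
```

In this toy setup, spreading each stripe across racks survives the loss of an entire rack, while stripes confined to one rack do not. A non-spatially correlated failure would need a different grouping of drives than the rack, which is one reason the protection scheme depends on which failure scenario is being modeled.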
EMC3 will continue to employ and develop tools, resources, and skill sets to investigate ultra-scale computing architectures, systems, and environments that provide the appropriate balance for the most demanding applications. This will accelerate progress on efficiency for U.S. industry bread-and-butter applications and for national security computing.
All U.S. HPC industry base consumers and national and international HPC component and system developers are encouraged to join EMC3.
Read more about Los Alamos National Laboratory’s EMC3 effort.
DataDirect Networks (DDN) has been the world’s leading big data storage supplier to data-intensive, global organizations for over 20 years. DDN designs, develops, deploys and optimizes systems, software and storage solutions that enable enterprises, service providers, universities and government agencies to generate more value and to accelerate time to insight from their data and information, on premises and in the cloud. Organizations leverage the power of DDN storage technology and the team’s deep technical expertise to capture, store, process, analyze, collaborate on and distribute data, information and content at the largest scale in the most efficient, reliable and cost-effective manner.
Los Alamos National Laboratory, a multidisciplinary research institution engaged in strategic science on behalf of national security, is managed by Triad, a public-service-oriented national security science organization equally owned by its three founding members: Battelle Memorial Institute (Battelle), the Texas A&M University System (TAMUS), and the Regents of the University of California (UC) for the Department of Energy’s National Nuclear Security Administration.
Los Alamos enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.