LOS ALAMOS, N.M., June 14, 2021—Through ongoing collaboration between Los Alamos National Laboratory and Hewlett Packard Enterprise (HPE), laboratory researchers are now able to use the power of cloud technologies to more efficiently conduct complex scientific research using high-performance computing applications. These technologies allow administrators to perform upgrades and maintenance to computing systems without interfering with critical ongoing work.
“By leveraging Linux software containers and container orchestration both in user space and for system management, the Laboratory’s latest institutional high-performance computing system, named Chicoma, is now providing hundreds of users with greater flexibility than was available on previous-generation systems,” said Gary Grider, Los Alamos’ HPC division leader.
Chicoma is one of the first systems deployed on the HPE Cray EX supercomputer, which uses HPE Cray System Management, a next-generation software stack with management capabilities and other related services. HPE Cray System Management minimizes downtime and allows admins to use continuous integration techniques to safely upgrade, patch, and refine systems without interrupting user productivity. When coupled with a cloud services model, it delivers better manageability, reliability, availability, and resiliency.
“A resilient and well-versioned management plane has the potential to almost eliminate system downtime for upgrades and administrative action,” said Alden Stradling, senior Chicoma admin.
HPE Cray System Management also allows for more flexibility in upgrades and more aggressive patching without visible user impact.
“Admins can now leverage modern cloud-enabled toolsets, which benefit from enormous investment and developer attention. Meanwhile, users see this like any other cluster, except with better admin response to feature requests and far less scheduled downtime,” added Stradling.
Chicoma is demonstrating the power of container technologies for supporting complex workflows and non-native software dependencies entirely in user space.
Using Charliecloud, a container runtime originally developed at Los Alamos, scientists have been able to deploy a complex bioinformatics toolchain for pathogen identification in metagenomic samples.
“We used a workflow manager called Cromwell to coordinate the execution of the containers and a Python script to automate sample processing,” said Mark Flynn, research scientist at Los Alamos. “Our Cromwell workflow manager used a Charliecloud MySQL database to keep track of each workflow. Without Charliecloud, it would not have been possible to deploy the Cromwell workflow manager without administrator assistance.”
Charliecloud is a fully unprivileged Linux container runtime. Users are able to install Charliecloud in their home directory, and then build and run containers without HPC staff intervention.
“Charliecloud empowers users to explore innovative solutions to challenging problems,” said Charliecloud co-founder Tim Randles, who works in the Laboratory’s HPC Design group. “Researchers now have full control over their runtime environment, allowing them to develop and deploy complex workflows using cutting-edge technologies that would have been difficult to support in a traditional HPC environment.”
The workflow manager and MySQL database ran on one compute node and spawned new Slurm jobs to run the bioinformatics tools used to process each sample. The Python automation script ran on another compute node and submitted samples to the workflow manager. The Python dependencies were installed using Miniconda.
“We were able to do everything we needed using Charliecloud containers running in Slurm jobs,” said Flynn.
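As a rough illustration of what running a Charliecloud container inside a Slurm job can look like, the Python sketch below builds an `sbatch` command line that wraps a `ch-run` invocation for one sample. It is not the team's actual workflow: it leaves out Cromwell and MySQL entirely, and the function name, sample path, image path, tool name `mytool`, and mount point `/mnt/0` are all hypothetical. It only constructs the command; nothing is submitted.

```python
import shlex
from pathlib import Path


def build_sbatch_command(sample, image_dir, tool_argv, data_dir="/mnt/0"):
    """Return the argv for one Slurm job that runs a tool in a container.

    Hypothetical helper for illustration only. `data_dir` is assumed to be
    a directory that already exists inside the container image.
    """
    sample = Path(sample)
    # `ch-run IMAGE -- CMD` runs CMD inside the image without privileges;
    # `-b SRC:DST` bind-mounts the sample's host directory at DST.
    container_argv = [
        "ch-run",
        "-b", f"{sample.parent}:{data_dir}",
        str(image_dir),
        "--",
        *tool_argv,
        f"{data_dir}/{sample.name}",
    ]
    # `sbatch --wrap=...` wraps a command string in a minimal batch script.
    return [
        "sbatch",
        f"--job-name={sample.stem}",
        "--nodes=1",
        f"--wrap={shlex.join(container_argv)}",
    ]


if __name__ == "__main__":
    # Assumed example values; `mytool` stands in for a bioinformatics tool.
    argv = build_sbatch_command(
        "/scratch/run1/sample01.fastq", "/var/tmp/img", ["mytool", "--threads", "4"]
    )
    print(shlex.join(argv))
```

On a system where Slurm and Charliecloud are installed, the returned list could be passed to `subprocess.run` to submit the job. Building the command as a list and quoting it with `shlex.join` keeps sample paths containing spaces from breaking the wrapped command.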
About Los Alamos National Laboratory
Los Alamos National Laboratory, a multidisciplinary research institution engaged in strategic science on behalf of national security, is managed by Triad, a public service oriented, national security science organization equally owned by its three founding members: Battelle Memorial Institute (Battelle), the Texas A&M University System (TAMUS), and the Regents of the University of California (UC) for the Department of Energy’s National Nuclear Security Administration.
Los Alamos enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.
LA-UR-21-24650