Chicoma supercomputer aids COVID-19 research and open science

75 new projects underway

March 15, 2022

Carolyn Connor, interim IC program manager, and Alden Stradling at the terminal with Cory Leuinghoener looking on. The supercomputer was named for Chicoma Mountain, the highest point in the Jemez Mountains, held sacred by many of the Puebloan peoples of New Mexico.

For the past year, scientists at Los Alamos National Laboratory have used the Laboratory’s newest and highest performing institutional supercomputer, Chicoma, to grapple with critical questions, delivering high-impact research results in support of the nation’s response to COVID-19. Initially funded by the DOE Office of Science, Chicoma has recently been upgraded with next-generation general-purpose graphics processing units, making it even more capable for expanded workloads in both data and computational science.

“Through the efforts of our High Performance Computing group and our research scientists, Chicoma has already contributed to understanding and combating COVID at the local, national and international level,” said Irene Qualters, associate Laboratory director for Simulation and Computing. “Now, as new research teams get underway, we look forward to stimulating additional technology innovation while enabling new scientific understanding through advances in computational modeling, simulation and machine learning.”

Beginning in December 2021, Laboratory scientists and engineers launched an impressive portfolio of 75 projects for 2022. Chosen through the Laboratory’s Institutional Computing (IC) Program’s annual competitive, peer-reviewed process, the research projects encompass applied machine learning, artificial intelligence, astrophysics, climate modeling, fluid dynamics, HIV, nanotechnology, wildfire simulation and national security science, to list but a few.

For example, one project, led by Gnana Gnanakaran, a biophysicist in the Laboratory’s Theoretical Division, will consider computational structure-based strategies for vaccine design. The approaches explored by the project could point the way to protection against the known range of diversity found in the coronaviruses, including SARS-CoV-1, SARS-CoV-2 and MERS.

“Ultimately, the research we’re able to carry out on Chicoma will help us prepare for future pandemics of coronavirus zoonotic infections,” said Gnanakaran.

The crown jewel of Institutional Computing

Chicoma is one of five high performance computing systems available to Los Alamos scientists and engineers (students included) through the Lab’s IC program for unclassified work. Each system is tailored to a specific class of workloads. However, Chicoma is unique in both the capabilities it offers and its forward-looking technical design.

“Chicoma is the first system specifically designed to offer multiple architectures within a single large platform,” Carolyn Connor, interim IC program manager, said. “Chicoma’s flexible compute environment, including large-scale platform, architectural diversity and node capability, make it valuable for a wide range of scientific problems of interest to the Laboratory.”

Chicoma is among the earliest deployments anywhere of Hewlett Packard Enterprise’s new HPE Cray EX supercomputer architecture for solving complex scientific problems. The HPE Cray EX offers a large-scale system architecture employing AMD processors, with a next-generation system software stack, direct-to-chip liquid cooling capabilities and a newly designed high-speed HPE Slingshot interconnect. In total, Chicoma offers more than 79,000 cores, 300 terabytes of system memory, and access to A100 Tensor Core graphics processing units (GPUs), making it by far the Lab’s highest performing institutional supercomputer.

Smoother upgrade experience, fewer workload disruptions

Chicoma also features innovative system administration and management software.

Chicoma is among the earliest large-scale production deployments using HPE Cray System Management (CSM), a more efficient and radically different system management model that enables transparent software updates, fixes and system refinements. That translates to higher system availability for IC’s open science user community.

“While the user environment will retain a familiar look and feel, administrators will be able to use continuous integration techniques to safely upgrade, patch and refine the system while user jobs are running, and roll out updates and fixes transparently,” said Sam Sanchez, Chicoma transition lead. “We plan to incorporate CSM in all future systems. The software and hardware environments, the configuration management, and how we provision the system, make CSM a major advancement in systems management.”

Stepping stone to future computing capability: Venado

Chicoma positions the Laboratory’s users, system designers and infrastructure for the next, more capable IC system, Venado.

To be deployed to early users in 2023 and generally available in early 2024, Venado is a codesign technology partnership between the Laboratory, Hewlett Packard Enterprise (HPE) and Nvidia. It will employ the same HPE Cray EX technology platform with Slingshot interconnect and Cray Programming Environment as Chicoma, but the processing power will increase significantly with Nvidia’s next-generation (ARM-based) Grace CPU and Nvidia ANext GPUs.

“Chicoma is a powerful resource and a stepping stone to future technologies aligned with our longer-term computing strategy,” said Qualters. “Chicoma provides Los Alamos scientists with state-of-the-art supercomputing capability and positions our researchers and applications to easily adopt the next generation of leading-edge computers.”

Real COVID-19 research performed on Chicoma out of the box

When the Lab completed the installation of Chicoma in 2020, the early user period was unlike the typical testing period for a new supercomputer.

“Typically, Los Alamos HPC early platform users, who include researchers, domain scientists, programmers and support staff, push the boundary of newly deployed, cutting-edge hardware and software,” said Ben Santos, team leader for High Performance Computing Consultants, HPC Workload Management and HPC Account Processing teams. “We work closely with vendors to stabilize the platform by running real applications and using real user workflows to verify the system stability and reliability.”

But with Chicoma, it went beyond those typical exercises. In addition to the usual expectations for early users and HPC staff when standing up a new computing system, Chicoma’s early users were laser-focused on producing impactful COVID-19 research immediately to help in the fight against the global pandemic.

“Our early users were thrilled with their experience, their access to fast computational resources with large amounts of memory, and the research results that Chicoma enabled,” said Lena Lopatina, staff scientist.

Tim Germann's research has helped local, state and national decision makers during the pandemic.

Early user projects included COVID-19 epidemiological modeling and forecasting, analyzing the impacts of COVID-19 on human 4D chromosome structures and functions, and exploring bioinformatics for COVID-19. Tim Germann, a researcher in the Laboratory’s Theoretical Division, ran large-scale computer simulations on Chicoma, resulting in a presentation at the American Physical Society’s March 2021 meeting. His modeling predicted the effects of social distancing relaxation and reinfection probability on the progression of the pandemic, as measured by the number of newly symptomatic cases per day for 2021.

Using Chicoma, Germann’s team also studied the effects of varying vaccination rates and vaccine efficacy against novel variants on the progression of the pandemic, again measured by the number of newly symptomatic cases per day for 2021. In April 2021, he presented the findings to the White House COVID Data Strategy and Execution Workgroup. The presentation offered various models of the pandemic’s progression, at both national and state levels, under scenarios that varied social distancing relaxation dates, vaccine efficacy against novel variants, vaccine availability and vaccine hesitancy.

Funding: The National Nuclear Security Administration (NNSA) supported infrastructure and personnel, and the Department of Energy’s Office of Science Advanced Scientific Computing Research (ASCR) program, under a line item of the 2020 Coronavirus Aid, Relief, and Economic Security (CARES) Act, provided for the initial installment of advanced computing hardware.