In the first week of June, a caravan of tractor trailers brought the components needed to begin the installation of Crossroads — the Los Alamos National Laboratory’s newest supercomputer — to the Strategic Computing Complex (SCC). Roughly two weeks later, with the last six compute cabinets and associated water-cooling units having arrived, all the hardware necessary for the assembly and installation of the system was in place.
Days later, an on-site team from Hewlett Packard Enterprise (HPE), the system provider, completed the connection of Crossroads to the Lab’s power and cooling distribution systems. With integration with this critical infrastructure complete, fiber-optic cables were then tied into the high-performance computing (HPC) network at the Laboratory.
“Crossroads is emblematic of the future,” said Irene Qualters, associate Laboratory director for Simulation and Computation. “With the introduction of a key codesign element, high-bandwidth memory, Crossroads will deliver four to eight times better performance than Trinity on our most challenging stockpile simulation codes. And we expect additional performance and fidelity gains on future systems.”
Why it matters
Robust supercomputing capabilities are vital to assessing the health of the nation’s nuclear weapons stockpile. Modern experiments generate very large data sets that must be compared with simulation predictions for scientists and engineers to make informed decisions about the nation’s nuclear deterrent mission. To analyze all this information, the Lab needs world-class computational models, platforms and visualization capabilities.
"Crossroads represents a significant advance in the nation’s ability to assess the safety and reliability of the stockpile," said Charlie Nakhleh, associate Laboratory director for Weapons Physics, “as well as modernizing the deterrent to meet a new national security landscape.”
Since 2015, the Trinity supercomputer has provided that ability to users from across the three National Nuclear Security Administration (NNSA) labs (Los Alamos, Sandia and Lawrence Livermore), and Crossroads is its successor.
Lab crews and HPE are currently running initial diagnostics on the entire Crossroads system, which is expected to be available to users at the three NNSA labs this fall.
About the system
As part of the computing strategy for the NNSA’s Advanced Simulation and Computing Program, Advanced Technology systems (ATS) are deployed to provide leading-edge simulation capability in support of nuclear weapon stockpile stewardship.
“Deploying a world-class supercomputer requires a symphony of expertise and a diverse army of skilled professionals coming together,” said Jim Lujan, Crossroads project director for the Lab. “From visionary planners and hardware engineers to software architects and networking experts, it’s a testament to the power of collaborative brilliance in shaping the future of computational possibilities.”
Each ATS supercomputer is assigned a sequential ATS number and is eventually christened with a unique name.
First, Los Alamos National Laboratory had Trinity (ATS-1). Sierra (ATS-2) was at Lawrence Livermore National Laboratory. Crossroads (ATS-3) is now at Los Alamos.
“Crossroads is the newest ATS platform, so right now it’s the star of the show,” said Amanda Bonnie, a Lab project manager for Crossroads.
Eventually, El Capitan (ATS-4) will be at Lawrence Livermore and a forthcoming ATS-5 — still unnamed — will be sited at Los Alamos.
Speed and efficiency aren’t the same thing
Because the simulations associated with stockpile stewardship are so demanding, the Advanced Simulation and Computing Program office requires that ATS machines be not only large and fast but, more importantly, equipped with memory size and memory access geared toward the specific needs of those simulations.
“Given the hoopla in the press about the ‘fastest computer in the world,’ one might think we should buy computers with the most FLOPS,” explained Gary Grider, leader of the Lab’s High Performance Computing division. “Every class of problem requires a different balance of FLOPS, memory size and memory access. For the problems we are working on, the time it takes to get a result is mainly determined by memory size and memory access, not FLOPS.” (FLOPS, or floating-point operations per second, measures how many floating-point calculations a computer can perform each second.)
This philosophy means the Lab is often looking at some of the bleeding-edge areas of the HPC market: a new network type, a new processor type or in Crossroads’ case, a new memory technology.
High-bandwidth memory brings memory directly to the processing chip and allows for quicker “talking” between the CPU and the memory. Many Lab codes are limited by memory bandwidth rather than by raw compute speed, which is why the HPC division is excited about the technology.
Early tests indicate that Crossroads can be expected to deliver a four- to eightfold improvement in overall efficiency over Trinity.
“It hardly ever happens in computing that you can move to a new system and see huge gains without changing the codes,” said Grider. “But the switch from Trinity to Crossroads will do just that.”
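To see why memory bandwidth, rather than peak FLOPS, can set the pace for an unchanged code, consider a roofline-style estimate, a common back-of-the-envelope model in HPC. The sketch below is illustrative only: the function name, the two node configurations and the two kernels are hypothetical, and none of the numbers are measurements of Trinity or Crossroads.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline estimate: the lower of the compute ceiling and the memory ceiling."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical nodes: same peak compute rate, different memory bandwidth.
PEAK_GFLOPS = 3000.0
CONVENTIONAL_GB_S = 100.0
HIGH_BANDWIDTH_GB_S = 300.0

# Hypothetical kernels, described by FLOPs performed per byte moved.
# A streaming update a[i] = b[i] + s * c[i] on 8-byte values does
# 2 FLOPs per 24 bytes of memory traffic.
kernels = {"streaming update": 2 / 24, "dense compute": 50.0}

speedups = {}
for name, intensity in kernels.items():
    before = attainable_gflops(PEAK_GFLOPS, CONVENTIONAL_GB_S, intensity)
    after = attainable_gflops(PEAK_GFLOPS, HIGH_BANDWIDTH_GB_S, intensity)
    speedups[name] = after / before
    print(f"{name}: {before:.1f} -> {after:.1f} GFLOPS ({speedups[name]:.1f}x)")
```

Under these assumed numbers, tripling memory bandwidth roughly triples the attainable rate of the bandwidth-bound kernel and leaves the compute-bound kernel unchanged, with no modification to either kernel.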
Rocinante, Razorback and Tycho: Supercomputer subsystems
“There’s a normal sort of set of systems that are associated with these larger ATS procurements,” said Bonnie. “There is the main system itself, Crossroads in this case, and then smaller supporting systems.”
For Crossroads, these ancillary systems are Rocinante, Razorback and Tycho — named for spacecraft from the science fiction book and television series “The Expanse.”
Because code developers do some of their work in an unclassified environment, a “mini-me” version of Crossroads with the same architecture — just at a smaller scale — is a key component. Called the Application Regression System, it allows users to develop codes and work in an open environment. Rocinante serves that purpose to support Crossroads.
Likewise, there’s a small test bed just for system administrators. Razorback, which regular users can’t access, is an even smaller-scale version of the system that admins use to prepare and test upgrades, patches and other changes before applying them to the larger machines.
Finally, there’s Tycho. Tycho was delivered late last year with almost the same architecture as Crossroads, the difference being that the computing nodes featured more conventional memory rather than the advanced high-bandwidth memory technology. This provides cycles to stockpile simulation users who otherwise might have been waiting on Crossroads. In June, HPC announced that Tycho was available to the three labs via the Advanced Technology Computing Campaign process.
Lujan is quick to acknowledge that bringing Crossroads and its subsystems to life has been the result of a broad group effort. “Literally dozens of Lab staffers from each of HPC’s six groups, as well as staff from across the NNSA trilabs, have made major contributions to this project,” he said.
ATS-5 on the horizon
The major systems the Lab designs and deploys typically have a planning lead time of four to six years, with an optimal operational lifespan of roughly five years. That means that although Crossroads is still being installed, HPC has already spent the past couple of years planning for the eventual deployment of the as-yet-unnamed ATS-5 system.
For an in-depth look at the Lab’s dream machines of past, present and future, check out the Spring 2023 issue of 1663 magazine.