A partnership between Los Alamos National Laboratory and AirMettle offers a solution for efficiently analyzing high-dimensional data sets from large-scale simulation campaigns while protecting the stored data. Performing parts of the analytics near data storage reduces the amount of data that must be moved for analysis, which lowers both the cost of analytics and the time-to-scientific-insight.
“Our scientific large-scale simulations can generate hundreds of petabytes of highly dimensional floating-point data,” said Gary Grider, High Performance Computing division leader at Los Alamos. “But the data associated with a scientific feature of interest can be orders of magnitude smaller than the written data, so a key challenge is quickly and efficiently finding what’s relevant in this sea of data. To optimize this process, we’ve been drawn towards computational storage — processing data in-place and near storage — to eliminate unnecessary data movement while maintaining parallelism and adequate data protection.”
Building on AirMettle’s Real-Time Smart Data Lake (RT-SDL) architecture, Los Alamos and AirMettle have defined a common application programming interface (API) that extends the Non-Volatile Memory Express (NVMe) standard for computational storage devices so that they can support in-place analytics. RT-SDL enables scalable analytics near storage using standard interfaces such as the S3 object storage interface and standard data formats such as Apache Parquet, while integrating rigorous data protection through erasure coding.
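As a rough illustration of the erasure-coding idea mentioned above, the following minimal sketch uses the simplest possible scheme, a single XOR parity shard, to rebuild one lost data shard. This is an assumption chosen for illustration only: the announcement does not say which erasure code RT-SDL uses, and all function names here are hypothetical.

```python
from functools import reduce


def split_into_shards(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-length shards, zero-padding the tail."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    return [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards: list[bytes]) -> bytes:
    """Compute one parity shard as the XOR of all data shards."""
    return reduce(xor_bytes, shards)


def recover_missing(shards_with_gap: list, parity: bytes) -> bytes:
    """Rebuild the single shard marked None from the survivors and parity."""
    survivors = [s for s in shards_with_gap if s is not None]
    return reduce(xor_bytes, survivors, parity)


if __name__ == "__main__":
    # Hypothetical example: four data shards plus one parity shard.
    shards = split_into_shards(b"simulation output block", k=4)
    parity = make_parity(shards)
    damaged = shards.copy()
    damaged[2] = None  # pretend the node holding shard 2 failed
    print(recover_missing(damaged, parity) == shards[2])
```

A single parity shard tolerates only one lost shard. Production systems typically use codes that survive several simultaneous failures, such as Reed-Solomon.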
Scalable and cost-efficient data processing
The extended technology delegates computational tasks down to the device level, so data can be processed in a far more scalable and power-efficient manner. Because the data is reduced near storage, a smaller analytics processing capability is sufficient. These enhancements build on the benefits of AirMettle’s existing architecture.
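The effect of reducing data near storage can be sketched with a toy model. The example below is not AirMettle's API: the node layout, the threshold, and every function name are hypothetical, and the "storage nodes" are just in-memory lists. It compares how many bytes cross the network when the client filters everything itself with how many cross when each node filters locally and returns only the matching values.

```python
import random

BYTES_PER_VALUE = 8   # assumed size of one float64 value
THRESHOLD = 0.999     # hypothetical "feature of interest" cutoff


def make_node_data(n_values: int, seed: int) -> list[float]:
    """Generate one storage node's share of simulated floating-point output."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_values)]


def scan_on_client(nodes: list[list[float]]) -> tuple[list[float], int]:
    """Baseline: every node ships all of its values and the client filters."""
    bytes_moved = sum(len(values) for values in nodes) * BYTES_PER_VALUE
    hits = [v for values in nodes for v in values if v > THRESHOLD]
    return hits, bytes_moved


def scan_near_storage(nodes: list[list[float]]) -> tuple[list[float], int]:
    """Pushdown: each node filters locally and ships only matching values."""
    hits = []
    for values in nodes:
        hits.extend(v for v in values if v > THRESHOLD)
    return hits, len(hits) * BYTES_PER_VALUE


if __name__ == "__main__":
    nodes = [make_node_data(10_000, seed) for seed in range(4)]
    _, baseline_bytes = scan_on_client(nodes)
    hits, pushdown_bytes = scan_near_storage(nodes)
    print(f"matching values: {len(hits)}")
    print(f"bytes moved, client-side filter: {baseline_bytes}")
    print(f"bytes moved, near-storage filter: {pushdown_bytes}")
```

Both paths return the same matching values. Only the volume of data moved differs, and the gap grows as the feature of interest becomes rarer relative to the data written.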
“Accelerating analytics of vast volumes of experiment and simulation data is a key requirement and challenge for the scientific community,” said Donpaul Stephens, founder and CEO of AirMettle, Inc. “AirMettle’s RT-SDL is the first computational storage service with highly scalable in-place processing to accelerate analytics by 100 times or more and significantly reduce network traffic. Users can easily store and retrieve their data in our object store via standard APIs. AirMettle stripes this data across hundreds of storage nodes, eliminating hot spots for both traditional storage access and high-speed parallel analytics.”
Working with Los Alamos, AirMettle recently published an open-source reference design, with APIs, for using analytics in computational storage devices, enabling further scalability and efficiency. AirMettle will present this work at the 2023 Open Compute Project Global Summit in October.