Distributed Parallel File Systems & HPC Strategies - Updated

Distributed parallel file systems have been a core technology for accelerating high-performance computing (HPC) workloads for nearly two decades (Lustre 1.0.0 was released in 2003). Once the purview of supercomputing centers, distributed parallel file systems are now routinely used in mainstream HPC applications.

However, if you have not looked at the available options in a while, you will find new thinking is required when evaluating solutions for today’s workloads. What’s changed? Almost everything.

First, today’s workloads and the data they operate on are different. Applications for modeling physical systems, Big Data analytics, and the training and use of artificial intelligence (AI) and machine learning (ML) typically require very low latency and high sustained IO throughput. They need a storage and interconnect infrastructure that efficiently stages data and delivers it to servers for processing without delay.

Another workload and data change that impacts the choice of a file system is that many applications now must accommodate both large blocks of data and many small files. For instance, training an ML model might use millions of small files, while a Big Data analytics workload runs on one massive dataset. There also is much more use of metadata (consisting of numerous small files) in many workloads today.
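A rough back-of-the-envelope model shows why the mix of small and large files matters. The Python sketch below assumes files are read one after another and that every file pays a fixed open/metadata cost; the function name and all numbers are hypothetical, chosen only for illustration.

```python
def read_time_seconds(num_files, bytes_per_file, per_file_overhead_s,
                      bandwidth_bytes_per_s):
    """Estimate serial read time as per-file overhead plus streaming time.

    Simplified model (assumption): no caching, no parallelism, and a
    constant metadata/open cost for every file.
    """
    if num_files < 0 or bytes_per_file < 0 or per_file_overhead_s < 0:
        raise ValueError("counts, sizes, and overhead must be non-negative")
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    overhead = num_files * per_file_overhead_s
    streaming = num_files * bytes_per_file / bandwidth_bytes_per_s
    return overhead + streaming


# Hypothetical comparison: the same 100 GB as one file or as a million files.
one_large = read_time_seconds(1, 100_000_000_000, 0.001, 10e9)
many_small = read_time_seconds(1_000_000, 100_000, 0.001, 10e9)
print(f"one large file:   {one_large:.1f} s")
print(f"many small files: {many_small:.1f} s")
```

Under these assumed numbers, the small-file case is dominated by per-file overhead rather than bandwidth, which is why metadata performance weighs so heavily in file system selection.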

Second, the processors used for today’s workloads are different. In the past, cluster architecture came down to a choice between fat nodes (typified by their large number of cores per node) and thin (sometimes called skinny) nodes with fewer, faster cores per node. The choice of one node type over the other, driven by the nature of the workload, had major implications for the choice of file system.

Advanced modeling and simulation algorithms, such as finite element simulations, are massively parallel workloads. They often require fat nodes because a job needs to use a very large number of cores at a given time. Workloads like these, where job execution steps run in parallel, typically need a fast, low-latency network and storage with high IO rates and throughput.

Sequential workloads, where each step of a job can be executed independently of the others, benefit from very fast CPUs. Skinny nodes are well-suited for this type of workload. From an infrastructure perspective, storage systems for a cluster built of skinny nodes must have the capacity to handle and manage the typically large datasets analyzed and processed in workloads that run sequentially. Additionally, the storage systems must be able to support numerous read/write operations on many small files without degradation in performance.

In both cases, the file system selected had to support the performance requirements of the applications and data types. Now, there is a relatively new processor choice that places different demands on a file system: the growing use of GPUs for compute-intensive applications (versus their traditional use for rendering). GPUs include many cores that execute in parallel, and they are typically expensive. Making efficient use of a GPU’s capacity requires a file system that can keep those cores continuously fed with data.
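To put a number on what keeping GPUs busy means, a simple estimate multiplies the number of GPUs by how fast each one consumes samples and by the size of each sample. The sketch below is illustrative only; the function name and the figures (8 GPUs, 2,000 samples per second each, 150 KB per sample) are assumptions, not measurements.

```python
def required_read_rate_gb_per_s(num_gpus, samples_per_s_per_gpu, bytes_per_sample):
    """Sustained read rate (GB/s) storage must deliver so no GPU waits on data.

    Simplified model (assumption): every sample is read from storage exactly
    once per pass, with no client-side caching.
    """
    if num_gpus < 0 or samples_per_s_per_gpu < 0 or bytes_per_sample < 0:
        raise ValueError("inputs must be non-negative")
    return num_gpus * samples_per_s_per_gpu * bytes_per_sample / 1e9


# Hypothetical training node: 8 GPUs, 2,000 samples/s each, 150 KB per sample.
rate = required_read_rate_gb_per_s(8, 2000, 150_000)
print(f"storage must sustain about {rate:.1f} GB/s")
```

If the file system cannot sustain the estimated rate, the GPUs sit idle for part of each step, so the shortfall translates directly into wasted accelerator time.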

Finally, the available distributed parallel file systems have changed. Some have been around for many years but now have new names. All are frequently updated with new features and capabilities. Today, the four main choices are:

Lustre: Lustre file system software is available under the GNU General Public License (version 2 only) and provides high-performance file systems for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site clusters. Its most recent update at the time of writing is Lustre 2.13, which was released on December 5, 2019. This release added two new performance-related features: Persistent Client Cache (PCC), which allows direct use of NVMe and NVRAM storage on the client nodes while keeping the files part of the global filesystem namespace, and OST Overstriping, which allows a file to store multiple stripes on a single Object Storage Target (OST) to better utilize fast Object Storage Server (OSS) hardware. Also in this release, the Progressive File Layouts (PFL) functionality was enhanced with Self-Extending Layouts (SEL), which allow file components to be dynamically sized to better deal with flash OSTs that may be much smaller than disk OSTs within the same filesystem.

IBM Spectrum Scale: Formerly known as IBM General Parallel File System (GPFS), IBM Spectrum Scale is a high-performance clustered file system software developed by IBM. It can be deployed in shared-disk or shared-nothing distributed parallel modes. Features of its architecture include distributed metadata (including the directory tree), efficient indexing of directory entries for very large directories, and distributed locking. Additionally, filesystem maintenance can be performed while systems are online, ensuring the filesystem is available more often, thus keeping the HPC cluster itself available longer.

BeeGFS: BeeGFS is a parallel file system that was initially developed at the Fraunhofer Center for High Performance Computing in Germany by a team led by Sven Breuner, who later became the CEO of ThinkParQ, the spin-off company founded in 2014 to maintain BeeGFS. BeeGFS combines multiple storage servers to provide a highly scalable shared network file system with striped file contents. This allows users to overcome the tight performance limitations of single servers, single network interconnects, and a limited number of hard drives. In such a system, the high throughput demands of large numbers of clients can easily be satisfied, and even a single client can benefit from the aggregated performance of all the storage servers in the system.

Gluster: Similar to the way the Lustre name was derived by combining Linux and cluster, Gluster combines GNU and cluster. GlusterFS is a scalable network filesystem suitable for data-intensive tasks such as cloud storage and media streaming. The GlusterFS architecture aggregates compute, storage, and I/O resources into a global namespace. Capacity is scaled by adding nodes or by adding storage to each node. Performance is increased by spreading storage across more nodes. High availability is achieved by replicating data n-way between nodes.
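The striping mentioned in the Lustre and BeeGFS entries above can be illustrated with a minimal Python sketch of round-robin (RAID-0 style) placement. This is a simplified model under stated assumptions, not the actual implementation of either file system: the function name locate_stripe, the 1 MiB stripe size, and the target names are hypothetical. Overstriping is modeled simply by listing the same target more than once in the layout.

```python
MIB = 1 << 20  # 1 MiB, used here as an assumed stripe size


def locate_stripe(offset, stripe_size, targets):
    """Map a file byte offset to (slot, target, offset inside that slot's object).

    `targets` is the ordered layout of storage targets holding the file's
    stripes. Illustrative round-robin model only; names are assumptions.
    """
    if offset < 0 or stripe_size <= 0 or not targets:
        raise ValueError("need offset >= 0, stripe_size > 0, non-empty targets")
    stripe_number = offset // stripe_size         # which stripe-sized chunk of the file
    slot = stripe_number % len(targets)           # round-robin position in the layout
    round_number = stripe_number // len(targets)  # completed passes over the layout
    object_offset = round_number * stripe_size + offset % stripe_size
    return slot, targets[slot], object_offset


# Four targets, one stripe each: consecutive 1 MiB chunks rotate across targets.
layout = ["OST0", "OST1", "OST2", "OST3"]
for chunk in range(6):
    print(chunk, locate_stripe(chunk * MIB, MIB, layout))

# Overstriped layout (assumed model): two stripes on each of two fast targets.
overstriped = ["OST0", "OST1", "OST0", "OST1"]
print(locate_stripe(2 * MIB, MIB, overstriped))
```

Because consecutive chunks land on different targets, a large sequential read or write is served by several storage servers at once, which is where the aggregated throughput comes from.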

Making Sense of the Changes and Choices

Given these changes in workloads, processor choices, and available parallel file systems, how do you match a system to your company’s HPC needs? You can certainly do it yourself. However, with HPC analytics and AI applications becoming mainstream, most companies do not have the time or in-house expertise to evaluate, select, assemble, deploy, and maintain systems from scratch.

A better approach is to team with an industry partner that has expertise in the new applications and solutions, plus best practices developed from a long track record of successful deployments. That’s where we can help.

PSSC Labs has more than 30 years of experience delivering systems that meet the most demanding workloads across industry, government, and academia.

Its offerings include Parallux Storage Clusters, a cost-efficient and scalable storage solution. Key features include:

Scales to tens of petabytes of capacity with no-downtime upgrades
Storage tiers available for warm/cold data requirements
Distributed metadata for high performance and redundancy
Compatible with POSIX File Systems
Extreme performance exceeding 10 GB/sec sustained IO
Factory installation of all necessary hardware, software, and networking components
Only enterprise-grade components used for maximum reliability and performance

This solution and other PSSC Labs systems are designed to meet the compute requirements of modern enterprise applications. Such systems will become increasingly important as companies run analytics on more and more datasets and embrace AI and ML.