Weather Modeling - Building the Perfect HPC Cluster

Constructing an HPC Cluster for weather modeling requires expertise and a great deal of preparation. It’s important to know both the scale and scope of the workloads to be performed before beginning to build out any HPC Cluster platform. It’s also important to note which specific weather modeling applications will be used. Because WRF (the Weather Research and Forecasting model) is one of the most popular and widely used applications in the weather modeling industry, we’ve been able to develop years of experience and expertise building WRF-specific HPC Clusters.

WRF is extremely powerful and requires significant computing resources to operate efficiently. As a result, organizations looking to use WRF see the most success when it is deployed in an on-premises environment. This is especially true when the volume of runs, the required resolutions and the sizes of the models are significant. Running the same WRF models using cloud resources instead of on-premises HPC Clusters would cost anywhere from 300% to 500% more, without the guarantee of answers in a timely manner.

With so many things to consider and weigh when building an HPC Cluster for WRF, we’ve outlined the five most important factors to keep front of mind.

5 Things to Consider when Building an HPC Cluster for WRF

Determine the Number of Processor Cores Required. This step is usually easier for those who have experience running WRF on an existing HPC Cluster, but if that’s not you, no need to worry. Instead, try to determine the complexity of the models, the resolution required and the number of runs per day/week/month/year. By doing so, you should be able to determine the size of the HPC Cluster with some degree of certainty. It’s also important to select the right processor manufacturer and model during this step. A good rule of thumb is to select a processor that offers higher clock speeds (i.e., above 2.4 GHz) and to worry less about the highest number of cores per processor. This will help ensure you are not degrading performance by exceeding the system’s memory bandwidth capabilities. At PSSC Labs, we have primarily used Intel® Xeon® Scalable processors to date, but we’re starting to see more and more interest in AMD EPYC™ processors, due to lower cost and the potential for higher clock speeds. Our initial performance results comparing the two types of processors do not show a clear-cut winner. One more important note: using the Intel® Cluster Studio XE Compiler Suite can provide a huge performance improvement, tipping the scales in favor of the Intel® Xeon® processor if you are on the fence between the two.
Select the Amount of Memory per Core. At the absolute bare minimum, we recommend 2 GB of memory per core, though we don’t deliver many HPC Cluster systems with less than 4 GB of memory per core. Going higher, say to 8 GB, would likely be overkill. Keep in mind that this step of the process is really about configuring the memory for maximum memory bandwidth. With WRF, it’s all about moving data in and out of the processors as quickly as possible, so memory bandwidth has a tremendous impact on overall cluster performance.
Build the Fastest Possible Network Backplane Your Budget Allows. This step should be front of mind for larger clusters (i.e., several thousand cores). As stated in the step above, getting the data to the processor as fast as possible is critical, and having the highest-speed network backplane for an HPC Cluster helps significantly in this effort. We typically employ Intel® Omni-Path and Mellanox® InfiniBand® when building HPC Clusters for WRF. With its latest 200 Gb/sec HDR InfiniBand® backplane, Mellanox (now part of NVIDIA) has taken the lead over Intel® Omni-Path, which tops out at 100 Gb/sec. Mellanox also offers a cost-effective 40-port 200 Gb/sec HDR top-of-rack switch, which is becoming more and more of a standard for our HPC Clusters.
Consider Using a Parallel File System to Increase Performance and Offer Scalability. We’ve worked with several parallel file systems, including HDFS, GlusterFS and Lustre. Each has its own pluses and minuses, but they all offer a significant improvement over a standard NAS or NFS storage server. With a parallel file system, you’re essentially spreading the load of data access across multiple storage nodes and targets. Our Parallux Storage Clusters have achieved over 50 GB/sec sustained reads and writes. This represents a huge improvement over stand-alone NAS servers, which max out around 2.5 GB/sec. Faster access to data means reduced computing times because the processors are kept busy instead of waiting on data.
Build with the Future in Mind. Like most HPC applications, WRF can consume all the computing resources you can throw at it and still keep asking for more. We always allow room in our HPC Clusters to double or triple the number of processor cores over time. Adding nodes to an existing HPC Cluster is not complicated. Using simple tools, like Clonezilla, will allow you to keep expanding your cluster as needs grow and your budget allows.
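
The memory, network and storage figures quoted in the list above lend themselves to some quick back-of-the-envelope arithmetic. The Python sketch below is illustrative only: the function names, the 32-core node and the 1,000 GB dataset are hypothetical and do not come from any PSSC Labs tooling. The throughput numbers are the peak figures mentioned above, so real-world results will likely be lower once protocol overhead and contention are taken into account.

```python
def memory_per_node_gb(cores_per_node: int, gb_per_core: float) -> float:
    """RAM a node needs to meet a given memory-per-core target."""
    if cores_per_node <= 0 or gb_per_core <= 0:
        raise ValueError("cores_per_node and gb_per_core must be positive")
    return cores_per_node * gb_per_core


def gbit_to_gbyte_per_sec(gbit_per_sec: float) -> float:
    """Convert a link rate from Gb/sec to GB/sec (8 bits per byte).

    Ignores encoding and protocol overhead, so this is an upper bound.
    """
    return gbit_per_sec / 8.0


def transfer_seconds(data_gb: float, throughput_gb_per_sec: float) -> float:
    """Idealized time to move data_gb at a sustained throughput."""
    if throughput_gb_per_sec <= 0:
        raise ValueError("throughput_gb_per_sec must be positive")
    return data_gb / throughput_gb_per_sec


if __name__ == "__main__":
    # Hypothetical node size, chosen only for illustration.
    cores_per_node = 32
    for gb_per_core in (2, 4, 8):  # minimum, typical, likely overkill
        total = memory_per_node_gb(cores_per_node, gb_per_core)
        print(f"{gb_per_core} GB/core x {cores_per_node} cores = {total:.0f} GB per node")

    # Link rates quoted in the article, converted to bytes per second.
    for name, gbit in (("HDR InfiniBand", 200), ("Omni-Path", 100)):
        print(f"{name}: {gbit} Gb/sec is at most {gbit_to_gbyte_per_sec(gbit):.1f} GB/sec")

    # Hypothetical dataset size; storage rates are the ones quoted above.
    dataset_gb = 1000
    for name, rate in (("parallel file system", 50.0), ("stand-alone NAS", 2.5)):
        secs = transfer_seconds(dataset_gb, rate)
        print(f"{dataset_gb} GB via {name} at {rate} GB/sec: {secs:.0f} sec")
```

Under these assumptions, a 32-core node at 4 GB per core needs 128 GB of RAM, and moving 1,000 GB takes about 20 seconds at 50 GB/sec versus about 400 seconds at 2.5 GB/sec. Treat the results as rough planning numbers rather than benchmarks.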

Building an HPC Cluster for WRF doesn’t have to be overwhelming, especially when working with an experienced and knowledgeable HPC Cluster manufacturer. PSSC Labs has 25+ years of experience working closely with clients to determine their needs and architect a custom HPC Cluster. For more information, please visit https://pssclabs.com/solutions/weather-modeling.

For questions regarding our HPC Clusters for WRF and other weather models, please feel free to contact us at 4sales@pssclabs.com or (949) 380-7288.