Understanding “Diskless” HPC Clusters
We have built HPC Clusters for more than 25 years. One of our key goals is to deliver a fully functioning system that meets your unique needs and remains operational for as long as possible. We understand that capital expense budgets are tight and that systems may stay in operation well beyond the normal three-year warranty period. In fact, we have seen many clusters still in production nearly ten years after their initial delivery date. Seeing these legacy HPC clusters still running gives us and our clients great pride in knowing that these platforms are some of the most reliable in the industry.
Preparing for the Unexpected
What we typically tell our clients is that any component with a moving part will eventually need to be replaced. So which components inside a server have moving parts? Luckily, there aren’t many. Power supplies, fans, and hard drives are the most obvious ones. A system clearly needs a power supply and fans to operate, but what about a hard drive?
The PowerWulf HPC Clusters are configured in a very traditional manner, with a Head Node and Compute Nodes networked together over a high-speed backplane. The Head Node contains the basic operating system and HPC software tools (our toolset is called CBeST). For protection, we configure the Head Node with two hard drives in a RAID 1 mirror. This is a very inexpensive way to add system protection, and it also provides an easy OS upgrade path. It is probably not possible to remove the Head Node hard drives and still have a functioning HPC Cluster.
But what about the hard drives on the Compute Nodes? Are they really necessary? The answer is no. Compute Nodes really just need a small operating system kernel that can be loaded into memory at boot. The latest version of our CBeST Cluster Management Toolkit allows for this diskless Compute Node configuration. Operation is very simple: boot the Head Node, then boot the Compute Nodes, which connect to the Head Node via PXE boot. The Head Node then pushes a small OS kernel down to each Compute Node, which stores the image in memory. The great thing is that this process is seamless to the end user.
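For readers curious what the Head Node side of a PXE boot can look like, here is a minimal sketch. It is not how CBeST is implemented; it assumes a generic PXELINUX setup, where each node looks for a config file named "01-" plus its MAC address under a pxelinux.cfg directory. The function names, the kernel and initrd filenames (vmlinuz, initramfs.img), and the "diskless" label are all hypothetical placeholders.

```python
import string
from pathlib import Path


def pxelinux_config_name(mac: str) -> str:
    """Return the per-node filename PXELINUX searches for:
    '01-' (Ethernet hardware type) followed by the MAC address,
    lowercase, with dashes between octets."""
    octets = mac.replace("-", ":").lower().split(":")
    valid = len(octets) == 6 and all(
        len(o) == 2 and all(c in string.hexdigits for c in o) for o in octets
    )
    if not valid:
        raise ValueError(f"not a MAC address: {mac!r}")
    return "01-" + "-".join(octets)


def render_diskless_entry(kernel: str, initrd: str) -> str:
    """Render a boot entry that loads a kernel plus an initial RAM disk.
    Any further kernel arguments depend on the image and are omitted."""
    return "\n".join([
        "DEFAULT diskless",
        "LABEL diskless",
        f"  KERNEL {kernel}",
        f"  APPEND initrd={initrd}",
        "",
    ])


def write_node_configs(tftp_root, macs,
                       kernel="vmlinuz", initrd="initramfs.img"):
    """Write one config file per Compute Node under <tftp_root>/pxelinux.cfg.
    The default kernel and initrd names are assumptions for illustration."""
    cfg_dir = Path(tftp_root) / "pxelinux.cfg"
    cfg_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for mac in macs:
        path = cfg_dir / pxelinux_config_name(mac)
        path.write_text(render_diskless_entry(kernel, initrd))
        written.append(path)
    return written
```

Calling write_node_configs("/srv/tftp", ["AA:BB:CC:00:11:22"]) would create /srv/tftp/pxelinux.cfg/01-aa-bb-cc-00-11-22. A real deployment also needs DHCP and TFTP services on the Head Node, which this sketch leaves out.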
What are the benefits and drawbacks of a “Diskless” HPC Cluster? As discussed above, the most obvious benefit is reduced support cost and downtime in the event of a Compute Node disk failure. No one needs to go into the data center to swap a disk, reimage the drive, and reconfigure the network settings and OS. The other benefit is security. Many government agencies we work with require “hardened” systems. Removing the hard drives from the Compute Nodes ensures that no data remains on the nodes that could be extracted by pulling a drive. In terms of drawbacks, the only possible issue is a slower boot time. Our engineers would argue that this is not really an issue with our latest version of CBeST.
All in all, removing the disks from the Compute Nodes of our PowerWulf HPC Cluster should help extend its longevity even more. Who knows, maybe one day we will see one of our clusters reach a 20-year lifespan!
To learn more about our PowerWulf HPC Cluster and the “Diskless” HPC Cluster configuration, email us at 4sales@pssclabs.com.