We have manufactured a lot of HPC Clusters over the past 20+ years… 2237, to be exact. One of our goals is to deliver a fully functioning system that remains 100% operational for as long as possible. We understand that capital expense budgets are tight and systems may be in operation well beyond the normal three-year warranty period. In fact, we have seen many clusters still in production nearly ten years after their delivery date. Seeing these clusters that are a bit long in the tooth gives us great pride, knowing our HPC platforms are some of the most reliable in the industry, but it also gives us a long-term perspective on which components tend to fail over time.
What we typically tell our end users is that any component with a moving part is going to fail over time. So what components inside a server have moving parts? Luckily, there aren't many: power supplies, fans and hard drives are the most obvious ones. Clearly, a system needs a power supply and fans to operate; but what about a hard drive?
Our PowerWulf HPC Clusters are configured in a very traditional manner, with a Head Node and Compute Nodes networked together over a high-speed backplane. The Head Node contains the base operating system and the HPC software tools (our tool set is called CBeST). For protection, we configure the Head Node with two hard drives in a RAID 1 mirror. This is a very inexpensive way to add system protection, and it also provides an easy OS upgrade path. The Head Node is the one place where disks are truly necessary; remove them and you no longer have a functioning HPC Cluster.

But what about the hard drives on the Compute Nodes? Are they really necessary? The answer is no. Compute Nodes really just need a small operating system image that can be loaded into memory on boot. The latest version of our CBeST Cluster Management Toolkit allows for exactly this diskless Compute Node configuration, and operation is very simple: boot the Head Node, then boot the Compute Nodes, which connect to the Head Node via PXE boot. The Head Node pushes a small OS image down to each Compute Node, which holds it entirely in memory. The process is seamless to the end user.
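For readers who want to see what that Head Node mirror looks like in practice, here is a minimal sketch using the standard Linux mdadm tool. The device names are placeholders, and CBeST's actual provisioning steps may differ:

```
# Create a two-disk RAID 1 mirror for the Head Node OS volume.
# /dev/sda1 and /dev/sdb1 are placeholder partitions; adjust to your hardware.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# Watch the initial sync and confirm both members are active.
cat /proc/mdstat
mdadm --detail /dev/md0

# Persist the array so it assembles on boot (config path varies by distribution).
mdadm --detail --scan >> /etc/mdadm.conf
```

If one drive fails, the system keeps running on its mirror partner, and the failed drive can be swapped and re-synced without reinstalling anything.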
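On the Compute Node side, the diskless flow can be illustrated with a generic PXE setup using dnsmasq and PXELINUX. This is only an illustrative sketch, not CBeST's actual mechanism; the interface name, address range, paths and file names are all assumptions:

```
# The Head Node answers DHCP/PXE requests on the cluster network and
# serves a kernel plus an in-memory root filesystem over TFTP.
cat >> /etc/dnsmasq.conf <<'EOF'
# Cluster-facing interface (name is a placeholder)
interface=eth1
dhcp-range=10.0.0.100,10.0.0.200,12h
# Hand each node the PXELINUX bootloader over TFTP
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/var/lib/tftpboot
EOF

# The PXE menu gives each Compute Node a kernel and an initramfs image;
# the initramfs is unpacked straight into RAM, so no local disk is touched.
cat > /var/lib/tftpboot/pxelinux.cfg/default <<'EOF'
DEFAULT diskless
LABEL diskless
  KERNEL vmlinuz
  APPEND initrd=rootfs.img
EOF

systemctl restart dnsmasq
```

Once the Head Node is serving these files, powering on a Compute Node is all it takes: the node PXE-boots, pulls the image into memory, and joins the cluster.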
What are the benefits and drawbacks of a “Diskless” HPC Cluster? As we discussed, the most obvious benefit is reduced support cost and downtime in the event of a Compute Node disk failure; with no disks, no one needs to go into the data center to swap a drive, reimage it, and reconfigure the network settings and the OS. The other benefit concerns security. Many government agencies we work with require “hardened” systems. Removing the hard drives from the Compute Nodes ensures there is no data on the nodes that could be extracted by accessing and removing a drive. In terms of drawbacks, the only possible issue may be a slower boot time, though our engineers would argue this is not really an issue with the latest version of CBeST.
All in all, removing the disks from our PowerWulf HPC Clusters should extend their longevity even further. Who knows, maybe one day we will see one of our clusters reach a 20-year lifespan!