Introduction¶

In high-performance computing (HPC), systems are built from highly reliable components. However, the overall failure rate of supercomputers increases with the component count. Nowadays, petascale machines have a mean time between failures (MTBF) measured in hours and fault tolerance (FT) is a well-known issue. Long-running large applications rely on FT techniques to successfully finish their long executions. Checkpoint/Restart (CR) is a popular technique in which the applications save their state in stable storage, frequently a parallel file system (PFS). Upon a failure, the application restarts from the last saved checkpoint. CR is a relatively inexpensive technique in comparison with the process-replication scheme that imposes over 100% of overhead. However, when a large application is checkpointed, tens of thousands of processes will each write several GBs of data and the total checkpoint size will be in the order of several tens of TBs. Since the I/O bandwidth of supercomputers does not increase at the same speed as computational capabilities, large checkpoints can lead to an I/O bottleneck, which causes up to 25% of overhead in current petascale systems. Post-petascale systems with a significantly larger number of components and an important amount of memory will observe an impact on the system’s reliability. With a shorter MTBF, those systems may require a higher checkpoint frequency and at the same time, they will have significantly larger amounts of data to save. At the same time, we observe deeper storage hierarchies, with high-bandwidth memories (HBM), non-volatile memories (NVM), solid-state drives (SSD) among others. Such a deep storage hierarchy can be complemented with cross-node redundancy schemes, such as checkpoint replication, XOR encoding or even more complex erasure codes such as Reed-Solomon (RS) encoding which is the basis of multi-level checkpointing. FTI is a multi-level checkpointing library with a simple API easy to use and offers a flexible configuration to enable the user to select the checkpointing strategy which fits best to the problem.