APIs

int FTI_Init(const char *configFile, MPI_Comm globalComm)

Initializes FTI.

This function initializes the FTI context and prepares the heads to wait for checkpoints. FTI processes should never get out of this function. In case of a restart, checkpoint files should be recovered and in place at the end of this function.

Return

integer FTI_SCES if successful.

Parameters
  • configFile: FTI configuration file.

  • globalComm: Main MPI communicator of the application.

int FTI_Status()

It returns the current status of the recovery flag.

This function returns the current status of the recovery flag.

Return

integer FTI_Exec.reco.

int FTI_InitType(FTIT_type *type, int size)

It initializes a data type.

This function initializes a data type. The only information needed is the size of the data type; the rest is a black box for FTI. With the HDF5 format, such types are saved as byte arrays.

Return

integer FTI_SCES if successful.

Parameters
  • type: The data type to be initialized.

  • size: The size of the data type to be initialized.

int FTI_InitComplexType(FTIT_type *newType, FTIT_complexType *typeDefinition, int length, size_t size, char *name, FTIT_H5Group *h5group)

It initializes a complex data type.

This function initializes a complex data type. The new type can only consist of fields of flat FTI types (no arrays). The type definition must include:

  • length => number of fields in the new type

  • field[].type => types of the field in the new type

  • field[].name => name of the field in the new type

  • field[].rank => number of dimensions of the field

  • field[].dimLength[] => length of each dimension of the field

Return

integer FTI_SCES if successful.

Parameters
  • newType: The data type to be initialized.

  • typeDefinition: Structure definition of the new type.

  • length: Number of fields in structure

  • size: Size of the structure.

  • name: Name of the structure.

  • h5group: Group of the type.

void FTI_AddSimpleField(FTIT_complexType *typeDefinition, FTIT_type *ftiType, size_t offset, int id, char *name)

It adds a simple field in complex data type.

This function adds a field to the complex data type. Use the offsetof macro to set the offset. The first ID must be 0 and each subsequent ID must increase by 1. If name is NULL, FTI sets the name to “T${id}”. Rank and dimLength are set to 1.

Return

integer FTI_SCES if successful.

Parameters
  • typeDefinition: Structure definition of the complex data type.

  • ftiType: Type of the field

  • offset: Offset of the field (use offsetof)

  • id: Id of the field (start with 0)

  • name: Name of the field (pass NULL for the default name)
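
The offset and ID rules above can be illustrated without calling FTI at all. The following is a minimal sketch, assuming a hypothetical application struct `Particle`; `FieldInfo`, `describe_field` and `describe_particle` are illustrative stand-ins, not FTI types or functions. It only shows how offsetof values, consecutive IDs starting at 0, and the “T${id}” default name fit together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical application struct, used only for illustration. */
typedef struct {
    int    id;
    double mass;
    float  charge;
} Particle;

/* Hypothetical stand-in for one field record; NOT an FTI type. */
typedef struct {
    int    id;        /* field ID: 0, 1, 2, ... */
    size_t offset;    /* byte offset obtained with offsetof */
    char   name[16];  /* explicit name or default "T<id>" */
} FieldInfo;

/* Fill one record. A NULL name yields the default "T<id>" form. */
void describe_field(FieldInfo *f, int id, size_t offset, const char *name)
{
    assert(f != NULL);
    f->id = id;
    f->offset = offset;
    if (name != NULL) {
        snprintf(f->name, sizeof f->name, "%s", name);
    } else {
        snprintf(f->name, sizeof f->name, "T%d", id);
    }
}

/* Describe all fields of Particle; returns the number of fields. */
int describe_particle(FieldInfo fields[3])
{
    describe_field(&fields[0], 0, offsetof(Particle, id), "id");
    describe_field(&fields[1], 1, offsetof(Particle, mass), "mass");
    describe_field(&fields[2], 2, offsetof(Particle, charge), NULL);
    return 3;
}
```

In an application, the same `id` and `offsetof(...)` values would presumably be the ones passed to FTI_AddSimpleField, one call per field.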

void FTI_AddComplexField(FTIT_complexType *typeDefinition, FTIT_type *ftiType, size_t offset, int rank, int *dimLength, int id, char *name)

It adds a complex field to a complex data type.

This function adds a field to the complex data type. Use the offsetof macro to set the offset. The first ID must be 0 and each subsequent ID must increase by 1. If name is NULL, FTI sets the name to “T${id}”.

Return

integer FTI_SCES if successful.

Parameters
  • typeDefinition: Structure definition of the complex data type.

  • ftiType: Type of the field

  • offset: Offset of the field (use offsetof)

  • rank: Rank of the array

  • dimLength: Dimension length for each rank

  • id: Id of the field (start with 0)

  • name: Name of the field (pass NULL for the default name)

int FTI_GetStageDir(char *stageDir, int maxLen)

Places the FTI staging directory path into ‘stageDir’.

This function places the FTI staging directory path in ‘stageDir’. If the allocation size is not sufficient, no action is performed and FTI_NSCS is returned.

Return

integer FTI_SCES if successful, FTI_NSCS else.

Parameters
  • stageDir: pointer to allocated memory region.

  • maxLen: size of allocated memory region in bytes.
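
The buffer contract described above (nothing is written when the buffer is too small) can be sketched as follows. This is not FTI's implementation: `copy_path_checked` and the `SKETCH_SCES`/`SKETCH_NSCS` values are hypothetical stand-ins chosen for the example.

```c
#include <assert.h>
#include <string.h>

/* Stand-in return codes; NOT the actual values of FTI_SCES / FTI_NSCS. */
enum { SKETCH_SCES = 0, SKETCH_NSCS = -1 };

/* Copy 'src' into the caller-allocated buffer 'dst' of 'maxLen' bytes.
   If the path plus its terminating NUL does not fit, leave 'dst'
   untouched and report failure. */
int copy_path_checked(char *dst, int maxLen, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t need = strlen(src) + 1;
    if (maxLen <= 0 || (size_t)maxLen < need) {
        return SKETCH_NSCS;
    }
    memcpy(dst, src, need);
    return SKETCH_SCES;
}
```

A caller following this contract allocates the buffer first, passes its size in bytes as ‘maxLen’, and checks the return value before using the path.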

int FTI_GetStageStatus(int ID)

Returns status of staging request.

This function returns the status of the staging request corresponding to ID. The ID is returned by the function ‘FTI_SendFile’. The status may be one of the five possible statuses:

  • FTI_SI_FAIL - Stage request failed

  • FTI_SI_SCES - Stage request succeeded

  • FTI_SI_ACTV - Stage request is currently being processed

  • FTI_SI_PEND - Stage request is pending

  • FTI_SI_NINI - There is no stage request with this ID

Return

integer Status of staging request on success, FTI_NSCS else.

Parameters
  • ID: ID of staging request.

Note

If the status is FTI_SI_NINI, the ID is either invalid or the request was finished (succeeded or failed). In the latter case, ‘FTI_GetStageStatus’ returns FTI_SI_FAIL or FTI_SI_SCES and frees the stage request resources. On the next call it then returns FTI_SI_NINI.

int FTI_SendFile(char *lpath, char *rpath)

Copies file asynchronously from ‘lpath’ to ‘rpath’.

This function may be used to copy a file local on the nodes via the FTI head process asynchronously to the PFS. The file will not be removed after successful transfer, however, if stored in the directory returned by ‘FTI_GetStageDir’ it will be removed during ‘FTI_Finalize’.

Return

integer Request handle (ID) on success, FTI_NSCS else.

Parameters
  • lpath: Absolute path of the local file.

  • rpath: Absolute path of the remote file.

If staging is enabled but there is no head process, the staging is performed synchronously (i.e. by the calling rank).

int FTI_InitGroup(FTIT_H5Group *h5group, char *name, FTIT_H5Group *parent)

It initializes an HDF5 group.

Initializes a group defined by the user. If parent is NULL, the parent is set to the root group.

Return

integer FTI_SCES if successful.

Parameters
  • h5group: H5 group that we want to initialize

  • name: Name of the H5 group

  • parent: Parent H5 group

int FTI_setIDFromString(char *name)

Searches in the protected variables for a name. If not found it allocates and returns the ID.

This function searches for a given name in the protected variables and returns the respective id for it.

Return

integer id of the variable.

Parameters
  • name: Name of the protected variable to search

int FTI_getIDFromString(char *name)

Searches in the protected variables for a name. If not found it allocates and returns the ID.

This function searches for a given name in the protected variables and returns the respective id for it.

Return

integer id of the variable.

Parameters
  • name: Name of the protected variable to search

int FTI_RenameGroup(FTIT_H5Group *h5group, char *name)

Renames a HDF5 group.

This function renames HDF5 group defined by user.

Return

integer FTI_SCES if successful.

Parameters
  • h5group: H5 group that we want to rename

  • name: New name of the H5 group

int FTI_Protect(int id, void *ptr, int32_t count, FTIT_type type)

It sets/resets the pointer and type to a protected variable.

This function stores a pointer to a data structure, its size, its ID, its number of elements and the type of the elements. This list of structures is the data that will be stored during a checkpoint and loaded during a recovery. It resets the pointer to a data structure, its size, its number of elements and the type of the elements if the dataset was already previously registered.

Return

integer FTI_SCES if successful.

Parameters
  • id: ID for searches and update.

  • ptr: Pointer to the data structure.

  • count: Number of elements in the data structure.

  • type: Type of elements in the data structure.

int FTI_DefineGlobalDataset(int id, int rank, FTIT_hsize_t *dimLength, const char *name, FTIT_H5Group *h5group, FTIT_type type)

Defines a global dataset (shared among application processes)

This function defines a global dataset which is shared among all ranks. In order to assign subsets to the dataset, the user has to call the function ‘FTI_AddSubset’. The parameter ‘did’ of that function corresponds to the global dataset ID defined here.

Return

integer FTI_SCES if successful.

Parameters
  • id: ID of the dataset.

  • rank: Rank of the dataset.

  • dimLength: Dimension length for each rank.

  • name: Name of the dataset in HDF5 file.

  • h5group: Group of the dataset. If Null then “/”.

  • type: FTI type of the dataset.

int FTI_AddSubset(int id, int rank, FTIT_hsize_t *offset, FTIT_hsize_t *count, int did)

Assigns a FTI protected variable to a global dataset.

This function assigns the protected dataset with ID ‘id’ to a global dataset with ID ‘did’. The parameters ‘offset’ and ‘count’ specify the selection of the subset inside the global dataset (‘offset’ and ‘count’ correspond to ‘start’ and ‘count’ in the HDF5 function ‘H5Sselect_hyperslab’. For questions on what they define, please consult the HDF5 documentation).

Return

integer FTI_SCES if successful.

Parameters
  • id: Corresponding variable ID.

  • rank: Rank of the dataset.

  • offset: Starting coordinates in global dataset.

  • count: number of elements for each coordinate.

  • did: Corresponding global dataset ID.
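
As an illustration of what ‘offset’ and ‘count’ might contain, the sketch below computes a row-block selection of a two-dimensional global dataset for each rank. It does not call FTI or HDF5. `hsize_sketch_t` is a stand-in for FTIT_hsize_t (assumed here to be an unsigned integer type), and `row_block_selection` is a hypothetical helper; the row-block layout is one possible decomposition, not one prescribed by FTI.

```c
#include <assert.h>

/* Stand-in for FTIT_hsize_t; assumed to be an unsigned integer type. */
typedef unsigned long long hsize_sketch_t;

/* Split a rows x cols global array into contiguous row blocks, one per
   rank. Leftover rows (rows % nprocs) go to the lowest ranks.
   offset[] holds the starting coordinates of the block, count[] the
   number of elements along each dimension. */
void row_block_selection(hsize_sketch_t rows, hsize_sketch_t cols,
                         int nprocs, int rank,
                         hsize_sketch_t offset[2], hsize_sketch_t count[2])
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    hsize_sketch_t n = (hsize_sketch_t)nprocs;
    hsize_sketch_t r = (hsize_sketch_t)rank;
    hsize_sketch_t base = rows / n;
    hsize_sketch_t extra = rows % n;

    count[0] = base + (r < extra ? 1 : 0);
    count[1] = cols;
    offset[0] = r * base + (r < extra ? r : extra);
    offset[1] = 0;
}
```

For 10 rows over 4 ranks this gives blocks of 3, 3, 2 and 2 rows starting at rows 0, 3, 6 and 8. Each rank would then pass its own ‘offset’ and ‘count’ arrays, with rank 2, to FTI_AddSubset.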

int FTI_UpdateGlobalDataset(int id, int rank, FTIT_hsize_t *dimLength)

Updates global dataset (shared among application processes)

Updates only the rank and the number of elements for each coordinate direction.

Parameters
  • id: ID of the dataset.

  • rank: Rank of the dataset.

  • dimLength: Dimension length for each rank.

int FTI_GetDatasetRank(int did)

Returns the rank of the shared dataset.

Return

integer rank of dataset.

Parameters
  • did: ID of the dataset.

FTIT_hsize_t *FTI_GetDatasetSpan(int did, int rank)

Returns a static array of the dataset dimensions.

Parameters
  • did: ID of the dataset.

  • rank: Rank of the dataset.

int FTI_RecoverDatasetDimension(int did)

Loads the dataset dimension from the checkpoint file into dataset ‘did’.

Parameters
  • did: ID of the dataset.

int FTI_DefineDataset(int id, int rank, int *dimLength, char *name, FTIT_H5Group *h5group)

Defines the dataset.

This function gives FTI all information needed by HDF5 to correctly save the dataset in the checkpoint file.

Return

integer FTI_SCES if successful.

Parameters
  • id: ID for searches and update.

  • rank: Rank of the array

  • dimLength: Dimension length for each rank

  • name: Name of the dataset in HDF5 file.

  • h5group: Group of the dataset. If Null then “/”

int32_t FTI_GetStoredSize(int id)

Returns size saved in metadata of variable.

This function returns the size of the variable with the given ID as saved in the metadata. This may differ from the size of the variable in the program. If this function is called during recovery, it returns the size from the metadata file; if it is called after a checkpoint, it returns the size saved in the temporary metadata. If no size is saved in the metadata, it returns 0.

Return

int32_t Returns size of variable or 0 if size not saved.

Parameters
  • id: Variable ID.

void *FTI_Realloc(int id, void *ptr)

Reallocates dataset to last checkpoint size.

This function loads the checkpoint data size from the metadata file, reallocates memory and updates the data size information.

Return

ptr Pointer if successful, NULL otherwise.

Parameters
  • id: Variable ID.

  • ptr: Pointer to the variable.

int FTI_BitFlip(int datasetID)

Bit-flip injection following the injection instructions.

This function injects the given number of bit-flips, at the given frequency and in the given location (rank, dataset, bit position).

Return

integer FTI_SCES if successful.

Parameters
  • datasetID: ID of the dataset where to inject.

int FTI_Checkpoint(int id, int level)

It takes the checkpoint and triggers the post-ckpt. work.

This function starts by blocking on a receive if the previous ckpt. was offline. Then, it updates the ckpt. information. It writes down the ckpt. data, creates the metadata and the post-processing work. This function is complementary with the FTI_Listen function in terms of communications.

Return

integer FTI_SCES if successful.

Parameters
  • id: Checkpoint ID.

  • level: Checkpoint level.

int FTI_InitICP(int id, int level, bool activate)

Initialize an incremental checkpoint.

This function defines the environment for the incremental checkpointing mechanism. The iCP mechanism consists of three functions: FTI_InitICP, FTI_AddVarICP and FTI_FinalizeICP. The two functions FTI_InitICP and FTI_FinalizeICP define the iCP region, within which the user may write the protected variables in any order. The iCP region is active when the expression passed through ‘activate’ evaluates to TRUE.

Return

integer FTI_SCES if successful.

Parameters
  • id: Checkpoint ID.

  • level: Checkpoint level.

  • activate: Boolean expression.

Note

This function is non-blocking for POSIX, FTI-FF and HDF5, but blocking for MPI-IO. This is due to the collective open call in MPI-IO.
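
The ‘activate’ argument is an ordinary boolean expression evaluated by the caller. One common choice, shown here purely as an assumption, is an interval test on the iteration counter. `icp_due` and `count_active_regions` are hypothetical helpers and do not call FTI.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical activation rule: the iCP region is active on every
   'interval'-th iteration, but never on iteration 0. */
bool icp_due(int iteration, int interval)
{
    assert(interval > 0);
    return iteration > 0 && iteration % interval == 0;
}

/* Count how many iterations in [0, iterations) would open an active
   iCP region under the rule above. */
int count_active_regions(int iterations, int interval)
{
    int active = 0;
    for (int i = 0; i < iterations; i++) {
        if (icp_due(i, interval)) {
            active++;
        }
    }
    return active;
}
```

In an application loop, the result of such an expression would be passed as ‘activate’ to FTI_InitICP, so that the FTI_AddVarICP calls that follow only write data on the selected iterations.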

int FTI_AddVarICP(int varID)

Write variable into the CP file.

With this function, the user may write the protected datasets in any order into the checkpoint file. However, before the call to FTI_FinalizeICP, all protected variables must have been written into the file.

Return

integer FTI_SCES if successful.

Parameters
  • varID: Protected variable ID.

int FTI_FinalizeICP()

Finalize an incremental checkpoint.

This function finalizes an incremental checkpoint. In contrast to InitICP, this function is collective on the communicator FTI_COMM_WORLD and blocking.

Return

integer FTI_SCES if successful.

int FTI_Recover()

It loads the checkpoint data.

This function loads the checkpoint data from the checkpoint file and it updates some basic checkpoint information.

Return

integer FTI_SCES if successful.

int FTI_Snapshot()

Takes an FTI snapshot or recovers the data if it is a restart.

This function loads the checkpoint data from the checkpoint file in case of a restart. Otherwise, it checks whether the current iteration requires checkpointing. If it does, it determines the checkpoint level, writes the data to the files and informs the head of the node that a checkpoint has been taken. The checkpoint ID and counters are updated.

Return

integer FTI_SCES if successful.

int FTI_Finalize()

It closes FTI properly on the application processes.

This function notifies the FTI processes that the execution is over, frees some data structures and it closes. If this function is not called on the application processes the FTI processes will never finish (deadlock).

Return

integer FTI_SCES if successful.

int FTI_RecoverVar(int id)

Recovers given variable.

Return

integer FTI_SCES if successful.

Parameters
  • id: ID of the variable to be recovered.
