Publication

Files as first-class objects in fault-tolerant concurrent systems

Matthews, Robert Edwin
Abstract
Concurrent systems are used in applications where multiple processors are needed to complete tasks within a reasonable amount of time, or where the data sets involved will not fit within the main memory of a single computer. Because of their reliance on multiple machines, such systems are proportionally more vulnerable to both hardware and software induced failures. Fault-tolerance schemes are used to recover some earlier consistent state of the system after such a failure.
One important technique used to achieve fault-tolerance is checkpointing and rollback-recovery. In this thesis, we present a method for efficiently and transparently incorporating the part of the process state contained in the file system into process checkpoints, and we show how recovery of consistent versions of the file system and processes may be done after a failure. We present the details of a prototype system which implements our method.
We show that by using the special properties of the log-structured file system, the class of programs which are amenable to checkpointing and rollback-recovery schemes can be expanded to include those that use files. We impose no a priori restriction on the types of file system operations that can be done, and we demonstrate that our scheme does not impose significant failure-free overhead on the computation.
Date
2004-01-01
Department
Computer Science
DOI
https://dx.doi.org/doi:10.21220/s2-qa9t-5z25