Fault-Tolerance in the Borealis Distributed Stream Processing System

Magdalena Balazinska, Hari Balakrishnan, Samuel Madden, Mike Stonebraker
ACM SIGMOD Conf., Baltimore, MD, June 2005

We present a replication-based approach to fault-tolerant distributed stream processing in the face of node failures, network failures, and network partitions. Our approach aims to reduce the degree of inconsistency in the system while guaranteeing that available inputs capable of being processed are processed within a specified time threshold. This threshold allows a user to trade availability for consistency: a larger time threshold decreases availability but limits inconsistency, while a smaller threshold increases availability but produces more inconsistent results based on partial data. In addition, when failures heal, our scheme corrects previously produced results, ensuring eventual consistency. Our scheme uses a data-serializing operator to ensure that all replicas process data in the same order, and thus remain consistent in the absence of failures. To regain consistency after a failure heals, we experimentally compare approaches based on checkpoint/redo and undo/redo techniques and illustrate the performance trade-offs between these schemes.

[PDF (716KB)]

Bibtex Entry:

@inproceedings{balazinska2005fault-tolerance,
   author =       "Magdalena Balazinska and Hari Balakrishnan and Samuel Madden and Mike Stonebraker",
   title =        "{Fault-Tolerance in the Borealis Distributed Stream Processing System}",
   booktitle =    {ACM SIGMOD Conf.},
   year =         {2005},
   month =        {June},
   address =      {Baltimore, MD}
}