To understand "why real distributed systems (still) experience failures", we analyzed
hundreds of real failures from widely used distributed systems. Surprisingly, almost all
of the catastrophic failures were caused by incorrect handling of non-fatal errors, and
many of the underlying bugs are trivial. We developed a simple static checker, Aspirator,
capable of locating these bugs. It discovered over 200 serious new bugs in Hadoop, HBase,
ZooKeeper, and other systems, all of which have since been fixed.
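
The abstract does not spell out which rules Aspirator applies, so the following is only an illustrative sketch of the kind of check a simple static checker could perform. It assumes that one "trivial" mistake is an error handler that silently ignores the error, and it works on Python source rather than on the Java code of the systems named above. The names `find_swallowed_errors`, `_is_empty`, and `SAMPLE` are hypothetical and do not come from Aspirator.

```python
import ast


def _is_empty(body):
    """Return True if a handler body contains only `pass` or bare constants."""
    for stmt in body:
        if isinstance(stmt, ast.Pass):
            continue
        # A bare constant expression such as `...` or a string does nothing.
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            continue
        return False
    return True


def find_swallowed_errors(source):
    """Return the line numbers of `except` handlers that ignore the error.

    Assumption for this sketch: a handler that does nothing at all is
    treated as a likely bug. A real checker would need more rules.
    """
    tree = ast.parse(source)
    flagged = [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and _is_empty(node.body)
    ]
    return sorted(flagged)


# Hypothetical code under analysis: the first handler drops the error,
# the second one reports it and re-raises.
SAMPLE = '''
def replicate(block):
    try:
        send(block)
    except IOError:
        pass

def flush(log):
    try:
        log.sync()
    except OSError as err:
        report(err)
        raise
'''

if __name__ == "__main__":
    for line in find_swallowed_errors(SAMPLE):
        print(f"line {line}: error handler ignores the exception")
```

The sketch only parses the sample and never runs it, so the undefined helpers inside `SAMPLE` are harmless. Only the first handler is flagged, because the second one reports the error and re-raises it.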