DSRG: Research

Non-intrusive Failure Diagnosis

This project aims to automate the diagnosis of failures and performance slowdowns using only the unstructured logs that production systems already output, without any modification to those systems. We also build tools that automatically improve the quality of logs. Our log analysis tool has already been licensed and used by large IT companies, and we have contributed many log improvement patches to open-source systems.

JVM Warm-up Overhead in Distributed Systems

Most of today’s big data analytics systems run on the JVM. Surprisingly, we found that JVM warm-up is frequently the bottleneck: for example, Spark queries spend an average of 21 seconds in warm-up. We implemented HotTub, a new JVM that eliminates warm-up overhead by reusing a pool of already-warm JVMs across multiple applications.
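
The pooling idea can be illustrated with a toy model. This is only a sketch under stated assumptions, not HotTub's implementation: `WarmWorkerPool`, `run_jobs`, and the abstract `warmup_cost` units are hypothetical names introduced here, and a "worker" is just an integer standing in for a warm runtime.

```python
class WarmWorkerPool:
    """Toy stand-in for a pool of reusable, already-warm runtimes (hypothetical)."""

    def __init__(self, warmup_cost):
        self.warmup_cost = warmup_cost  # abstract cost paid once per cold start
        self.idle = []                  # warm workers waiting to be reused
        self.total_warmup = 0           # accumulated warm-up cost so far
        self.next_id = 0

    def acquire(self):
        # Reuse a warm worker when one is idle; otherwise pay for a cold start.
        if self.idle:
            return self.idle.pop()
        self.total_warmup += self.warmup_cost
        worker = self.next_id
        self.next_id += 1
        return worker

    def release(self, worker):
        # Return the worker to the pool instead of discarding it.
        self.idle.append(worker)


def run_jobs(pool, n_jobs):
    """Run n_jobs one after another and return the total warm-up cost paid."""
    for _ in range(n_jobs):
        worker = pool.acquire()
        # ... the job itself would run on the worker here ...
        pool.release(worker)
    return pool.total_warmup
```

In this model, sequential jobs pay the warm-up cost once, whereas starting a fresh runtime per job would pay it once per job. The sketch ignores everything that makes reuse hard in a real JVM, such as resetting state between applications.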

Simple Testing Can Prevent Critical Failures

To understand "why real distributed systems (still) experience failures", we analyzed hundreds of real failures from widely used distributed systems. Surprisingly, almost all of the catastrophic failures are caused by incorrect handling of non-fatal errors, and many of the bugs are trivial. We developed a simple static checker, Aspirator, that can locate these bugs. It discovered over 200 serious new bugs in Hadoop, HBase, ZooKeeper, and other systems; these have already been fixed.
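
To make the idea of checking error-handling code concrete, the sketch below flags one illustrative pattern: a Java `catch` block whose body is empty or contains only comments. It is a minimal text-based heuristic under stated assumptions, not Aspirator's implementation; the choice of pattern, the regular expression, and the name `find_empty_catches` are assumptions made here for illustration.

```python
import re

# Hypothetical heuristic: a catch clause whose body holds only whitespace
# and comments. Line comments must end at a newline and block comments at
# the first "*/", so a "}" inside a comment is not mistaken for the end
# of the block.
EMPTY_CATCH = re.compile(
    r"catch\s*\(([^)]*)\)\s*\{"
    r"(?:\s|//[^\n]*\n|/\*(?:[^*]|\*(?!/))*\*/)*"
    r"\}"
)


def find_empty_catches(java_source):
    """Return (line_number, exception_declaration) for each empty handler."""
    findings = []
    for match in EMPTY_CATCH.finditer(java_source):
        # 1-based line number of the "catch" keyword.
        line = java_source.count("\n", 0, match.start()) + 1
        # Collapse whitespace in the declaration, e.g. "IOException e".
        declaration = " ".join(match.group(1).split())
        findings.append((line, declaration))
    return findings
```

Because this works on raw text, it can be fooled by string literals or unusual formatting; a real checker would analyze parsed code or bytecode rather than match patterns in source text.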