Faculty Recruiting Support CICS

Automating Failure Diagnosis for Distributed Systems

Tuesday, 04/14/2020 2:30pm to 3:30pm
Virtual
Seminar
Speaker: Yongle Zhang

To view this live seminar via Zoom, visit: https://zoom.us/j/9274672434

A password is now required to attend this event; if you did not receive it via email, please contact Joyce Mazeski at jmazeski@cs.umass.edu or Randy Barrios at events@cs.umass.edu.

Abstract: Distributed software systems have become the backbone of Internet services. Consequently, failures in production distributed systems have severe consequences. A 63-minute outage at Amazon in 2018 caused a $100 million loss in revenue. Moreover, failures become more frequent as software systems grow more complex. The year 2019 saw noticeably more Internet outages and is sometimes called the "year of outages".

Diagnosing such failures in distributed systems at data-center scale is a critical yet notoriously difficult task because these systems are complex: numerous threads, processes, and nodes communicate concurrently. Existing diagnosis techniques are either intrusive and incur non-negligible performance overhead in a production environment, or face scalability challenges when applied to complex software systems.

A promising approach is to replicate how developers diagnose these failures. Guided by this notion, this talk will describe two tools, namely Pensieve and Kairux, which automate two major tasks of failure diagnosis: failure reproduction and root cause localization. Given the logs and code of a distributed system that has failed (in production), Pensieve is capable of formulating a minimal set of operations necessary to reproduce the failure, and Kairux can further pinpoint the single instruction that is the root cause.

Biography: Yongle Zhang is a PhD candidate in the Distributed Systems Research Group at the University of Toronto, working with Prof. Ding Yuan. His research interest is in systems software with a focus on improving the reliability and availability of complex, real-world systems.

 
