Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code

Abstract

Probabilistic classification has shown success in detecting known types of software bugs. However, works following this approach tend to require a large number of samples to train their models. We present a new machine learning-based bug detection technique that does not require any external code or samples for training. Instead, our technique learns from the very codebase on which the bug detection is performed, and therefore obviates the cumbersome task of gathering and cleansing training samples (e.g., buggy code of certain kinds). The key idea behind our technique is a novel two-step clustering process applied to a given codebase. This clustering process identifies code snippets in a project that are functionally similar yet appear in inconsistent forms. Such inconsistencies are found to cause a wide range of bugs, from missing checks to unsafe type conversions. Unlike previous works, our technique is generic and not specific to one type of inconsistency or bug. We prototyped our technique and evaluated it on five popular open-source software projects, including QEMU and OpenSSL. With a minimal amount of manual analysis of the inconsistencies detected by our tool, we discovered 22 new unique bugs, even though many of these programs constantly undergo bug scans and new bugs in them are believed to be rare.
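
The two-step idea described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the snippet representation (lists of `call:` and `check:` strings), the helper names `functional_key` and `find_inconsistent`, the `min_group` threshold, and the sample data are all assumptions made for illustration. The first step groups snippets that appear to do the same thing; the second step splits each group by exact form and flags the minority forms as candidate inconsistencies.

```python
from collections import defaultdict


def functional_key(ops):
    # Hypothetical notion of "functionality": the set of operations a snippet
    # performs, ignoring any checks it does or does not include.
    return frozenset(op for op in ops if not op.startswith("check:"))


def find_inconsistent(snippets, min_group=3):
    # Step 1: coarse clustering by functionality.
    groups = defaultdict(list)
    for name, ops in snippets.items():
        groups[functional_key(ops)].append(name)

    flagged = []
    for members in groups.values():
        if len(members) < min_group:
            continue  # too few peers to say what the "usual" form is
        # Step 2: fine clustering by exact form within one functional group.
        forms = defaultdict(list)
        for name in members:
            forms[tuple(snippets[name])].append(name)
        majority_size = max(len(names) for names in forms.values())
        # Snippets whose form is rarer than the majority form are candidates
        # for manual review.
        for names in forms.values():
            if len(names) < majority_size:
                flagged.extend(names)
    return sorted(flagged)


# Made-up sample data: three snippets check the allocation, one does not.
SNIPPETS = {
    "copy_a": ["call:malloc", "check:null", "call:memcpy"],
    "copy_b": ["call:malloc", "check:null", "call:memcpy"],
    "copy_c": ["call:malloc", "check:null", "call:memcpy"],
    "copy_d": ["call:malloc", "call:memcpy"],
    "log_e": ["call:printf"],
}

if __name__ == "__main__":
    print(find_inconsistent(SNIPPETS))
```

In this toy data, `copy_d` performs the same operations as its three peers but omits the null check, so it is the snippet that gets flagged. A real system would need a far richer notion of functional similarity than a set of call names.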

Publication
Proceedings of the 30th USENIX Security Symposium
Mansour Ahmadi
Postdoc (2018-2020)

My research focuses on applying machine learning to systems security, especially vulnerability discovery and malware detection and classification.

Reza Mirzazade farkhani
Graduate Research Assistant

My research aims to discover software vulnerabilities and solve memory safety issues via hardware-assisted approaches.

Ryan Williams
Graduate Research Assistant

My research interests are in the applications of formal methods to program analysis.