Synthesis from Program Traces: Data-Driven Synthesis

Project Website http://groups.csail.mit.edu/cap/wiki/doku.php?id=datadriven%3Astart

Modern programming relies heavily on rich extensible frameworks that pack large amounts of functionality for programmers to draw upon. The advent of these frameworks has revolutionized software development, making it possible to write applications with rich functionality by simply piecing together pre-existing components with relatively little new code. But the productivity benefits come at a price: a steep learning curve as programmers struggle to master a complex framework with tens of thousands of components and millions of lines of code. Eclipse, for example, has over 33 million lines of code. The challenge of scale is compounded by the flexibility of these frameworks, which is manifested in the extensive use of dynamic dispatch and reflection. Together, these features make detailed semantic analysis impractical, limiting the applicability of traditional synthesis methodologies. On the other hand, the reusable nature of the framework implies that the same functionality is used in different combinations by different applications; this means that it is possible for the synthesizer to discover the correct usage of a component by analyzing how it has been used by different clients in different contexts.

The problem is especially acute in object-oriented frameworks because of the way functionality is atomized into a large number of classes and methods, each dealing with a very specific aspect of the computation. Erich Gamma and Kent Beck, in their book on writing Eclipse plug-ins, go as far as to suggest the following rule: "Monkey See/Monkey Do Rule: Always start by copying the structure of a similar plug-in."

We are building a new synthesis tool for Java that is based on this empirical approach to synthesis. The tool is designed to automatically synthesize a template for the glue code that enables interaction between many components of the system.

The system consists of the following components:
(*) The DeLight data layer collects and indexes program traces and provides a query API for building program understanding and synthesis tools.
(*) MatchMaker is the synthesis front end that uses the data layer to synthesize glue code templates.
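
To make the trace-based idea concrete, here is a minimal sketch of how the usage of two components might be inferred from traces of several clients. It is not the DeLight query API or the MatchMaker algorithm: the class TraceIndexSketch, the method commonCallsBetween, the representation of a trace as a list of call names, and all call names in the example are assumptions made for illustration only. The sketch keeps the calls that every matching trace performs between a first use of one component and a later use of another, on the assumption that calls shared by all clients are likely to be required glue code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not the DeLight or MatchMaker API.
class TraceIndexSketch {

    // Returns the calls that occur between `from` and the next `to` in every
    // trace that contains both (with `from` first), in first-seen order.
    static List<String> commonCallsBetween(List<List<String>> traces, String from, String to) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int matching = 0;
        for (List<String> trace : traces) {
            int start = trace.indexOf(from);
            if (start < 0) continue;                 // trace never uses `from`
            int end = trace.subList(start + 1, trace.size()).indexOf(to);
            if (end < 0) continue;                   // `to` never follows `from`
            matching++;
            // Deduplicate within a trace so each trace votes once per call.
            Set<String> between =
                new LinkedHashSet<>(trace.subList(start + 1, start + 1 + end));
            for (String call : between) counts.merge(call, 1, Integer::sum);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == matching) result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up traces from three imaginary clients of an editor framework.
        List<List<String>> traces = List.of(
            List.of("Editor.<init>", "Config.getScanner", "Editor.setConfig", "Scanner.nextToken"),
            List.of("Editor.<init>", "Logger.log", "Config.getScanner", "Editor.setConfig", "Scanner.nextToken"),
            List.of("Editor.<init>", "Editor.dispose"));
        System.out.println(commonCallsBetween(traces, "Editor.<init>", "Scanner.nextToken"));
    }
}
```

Running main prints [Config.getScanner, Editor.setConfig]: the logging call appears in only one of the two matching traces and is dropped, and the third trace is ignored because it never reaches the scanner. A real tool would need richer trace data (receivers, arguments, call nesting) than this flat list of names.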