Modular Program Analysis for Data Structure Consistency

Project Website http://people.csail.mit.edu/rinard/modular_analysis/

The goal of this project is to statically analyze the program to verify that its data structures remain consistent. We are interested in two kinds of consistency:

(*) Local Consistency: Each individual data structure (such as a tree, list, or hash table) maintains its representation invariant. For example, the representation invariant for a doubly-linked list states that following the next pointer then the prev pointer produces a pointer to the original list element. The representation invariant for a hash table states that each entry is correctly indexed. Representation invariants are typically quite detailed and universally true for all instantiations of the given data structure.
(*) Global Consistency: Global consistency properties cut across traditional encapsulation boundaries to involve multiple data structures. A global consistency property might state, for example, that two data structures contain disjoint sets of objects, or that one data structure contains a subset of the objects present in another data structure. Global consistency properties often capture some key requirement of the application domain that must be reflected in the way that the application represents data from that domain.

Two problems complicate the realization of this goal. The first problem is scalability -- analyses that are capable of verifying our target class of sophisticated consistency requirements simply do not scale to sizable programs and there is little hope that they will ever do so.

The second problem is diversity. There is an enormous range of data structures with very different consistency properties. Consider, for example, a list and a hash table. The list consistency properties involve multiple objects linked together with pointers, while the hash table consistency properties involve arrays and array indices. These different kinds of properties require analyses that focus on very different kinds of basic properties.

This project adopts a new perspective on the data structure consistency problem: we instead propose a technique that developers can use to apply multiple pluggable analyses to the same program, with each analysis applied to the modules for which it is appropriate. The key features of this technique include:

(*) Modular Analysis and Shared Objects: In our approach, the program contains a collection of modules, each of which encapsulates the implementation of some of the data structures. Instead of attempting to analyze the entire program, each analysis operates on a single module. By focusing each analysis on only those parts of the program that are relevant for the properties it is designed to verify, we enable the application of precise analyses to programs composed of multiple modules.

One factor that complicates this approach is the need for objects to participate in multiple data structures and therefore the need to share objects between modules analyzed by different algorithms. To eliminate the possibility that one module may corrupt another's data structures (and to ensure that each algorithm analyzes all of the relevant code), modules encapsulate fields {\em (and not objects):} the underlying language prevents one module from accessing the fields of another module. Each module therefore encapsulates all of the fields required to implement its data structure; objects that participate in multiple data structures from multiple modules contain fields from each of these modules.

(*) Sets and Interfaces: Each module has an implementation and an interface. It is the responsibility of the analysis to verify that the implementation correctly implements the interface. Instead of exposing the implementation details of the encapsulated data structure, the interface uses a collection of abstract sets to summarize the effect of each procedure. This collection of sets characterizes how objects participate in various data structures. For example, the interface for a linked list might have an abstract set that contains all of the objects in the linked list. The interface for the insert procedure would indicate that the procedure adds the inserted object into the set; the interface for the remove procedure would indicate a corresponding removal from the set.

(*) Abstraction Modules and Data Structure Consistency: Each analysis uses an abstraction module to establish the connection between the module's implementation and its interface. This abstraction module enables each analysis to translate the set membership properties of those objects that cross module boundaries back into concrete data structure properties. These concrete properties are often crucial for preserving the internal consistency of the data structures.

For example, a list insert procedure may require that an object to be inserted must not already be a member of the abstract set containing all of the objects in the list. This set membership property, in turn, may enable the analysis of the insert procedure to recognize that the object is not already present in the list, which may be the precondition required to preserve the internal consistency of the list data structure.

(*) Properties Involving Multiple Modules: Systems often maintain important data structure consistency invariants that involve multiple data structures and cross encapsulation boundaries. For example, some systems may require the sets of objects that participate in two data structures to be disjoint; others may require every object present in one data structure to also be present in another. Such invariants are typically violated (temporarily and legitimately) as modules execute a coordinated sequence of data structure updates. Because these invariants involve objects that participate in different data structures encapsulated in different modules analyzed by different analyses, the analyses must somehow interoperate if they are to successfully verify the invariant.

The approach uses analysis scopes to identify invariants that involve sets from multiple modules. Some of these modules are exported and can be invoked from outside the scope; the other modules may be invoked only from within the scope. When our analysis processes an exported module it ensures that, if the invariant holds upon entry to the exported modules, then it is correctly restored upon exit.

For invariants that involve multiple modules, the set abstraction enables different analyses to combine their results to verify that the program properly coordinates its data structure operations to preserve the invariant. This approach therefore enables the application of arbitrarily sophisticated analyses to the modules that require the precision such analyses can bring, while supporting the use of simpler, more efficient analyses in the remainder of the program.

Together, these features enable the focused application of a full range of precise analyses to programs that contain multiple data structures encapsulated in multiple modules. They promote the development of a range of pluggable analyses that developers can deploy as necessary to verify important data structure consistency properties. Abstract sets enable different analyses to communicate and interoperate to verify properties that cross module boundaries to involve multiple data structures. Our approach supports the appropriately verified participation of objects in multiple data structures, including patterns in which objects migrate sequentially through different data structures and patterns in which objects participate in multiple data structures simultaneously.