Entry Date:
September 28, 2002

Project IRIS (Infrastructure for Resilient Internet Systems): Scalable, Robust, Peer-to-Peer Systems

Principal Investigator M. Frans Kaashoek

Co-investigators Hari Balakrishnan, Barbara Liskov, David Karger

Project Website http://project-iris.net/


The Internet has demonstrated the power of a global naming and communications infrastructure: each new application need not implement its own network, but can simply assume a shared global communication substrate. The Infrastructure for Resilient Internet Systems (IRIS) project is developing a decentralized infrastructure, based on distributed hash tables (DHTs), that will enable a new generation of large-scale distributed applications. DHTs are scalable, achieving large system sizes without incurring undue overhead. They are self-configuring, automatically incorporating new nodes without manual intervention or oversight. And they provide a simple, flexible interface that many applications can use simultaneously.

Distributed hash tables provide two key functions. First, they allow nodes to agree upon and efficiently locate a rendezvous point for a given object without central coordination; this is done using lookup protocols such as Chord and Accordion. Our recent research has sought to build real implementations of these protocols and to ensure that they use available (but constrained) resources efficiently. Second, DHTs provide a simple storage interface, allowing applications to put a data object and later get it back using a key provided by the DHT. Our implementation of this storage interface is called DHash.
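To illustrate the rendezvous function, here is a minimal consistent-hashing sketch in Python (names such as Ring and sha1_id are ours for illustration; the real protocols are Chord and Accordion): every node hashes object keys and node addresses into the same circular identifier space, so all nodes independently agree that an object's rendezvous point is the first node at or after its key on the ring.

    import hashlib
    from bisect import bisect_left

    def sha1_id(s: str) -> int:
        """Map a string into a 160-bit circular identifier space."""
        return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

    class Ring:
        """Toy ring in which every node knows every other node. Chord
        reaches the same agreement with only O(log n) routing state per
        node and O(log n) hops per lookup."""
        def __init__(self, addrs):
            self.nodes = sorted((sha1_id(a), a) for a in addrs)

        def successor(self, key: int) -> str:
            """Rendezvous point for key: the first node whose identifier
            is >= key, wrapping around the identifier circle."""
            ids = [nid for nid, _ in self.nodes]
            i = bisect_left(ids, key) % len(self.nodes)
            return self.nodes[i][1]

    ring = Ring(["node-a:4000", "node-b:4000", "node-c:4000"])
    print(ring.successor(sha1_id("some object")))  # identical on every node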

We are building applications that take advantage of this infrastructure to provide benefits and features that would otherwise be difficult to achieve. Among these are a low-overhead Usenet server and a distributed and cooperative research library.

Maintaining data under churn in DHash -- DHash is a system for providing block storage, built using the Chord lookup protocol to organize nodes and data. DHash aims to provide efficient reads and writes, data integrity, and data availability. That is, it seeks to be usable on a day-to-day basis. Early work defined the basic interface and addressed the problem of data integrity by relating the lookup key to the stored data using a cryptographic hash function or public-key signature. Subsequent work focused on reducing lookup latency and achieving high throughput.
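As a minimal sketch of the integrity mechanism for content-hash blocks (the key is the cryptographic hash of the block; the put/get wrappers below are illustrative stand-ins, not DHash's actual API), a fetched block can always be checked against the key that names it. Mutable data instead uses a public key as the lookup key and a signature over the contents.

    import hashlib

    store = {}  # stand-in for the distributed block store

    def put(data: bytes) -> bytes:
        """Store a content-hash block; the key is derived from the data."""
        key = hashlib.sha1(data).digest()
        store[key] = data
        return key

    def get(key: bytes) -> bytes:
        """Fetch a block and verify it: a corrupted or forged block
        cannot hash to the key that names it."""
        data = store[key]
        if hashlib.sha1(data).digest() != key:
            raise ValueError("integrity check failed")
        return data

    k = put(b"an immutable block")
    assert get(k) == b"an immutable block"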

Current work tackles the problem of maintaining the durability and availability of data. The potential here is great: the system's many distributed participants contribute substantial storage and bandwidth resources, and if those resources could be harnessed efficiently, DHash could provide both robustness and high storage capacity. The challenge is dealing with failures: nodes may go offline temporarily, leave the system permanently, or occasionally lose the contents of their disks to a disk failure.

Replication is traditionally used to ensure that data remains both available (reachable on the network) and durable (stored intact, though perhaps not reachable at the moment). To cope with failures, DHash needs an intelligent strategy for managing replicas. Availability is a harder goal than durability: availability can be threatened by any network, processor, software, or disk failure, while durability is usually threatened only by disk failures. Local fault-tolerant storage systems such as RAID provide availability and durability relatively inexpensively, since disk failures are rare and plenty of local bus or network bandwidth is available for creating new replicas. In a large distributed system, failures are more common and bandwidth is more expensive. Further, it is difficult to distinguish temporary failures (which do not affect durability) from permanent failures (which do).

To handle failures efficiently, we have modeled the failure and repair processes to better understand the parameters that affect data durability. The model captures the notion that repairs and failures occur at given rates and shows the relationship between those rates and the number of existing replicas. We have shown that threshold-based maintenance schemes can be tuned to respond, in effect, only to permanent failures without explicitly distinguishing permanent from transient failures: by creating a new replica on any observed failure, but always tracking the locations of all replicas, the system accumulates extra replicas in response to transient failures, and these extra replicas mask the impact of later transient failures. That is, sufficiently many copies are made that the number of available copies usually stays above the repair threshold.
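The policy can be captured in a few lines. The sketch below is a hypothetical rendering of the idea, not the DHash code: repair whenever fewer than a threshold number of replicas are reachable, but never forget past replica locations, so copies on nodes that return from a transient failure count again.

    import random

    REPAIR_THRESHOLD = 3  # minimum number of reachable replicas

    class BlockState:
        def __init__(self):
            self.holders = set()  # every node ever given a replica; never pruned

        def reachable(self, up_nodes: set) -> int:
            return len(self.holders & up_nodes)

    def maintain(block: BlockState, up_nodes: set):
        """Reactive threshold repair: copy only while too few replicas are
        reachable. A transient failure triggers one extra copy; once the
        absent node returns, the block holds more than the threshold,
        masking later transient failures."""
        while block.reachable(up_nodes) < REPAIR_THRESHOLD:
            candidates = up_nodes - block.holders
            if not candidates:
                break  # no eligible node left to hold a new replica
            block.holders.add(random.choice(sorted(candidates)))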

We are also investigating the utility of performing maintenance continuously rather than only in response to observed failures. A continuous maintenance scheme would operate at a fixed but low rate over long periods; ideally, the rate would be set to achieve at least the same durability as a reactive scheme while spreading the work over time. This reduces burstiness in bandwidth usage, which can translate into reduced costs.
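Continuing the previous sketch, one hedged way to render continuous maintenance is a fixed per-period repair budget (an assumption of ours, for illustration): each period the system spends the same small amount of bandwidth on the least-replicated blocks, rather than copying in bursts whenever a failure is observed.

    def continuous_maintain(blocks, up_nodes, budget_bytes, block_size):
        """Spend a fixed, low repair budget each period, repairing the
        blocks with the fewest reachable replicas first. If the budget
        covers the average failure rate, long-run durability can match a
        reactive scheme, but bandwidth use is smooth rather than bursty."""
        todo = sorted(blocks, key=lambda b: b.reachable(up_nodes))
        for block in todo:
            if budget_bytes < block_size:
                break  # budget exhausted; resume next period
            if block.reachable(up_nodes) >= REPAIR_THRESHOLD:
                break  # remaining blocks already have enough replicas
            candidates = up_nodes - block.holders
            if candidates:
                block.holders.add(sorted(candidates)[0])
                budget_bytes -= block_size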

We are using DHash as the back-end for two applications with high data storage requirements. As we scale these applications up, the amount of data stored in DHash and the number of participating nodes increase as well. As a result, our implementation is constantly being adjusted, improved, and evaluated in the context of our applications.

Application: UsenetDHT -- UsenetDHT is a DHT-based replacement for the Usenet messaging system. UsenetDHT presents the existing Usenet interface to clients but reduces storage and bandwidth costs by storing articles in a DHT rather than replicating them at every server. In UsenetDHT, the bandwidth required to support reads and writes for the full Usenet feed is around 200 KB/s; simulations further suggest that maintaining the availability of this data in a relatively stable environment costs on the order of 1-2 Mbps. By contrast, a traditional Usenet server must dedicate around 20 MB/s to carry a full feed.
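The savings can be sanity-checked from the figures above with a rough back-of-the-envelope conversion (taking 1 MB/s as 8 Mbps):

    # Rough check of the figures quoted above; not measured data.
    traditional_mbps = 20 * 8        # full-feed replication: 20 MB/s
    frontend_mbps    = 0.2 * 8       # DHT reads and writes: 200 KB/s
    for maint_mbps in (1.0, 2.0):    # simulated availability maintenance
        total = frontend_mbps + maint_mbps
        print(f"UsenetDHT {total:.1f} Mbps vs traditional "
              f"{traditional_mbps} Mbps: {traditional_mbps / total:.0f}x less")

By this rough estimate, each UsenetDHT server needs somewhere between one fortieth and one sixtieth of the bandwidth of a traditional full-feed server.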

Application: OverCite -- OverCite is a new architecture for a distributed and cooperative research library based on a DHT. OverCite provides the same services as CiteSeer on a distributed set of nodes, eliminating the need for a single institution to contribute all of the resources necessary to run CiteSeer; the resource and bandwidth requirements of each participating node are significantly reduced. DHash serves as a distributed storage layer whose robust and scalable models for data management and peer communication allow the CiteSeer infrastructure to be decentralized and additional CPU and storage resources to be brought in. Besides serving as a distributed, robust archive of data, DHash simplifies the coordination of distributed activities, such as crawling. Finally, DHash acts as a rendezvous point for producers and consumers of meta-data and documents.
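As one illustration of the rendezvous role (our sketch, not the OverCite design itself): because every node derives the same DHT key from a URL, crawlers can partition work and consumers can find the resulting documents without any central coordinator.

    import hashlib

    def doc_key(url: str) -> bytes:
        """Producers (crawlers) and consumers (readers, indexers) derive
        the same DHT key from a URL, so they rendezvous on the node
        responsible for that key without a master server."""
        return hashlib.sha1(url.encode()).digest()

    def assigned_crawler(url: str, crawlers: list) -> str:
        """Deterministic partition of the crawl: every node computes the
        same assignment locally, so no coordinator hands out work."""
        key = int.from_bytes(doc_key(url), "big")
        return crawlers[key % len(crawlers)]

    print(assigned_crawler("http://example.edu/paper.pdf",
                           ["node-a", "node-b", "node-c"]))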