Entry Date:
June 7, 2013

Borealis: Second Generation Stream Processing Engine

Principal Investigator: Hari Balakrishnan

Co-investigators: Samuel Madden, Michael Stonebraker


Over the last several years, a great deal of progress has been made in the area of stream processing engines (SPEs). Several groups have developed working prototypes and many papers have been published on detailed aspects of the technology such as stream-oriented languages, basic resource management, and resource-constrained one-pass query processing. While this work is an important first step, fundamental mismatches remain between the requirements of many streaming applications and the capabilities of first-generation systems.

In the Borealis project, we identify and address the following shortcomings of current stream processing techniques:

(*) Distributed, highly-available operation: Distributed operation across a cluster of commodity machines is the most economical way of achieving high scalability and availability, two key design concerns common to many streaming domains. Furthermore, typical streaming workloads exhibit significant bursts and variations over all time-scales, requiring the ability to dynamically distribute load and deal with transient overloads. We are investigating novel distributed architectures and algorithms customized on the basis of the requirements and characteristics of stream processing applications. Examples include load distribution algorithms that are tolerant of load spikes and high availability approaches that enable parallel, low latency recovery.
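The spike-tolerant load distribution described above can be illustrated with a minimal greedy sketch: when one node's load exceeds a multiple of the cluster average, migrate its cheapest operator to the least-loaded node. All names (`rebalance`, `node_loads`, the threshold value) are illustrative assumptions, not Borealis's actual algorithm or API.

```python
def node_loads(placement, op_load):
    """Sum per-operator costs into per-node loads.

    placement: operator name -> node name
    op_load:   operator name -> load estimate (e.g. tuples/sec processed)
    """
    loads = {}
    for op, node in placement.items():
        loads[node] = loads.get(node, 0.0) + op_load[op]
    return loads

def rebalance(placement, op_load, threshold=1.5):
    """Greedy sketch of spike-tolerant load distribution: if the busiest
    node exceeds `threshold` times the average load, move its lightest
    operator to the least-loaded node (lightest first, to keep migration
    cost low). Returns a new placement; leaves the input untouched."""
    loads = node_loads(placement, op_load)
    avg = sum(loads.values()) / len(loads)
    hot = max(loads, key=loads.get)
    if loads[hot] <= threshold * avg:
        return placement  # no transient overload detected
    cold = min(loads, key=loads.get)
    ops_on_hot = [op for op, node in placement.items() if node == hot]
    mover = min(ops_on_hot, key=op_load.get)
    new_placement = dict(placement)
    new_placement[mover] = cold
    return new_placement
```

For example, if a `filter` and a `join` both sit on node A while node B runs only a light `agg`, a load spike on the join pushes A past the threshold and the cheaper filter is shipped to B. A real system would also have to move operator state and reroute streams, which this sketch omits.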

(*) Dynamic revision of query results: In many real-world streams, corrections or updates to previously processed data become available only after the fact. For instance, many popular data streams, such as the Reuters stock market feed, often include so-called revision records, which allow the feed originator to correct errors in previously reported data. Furthermore, stream sources (such as sensors), as well as their connectivity, can be highly volatile and unpredictable. As a result, data may arrive late and miss its processing window, or may be dropped temporarily due to an overload situation. In all these cases, applications are forced to live with imperfect results, unless the system has the means to revise its processing and results to take into account newly available data or updates.
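To make the revision-record idea concrete, here is a minimal sketch of an aggregate operator that can repair its output when a correction arrives: on a revision tuple, it undoes the erroneous contribution and emits the corrected aggregate. The `Tuple` and `RevisableSum` names and the single-value-per-key model are simplifying assumptions for illustration, not the Borealis revision-processing mechanism.

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    key: str                 # e.g. a stock symbol
    value: float
    revision: bool = False   # True if this corrects the last report for `key`

class RevisableSum:
    """Running sum per key that supports after-the-fact corrections:
    a revision tuple replaces the key's last-reported value and yields
    a revised aggregate downstream."""
    def __init__(self):
        self.last = {}    # key -> most recent value reported for that key
        self.total = {}   # key -> running sum over all reports

    def process(self, t: Tuple) -> float:
        if t.revision and t.key in self.last:
            # Undo the erroneous contribution before applying the correction.
            self.total[t.key] -= self.last[t.key]
        self.total[t.key] = self.total.get(t.key, 0.0) + t.value
        self.last[t.key] = t.value
        return self.total[t.key]
```

For example, after reports of 100.0 and 105.0 for a symbol, a revision tuple carrying 101.0 retracts the 105.0 report and the operator emits the corrected sum 201.0 instead of 205.0. The hard problems the sketch ignores, and that Borealis targets, are revising results that have already flowed through further downstream operators and bounding how far back revisions may reach.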

(*) Dynamic query modification: In many stream processing applications, it is desirable to change certain attributes of the query at run time. For example, in the financial services domain, traders typically wish to be alerted of interesting events, where the definition of "interesting" (i.e., the corresponding filter predicate) varies based on current context and results. In network monitoring, the system may want to obtain more precise results on a specific subnetwork if there are signs of a potential Denial-of-Service attack. Finally, in a military stream application described to us by Mitre, users wish to switch to a "cheaper" query when the system is overloaded. For the first two applications, it is sufficient to simply alter the operator parameters (e.g., window size, filter predicate), whereas the last one calls for altering the operators that compose the running query.
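The parameter-modification case (the first two applications) can be sketched as a filter operator whose predicate is swapped via a control path while the data path keeps running. The `DynamicFilter` class and its methods are hypothetical illustrations, not the Borealis API; the lock models the need to change parameters safely while tuples are in flight.

```python
import threading

class DynamicFilter:
    """Filter operator whose predicate can be replaced at run time
    without stopping the query (illustrative sketch)."""
    def __init__(self, predicate):
        self._predicate = predicate
        self._lock = threading.Lock()

    def set_predicate(self, predicate):
        # Control path: e.g. a trader redefines what counts as "interesting".
        with self._lock:
            self._predicate = predicate

    def process(self, item):
        # Data path: apply whichever predicate is currently installed.
        with self._lock:
            pred = self._predicate
        return item if pred(item) else None
```

Altering the operators themselves, as in the overload case, is harder: the system must splice a different subquery into the running dataflow and decide what to do with the state and in-flight tuples of the operators being replaced, which a parameter swap like the one above sidesteps entirely.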

(*) Flexible and highly-scalable optimization: Currently, commercial stream processing applications are popular in industrial process control, financial services, and network monitoring. Here we see a server-heavy optimization problem -- the key challenge is to process high-volume data streams on a collection of resource-rich "beefy" servers. Over the horizon, we see a very large number of applications of wireless sensor technology (e.g., RFID in retail applications, cell phone services). Here, there is a sensor-heavy optimization problem -- the key challenges revolve around extracting and processing sensor data from a network of resource-constrained "tiny" devices. Further over the horizon, we expect sensor networks to become faster and increase in processing power. In this case the optimization problem becomes more balanced, with significant processing possible at both the sensors and the servers. Thus, there will be a need for a more flexible optimization structure that can deal with a very large number of devices and perform resource management and optimization across the entire sensor-to-server spectrum.