The FlexGP Project

Principal Investigator Una-May O'Reilly

Project Website http://groups.csail.mit.edu/EVO-DesignOpt/groupWebSite/index.php?n=Site.FlexGP

In a nutshell, the goal of the FlexGP project is scalable machine learning using genetic programming (GP).

Genetic programming is a mature, robust, multi-point search technique inspired by evolution. It supports readable, flexibly specified learning representations that can readily express linear or non-linear data relationships. It is well suited to parallelization and machine learning, and it has a strong record in real-world domains.

The project designs and implements different distributed cloud models of genetic programming specialized for machine learning. Each instantiation is abstracted and programmed in such a way that it can execute on a massively parallel platform: a private cloud, a public cloud, or an open-infrastructure volunteer compute network. It can elastically solve diverse problems that, in turn, require diverse solution approaches factored along conventional and new algorithm dimensions.

FlexGP is both the name of our project and the name of systems implemented within the scope of the project. In 2011-2012 we developed a FlexGP system that uses an island model supported by Java socket communication, as well as a FlexEA library that implements large-size, single-population island GP with Hadoop MapReduce. Both execute on Amazon's EC2. Both were reported in the 2012 EvoApplications track on Parallel Architectures and Distributed Infrastructures ("EvoPAR"). Our paper, entitled "FlexGP: Genetic Programming on the Cloud", won the track's Best Paper Award. We also reported on the EC-Star platform, a massive-scale, hub-and-spoke, distributed rule-based GP classification system, at the 2012 Genetic Programming: Theory to Practice Workshop, X. FlexGP 2.0 is under development in 2012, as are advances to the EC-Star-based platform. Updates, through publications, will be available in early 2013.

(Big) Data and FlexGP -- Perspective: There's a lot of hype about the exponential growth of data. We take the growth as a given and are (obviously) interested in the opportunities BigData presents for FlexGP.

The FlexGP project stresses scalability and, in the data realm, that implies designing ML systems that can handle lots of data. In this context, we're intentionally avoiding the "BigData" buzzword. That's because we feel it's (just) a matter of scale, and we've been thinking about that all along. We feel that one important aspect of the data scaling situation in ML can be summarized as follows: Now we have too much data. How can we make sure we don't waste time looking at too much of it? Just because there are more compute resources to munch it, we shouldn't gorge! So, how can we determine when we have looked at enough of it?

These questions are essential because large quantities of data invert the old ML perspective. Before large-scale data, we worried about how we split training and test data because we didn't have enough. Now, we've got to ask: when can we shout "enough already!"? It is important to remember that ML is about generalization. We want to infer, from exemplars, properties that hold accurately in unseen data similar to our exemplars. So, we have to be a bit cautious: when a dataset is infinite in a practical sense, it may present all truth. Working in this regime is kind of counter to ML's point.

We think this context implies we need to sample intelligently. Part of that intelligence involves estimating the properties of the data we've observed and calculating how we can proceed, with known certainty, to learn with out-of-sample reliability. On the cloud-based FlexGP platform, we are investigating how to determine when and how we can be confident that the rest of the data more or less follows the properties we've observed in our sampling. Another part of the intelligence is sampling efficiently. Here, on the cloud-based platform, we are investigating sampling ideas based on distributed scaling. We will try to keep ourselves honest by insisting that we count how many times we touch the data and by trying to minimize this count without excessive loss of information. Our EC-Star platform offers a scalable answer to massive-scale data: distributed random sampling with efficient sampling coordination, which introduces more and more data only to the promising solutions identified thus far.
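The idea of showing more data only to promising solutions, while counting every touch of the data, can be illustrated with a small single-process sketch. This is not the EC-Star coordination protocol or any FlexGP code; the function name `progressive_evaluation`, its parameters (`first_batch`, `keep_fraction`), and the halving rule are all assumptions made for illustration.

```python
import random


def progressive_evaluation(candidates, data, error_fn,
                           first_batch=50, keep_fraction=0.5, seed=0):
    """Return (best_index, touches).

    touches counts every (candidate, row) evaluation, i.e. how many
    times the data was touched. All names here are hypothetical.
    """
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)                     # visit rows in random order
    alive = list(range(len(candidates)))   # indices of surviving candidates
    total_error = [0.0] * len(candidates)
    seen = 0                               # rows shown to the survivors so far
    batch = first_batch
    touches = 0
    while len(alive) > 1 and seen < len(order):
        rows = order[seen:seen + batch]
        for i in alive:
            for r in rows:
                total_error[i] += error_fn(candidates[i], data[r])
                touches += 1
        seen += len(rows)
        # Every survivor has seen exactly the same rows, so the
        # accumulated error ranks them on equal terms.
        alive.sort(key=lambda i: total_error[i])
        alive = alive[:max(1, int(len(alive) * keep_fraction))]
        batch *= 2                         # survivors earn a larger sample
    return alive[0], touches


def squared_error(slope, row):
    """Error of the toy model y = slope * x on one (x, y) row."""
    x, y = row
    return (slope * x - y) ** 2


# Toy usage: pick the slope that best fits noisy samples of y = 3x.
noise = random.Random(1)
xs = [noise.uniform(-1, 1) for _ in range(10000)]
data = [(x, 3.0 * x + noise.gauss(0, 0.1)) for x in xs]
slopes = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
best, touches = progressive_evaluation(slopes, data, squared_error)
print(slopes[best], touches, "touches vs", len(slopes) * len(data), "for a full scan")
```

Weak candidates are dropped after a small sample, so the touch count grows with the number of rounds rather than with the size of the dataset. The price is the risk of discarding a good candidate too early, which is where an estimate of certainty would come in.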

Another aspect of this research is investigating the effectiveness of massive data. In a limited-data setting, one typically incorporates certain prior assumptions and insights about the data into the model (e.g. a parametric form). In the massive-data setting, however, we can let the "data speak for itself" in order to discover truly data-driven knowledge and details beyond our insight. Thus, we are investigating novel and efficient representations and (nonparametric) modeling methods that ideally perform best in the massive (time-series) data setting. In addition, we examine the question of the bias of the model versus the bias of the data at various scales of data.

You can never have too much data, but you can waste your time looking at too much of it. FlexGP systems exploit massive data by capitalizing on their computational scale and their scaling architectures (meaning, in FlexGP's case, our cloud or commercial-volunteer client resources).