Bioinformatics, Data Mining in Biotechnology

Principal Investigator: Gregory Stephanopoulos

Bioprocess Improvement via Analysis of Historical Records
A primary concern for many industrial biotechnology companies is the difficulty of controlling and optimizing bioprocesses. Although a tremendous wealth of online process data has been collected, traditional modeling approaches have had limited success in using these data. The obstacles include: (1) many online measurements do not correspond to known key metabolites; (2) the system's internal structure is poorly or incompletely understood; and (3) information overload. In many cases, the first two obstacles alone limit most mathematical modeling, because mechanistic (first-principles) approaches cannot be used. Information overload becomes a problem with black-box algorithms such as neural networks, which may be unable to discriminate between informative and noninformative measurements.

The solution proposed in our group is to apply data mining algorithms to identify and model patterns in the data that may be correlated with process outcome. The premise is that good fermentations share intrinsic characteristics that differentiate them from bad ones. If the collected data contain these "fingerprints," the key is to identify these characteristics and determine whether they can be related to either the organism's physiology or process operation. To this end, our data mining tools include algorithms from statistics, artificial intelligence, pattern recognition, and signal processing. The emphasis is on knowledge discovery (data rich, information poor) rather than hypothesis verification (data poor, information rich), which is typical of conventional data analysis.
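As a rough illustration of separating informative from noninformative measurements, the sketch below ranks process variables by a Fisher score (squared difference of class means divided by the summed class variances). This is only one simple ranking criterion chosen for illustration, not necessarily the method used in our group. The data are synthetic, and the function names, the number of runs and variables, and the indices of the informative variables are all hypothetical.

```python
import numpy as np


def fisher_scores(X, y):
    """Fisher score per variable: between-class gap over within-class spread.

    X: (runs x variables) array of measurements.
    y: 1 for a "good" fermentation run, 0 for a "bad" one.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    good, bad = X[y == 1], X[y == 0]
    mean_gap = (good.mean(axis=0) - bad.mean(axis=0)) ** 2
    spread = good.var(axis=0, ddof=1) + bad.var(axis=0, ddof=1)
    return mean_gap / (spread + 1e-12)  # small constant avoids division by zero


def top_variables(X, y, k=3):
    """Return the indices and scores of the k highest-scoring variables."""
    scores = fisher_scores(X, y)
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


def make_synthetic_runs(n_runs=60, n_vars=20, informative=(2, 7, 11), seed=0):
    """Synthetic stand-in for historical records (an assumption, not real data).

    Every variable is unit-variance noise; only the `informative` columns
    are shifted upward in the good runs.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_runs // 2)
    X = rng.normal(size=(n_runs, n_vars))
    X[:, list(informative)] += 3.0 * y[:, None]
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_runs()
    idx, scores = top_variables(X, y, k=3)
    for i, s in zip(idx, scores):
        print(f"variable {i}: Fisher score {s:.2f}")
```

A univariate score like this ignores interactions between variables, so in practice it would serve as a first screening step before pattern-based modeling.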

Figure: Data mining results indicate that three variables may be used to discriminate among subclasses of fermentation performance.

Figure: Gene chip from Affymetrix; colored lights indicate the level of gene expression.

Gene/Protein Pattern Recognition in Bioinformatics
In recent years, the rapid development of technologies for measuring bioprocess variables and fundamental biological parameters has led to the generation of massive amounts of data. These data, in the form of nucleic acid, protein, and metabolite profiles, along with sequence data resulting from extensive genomics research, will require powerful methodologies for uncovering the critical information they contain. Upgrading the information content of such data will be the new theme of computer applications in biotechnology. New information discovered from combinations of variables will increase the acceptance of calculated results in the life sciences and shift the focus from single-parameter markers to multiple-measurement patterns as descriptors of cellular behavior.
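The shift from single-parameter markers to multiple-measurement patterns can be illustrated with a toy example. In the sketch below, a hypothetical cellular state is defined by whether one expression level exceeds another, so neither measurement alone separates the two states well, while their difference separates them completely. The data are synthetic, and the variable names (`g1`, `g2`) and function names are assumptions made for illustration only.

```python
import numpy as np


def best_threshold_accuracy(score, label):
    """Best accuracy from thresholding one score, in either direction."""
    score = np.asarray(score, dtype=float)
    label = np.asarray(label, dtype=bool)
    best = 0.0
    for t in np.unique(score):
        acc = np.mean((score >= t) == label)
        best = max(best, acc, 1.0 - acc)  # 1 - acc covers the reversed rule
    return float(best)


def compare_markers(n_samples=200, seed=0):
    """Compare two single markers with their combination on synthetic data."""
    rng = np.random.default_rng(seed)
    g1 = rng.normal(size=n_samples)  # hypothetical expression level 1
    g2 = rng.normal(size=n_samples)  # hypothetical expression level 2
    state = g1 > g2                  # state depends on the pattern, not on one gene
    return {
        "g1 alone": best_threshold_accuracy(g1, state),
        "g2 alone": best_threshold_accuracy(g2, state),
        "g1 - g2": best_threshold_accuracy(g1 - g2, state),
    }


if __name__ == "__main__":
    for name, acc in compare_markers().items():
        print(f"{name}: best single-threshold accuracy {acc:.2f}")
```

The combined descriptor is perfect here only because the state was constructed from it; real profiles would require the pattern to be discovered from the data.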