Entry Date:
July 1, 2021

Molecular Computing, Data Storage and Retrieval

Principal Investigator Mark Bathe

Project Start Date November 2021


The four-letter ATGC code of DNA in our cells encodes approximately 1 gigabyte of information per human genome, packaged neatly within the nucleus of the cell. Synthetic DNA can similarly be used as a storage medium that holds files and other data so compactly that the entire world’s information could in principle fit in the palm of a hand. However, retrieving information or files from such “pools” of DNA-encoded data is a highly non-trivial task, since this information is, in principle, unstructured and disorganized. An analogy would be finding a single page or chapter of a book from the US Library of Congress if all of its books were simply piled in the center of a football stadium. In this research area our lab is using DNA nanoparticles to organize and structure data stored in DNA. We are developing ways both to randomly access arbitrary subsets of data, ranging from 1 MB to 1 GB, from a pool of 1 exabyte (1 exabyte is 1 billion GB), and to compute on these molecular datasets, with applications ranging from machine learning to data sorting and image recognition.
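
The capacity figure above follows from simple arithmetic: four possible letters carry at most 2 bits per nucleotide. The sketch below illustrates that arithmetic with a toy byte-to-DNA encoder. The base-to-bit mapping (A=00, C=01, G=10, T=11), the function names, and the commonly cited genome length of roughly 3.1 billion base pairs are assumptions made for illustration; this is not the encoding scheme used by the lab, and practical schemes typically add error correction and sequence constraints.

```python
# Toy illustration of 2 bits per nucleotide.
# Assumed mapping (hypothetical, not the lab's scheme): A=00, C=01, G=10, T=11.
BASES = "ACGT"


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand produced by bytes_to_dna back into bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        value = 0
        for base in strand[i:i + 4]:
            # Shift in 2 bits per nucleotide.
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)


def capacity_bytes(num_bases: int) -> float:
    """Upper bound on storage: 2 bits per base, 8 bits per byte."""
    return num_bases * 2 / 8


if __name__ == "__main__":
    strand = bytes_to_dna(b"DNA")
    print(strand, dna_to_bytes(strand))
    # Assumed genome length of ~3.1 billion base pairs (approximate figure).
    print(capacity_bytes(3_100_000_000) / 1e9, "GB")
```

Under these assumptions, one copy of the genome holds about 0.8 GB, consistent with the "approximately 1 gigabyte" figure, and a 1 exabyte pool corresponds to roughly 4 × 10^18 nucleotides before any redundancy or addressing overhead.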