Entry Date:
January 25, 2017

Metadata Toolkits for Building Multi-Faceted Data-Relationship Models

Project Start Date:
October 2016

Project End Date:
September 2019

Scientific research is challenged by ever-larger, more complex data sets, stored in disparate forms in complicated repositories, which makes it difficult to discover useful content. One reason for this difficulty is the relative scarcity of 'navigational' metadata, that is, metadata that explicitly reveals the multitude of relationships between data elements. This project develops improved data management tools that allow data managers to create metadata schemas revealing the multiple and complex relationships between data elements. The team develops these tools while collaborating directly with three research communities: plasma physics (with the MIT Plasma Science and Fusion Center), ocean monitoring and modeling (with the MIT Department of Earth Sciences), and uncertainty quantification (with the University of Texas Institute for Computational Engineering and Sciences).

The project provides tools that let data managers easily develop metadata schemas that represent and expose the multiple, complex relationships among data elements, relationships that are typically not well represented in data systems. Such data elements include data source, provenance, physical properties represented in the data, data versioning, annotation threads, data dictionaries, data catalogs, and data shape (which typically determines which applications can consume or display the data), as well as larger organizational entities such as research campaigns, experimental proposals, and research products (e.g., publications, presentations, and public databases).

Schemas and data are manipulated through a Representational State Transfer (REST) application programming interface (API). Relationships among the data are represented as mathematical graph structures, all built upon a common meta-schema. The project emphasizes recording the full data lifecycle: the RESTful API and a granular uniform resource identifier (URI) schema for data objects together facilitate instrumenting complex and varied workflows. A modern web-based exploration tool is built upon these technologies in the initial application areas of plasma physics, ocean monitoring and modeling, and uncertainty quantification.

By viewing metadata and programs more generally as a collection of graphs whose nodes are the data files or records, the project creates a set of programs that can explore these graphs, making the system more general and easily extensible. In addition, because users can create data objects at any level of specificity, the graphs to which a data object belongs can be used to label object groupings. This ability to represent data relationships could be of use to a broad contingent of the scientific community across many domains.
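
The graph-based view described above can be illustrated with a minimal sketch. It is not the project's implementation: the class names, the relation names (`derived_from`, `part_of`), and the URI paths are all hypothetical, chosen only to show how relationships typed against a shared meta-schema might be stored as edges between data-object URIs and then explored generically.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str    # URI of the object the relationship starts from
    relation: str  # relationship type, drawn from the meta-schema
    target: str    # URI of the related object


@dataclass
class RelationshipGraph:
    # Hypothetical stand-in for a common meta-schema: the set of
    # relationship types that any graph is allowed to use.
    allowed_relations: frozenset
    edges: list = field(default_factory=list)

    def relate(self, source: str, relation: str, target: str) -> None:
        """Record one typed relationship between two data-object URIs."""
        if relation not in self.allowed_relations:
            raise ValueError(f"unknown relation: {relation}")
        self.edges.append(Edge(source, relation, target))

    def neighbors(self, uri: str, relation: str) -> list:
        """Targets directly related to `uri` by `relation`."""
        return [e.target for e in self.edges
                if e.source == uri and e.relation == relation]

    def reachable(self, start: str, relation: str) -> list:
        """Breadth-first walk along one relation type, e.g. a provenance chain."""
        visited = {start}
        order = []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.neighbors(node, relation):
                if nxt not in visited:
                    visited.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order

    def members(self, relation: str, target: str) -> list:
        """Objects pointing at `target` via `relation`: a labeled grouping."""
        return sorted(e.source for e in self.edges
                      if e.relation == relation and e.target == target)


# Hypothetical URIs for illustration only.
graph = RelationshipGraph(frozenset({"derived_from", "part_of"}))
graph.relate("/data/shot/1001/density", "part_of", "/campaign/2017a")
graph.relate("/data/shot/1002/density", "part_of", "/campaign/2017a")
graph.relate("/data/shot/1001/density/v2", "derived_from",
             "/data/shot/1001/density")
graph.relate("/data/shot/1001/density/v3", "derived_from",
             "/data/shot/1001/density/v2")

provenance = graph.reachable("/data/shot/1001/density/v3", "derived_from")
campaign_members = graph.members("part_of", "/campaign/2017a")
print(provenance)
print(campaign_members)
```

In this sketch the same traversal code serves every relation type, which is one way to read the claim that graph-exploring programs make the system more general; the `members` query shows how membership in a graph could label a grouping of objects, here the records belonging to one assumed research campaign.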