The Underlay Project

Project website: https://underlay.mit.edu/

The Underlay is an open, distributed knowledge store that is architected to capture, connect, and archive publicly available knowledge and its provenance. The Underlay provides mechanisms for distilling the knowledge graph from openly available publications, along with the archival and access technology to make the data and content hosted on PubPub available to other platforms.

While knowledge production is accelerating, systems for sharing and assessing knowledge are falling behind [1]. Powerful collections of machine-readable knowledge are growing in importance each year, but most are privately owned (e.g., Google’s Knowledge Graph, Wolfram Alpha, Scopus). The Underlay aims to secure such a collection as a public resource. It also gives chains of provenance a central place in its data model, to help tease out bias or error that can appear at different layers of assumption, synthesis, and evaluation.

The Underlay aggregates statements and reported observations, along with citations of who made and who published them. For example, it would not contain the bare assertion that "Sudan’s population was 39M in 2008", but rather that "Sudan’s population was 'provisionally' 39M in 2008, according to the UN’s statistics division in 2011 [2], referencing Sudan’s national census, as reported by its Central Bureau of Statistics, and as contested by the Southern People's Liberation Movement [3]." It would also include estimates from different sources and years. This information, stored in a language-independent and machine-readable form, represents relationships between these entities: Sudan, the UN, the Sudanese statistics bureau, the liberation movement, Sudan’s population and census, and the relevant publication dates. The Underlay will also store information about how these statements were recorded.
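To make the idea of a statement that carries its own provenance concrete, here is a minimal sketch in Python. It is not the Underlay's data model: the class names (Statement, Source), the role labels, and the helper agents_with_role are all hypothetical, chosen only to show how the Sudan example could be held as structured data rather than as a bare assertion.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Source:
    """One link in a chain of provenance (hypothetical structure)."""
    agent: str                  # who made, published, or contested the claim
    role: str                   # illustrative labels, not an Underlay vocabulary
    year: Optional[int] = None  # publication year, when the text gives one


@dataclass(frozen=True)
class Statement:
    """A reported observation together with who said it (hypothetical)."""
    subject: str
    predicate: str
    value: int
    as_of: int
    qualifier: Optional[str] = None
    provenance: Tuple[Source, ...] = ()


# The example from the text, expressed as data instead of a bare assertion.
sudan_population = Statement(
    subject="Sudan",
    predicate="population",
    value=39_000_000,
    as_of=2008,
    qualifier="provisional",
    provenance=(
        Source("UN Statistics Division", "published_by", 2011),
        Source("Sudan national census", "references"),
        Source("Central Bureau of Statistics (Sudan)", "reported_by"),
        Source("Southern People's Liberation Movement", "contested_by"),
    ),
)


def agents_with_role(statement: Statement, role: str) -> list:
    """Return the agents that appear in the provenance chain with a given role."""
    return [s.agent for s in statement.provenance if s.role == role]


if __name__ == "__main__":
    print(agents_with_role(sudan_population, "contested_by"))
```

Keeping the provenance chain inside the statement means a reader of the data can always ask who reported a figure and who disputes it, which a bare (subject, predicate, value) triple cannot answer.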

While much knowledge is uncontested, the Underlay stores contested or contradictory statements, along with detailed context and chains of provenance. Evaluations of fidelity or accuracy can make use of this information, and can themselves be stored in other layers. The focus on provenance and iteration supports refinement, revision, and replication of observations. The structured granularity enables alignment of unrelated datasets, bulk analysis, and machine learning.
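One way to picture contradictory statements living side by side, with evaluations kept in a separate layer, is sketched below in Python. Everything here is an assumption made for illustration: the Claim class, the grouping functions, the "evaluations" mapping, and the placeholder entities, sources, and values are invented and do not come from the Underlay.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """A single sourced claim (hypothetical structure)."""
    subject: str
    predicate: str
    value: float
    source: str


def group_claims(claims):
    """Group claims by (subject, predicate) without discarding any of them."""
    grouped = defaultdict(list)
    for claim in claims:
        grouped[(claim.subject, claim.predicate)].append(claim)
    return dict(grouped)


def contested(grouped):
    """Keep only the groups whose sources disagree on the value."""
    return {
        key: group
        for key, group in grouped.items()
        if len({claim.value for claim in group}) > 1
    }


# Placeholder data: entities, sources, and values are invented for the sketch.
claims = [
    Claim("entity-1", "height_m", 10.0, "source-A"),
    Claim("entity-1", "height_m", 12.0, "source-B"),
    Claim("entity-2", "height_m", 5.0, "source-A"),
]

# Evaluations live in their own layer, keyed by claim, so the stored
# claims themselves are never overwritten or removed.
evaluations = {claims[0]: "replicated", claims[1]: "unverified"}

if __name__ == "__main__":
    for key, group in contested(group_claims(claims)).items():
        for claim in group:
            print(key, claim.value, claim.source, evaluations.get(claim))
```

The point of the design is that disagreement is data: both values for entity-1 stay in the store, and a judgment about which to trust is recorded separately so that it can be revised without touching the underlying claims.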

The Underlay team is developing the protocols, first instances, and governing rules of this knowledge graph. Information will be added at first by building focused, interpretive overlays -- knowledge curated for a particular audience. Overlays could, for instance, be journals, maps, or timelines, incorporating many sources of more granular information into a single lens [4].

At first, every full node will maintain a copy of the Underlay, and copies will be stored on distributed filesystems. We are seeking partners and funders to help support the initial stage of this work, and plan to form a small foundation to oversee governance and protocol maintenance.

Phase 1 -- A proof-of-concept with two different communities of active users, supporting overlays for technical papers building on past research: a prior-art archive and a repository of academic papers [5]. We are recruiting companies and universities to contribute collections for this work. We will work with other protocol layers to define the initial federation model, and decentralize the Underlay with IPFS. (Summer 2018)

Phase 2 -- A network of Underlay nodes at different institutions, demonstrating local vs. global updating. An initial pipeline for extracting structured knowledge and sources from documents to populate lower layers. Tools to sync with existing structured repositories such as Wikidata, Freebase, and SHARE, and tools to visualize what is in the Underlay and how it is being used.