Entry Date:
January 25, 2017

Theoretic Structures of High Dimensional Data Decomposition

Principal Investigator Lizhong Zheng

Project Start Date September 2016

Project End Date August 2018

This project aims to develop an information-theoretic view of dimension reduction and feature selection. Feature selection is a core issue in processing high-dimensional data. It is difficult from an information-processing point of view mainly because reducing high-dimensional data to a lower-dimensional feature space generally incurs irreversible information loss. In this work, the problem is formulated as a general lossy information processing problem. The solutions to this problem are efficient algorithms that can be used to choose informative features that are universally relevant to a family of inference tasks.

The goal of a general theoretical framework for this problem is to develop a systematic understanding of the wide variety of existing practical solutions and a uniform way to compare their performance. The main technical merit lies in a new operational meaning of information metrics, which connects a large body of research in information theory to the challenges of high-dimensional data analytics. A new geometric analysis approach is used in this work, which helps to visualize the problem of feature selection and links it to the well-studied concept of the Hirschfeld-Gebelein-Rényi maximal correlation.
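
For readers unfamiliar with the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, the sketch below may help make the concept concrete. It is not the project's algorithm. It relies on a standard characterization for finite alphabets: the HGR maximal correlation of X and Y equals the second-largest singular value of the matrix with entries P(x, y) / sqrt(P(x) P(y)). The function name `hgr_maximal_correlation`, the iteration count, and the example distribution are illustrative assumptions, and the power iteration is a simple stand-in for a proper SVD routine.

```python
import math


def hgr_maximal_correlation(joint, iters=500):
    """Estimate the HGR maximal correlation of a discrete joint pmf.

    joint[i][j] = P(X = i, Y = j). All marginal probabilities are
    assumed to be strictly positive (hypothetical helper, not from the source).
    """
    nx, ny = len(joint), len(joint[0])
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(nx)) for j in range(ny)]

    # The matrix P(x, y) / sqrt(P(x) P(y)) always has top singular value 1
    # with singular vectors sqrt(P(x)) and sqrt(P(y)). Subtracting that
    # rank-one term leaves a matrix whose largest singular value is the
    # HGR maximal correlation.
    b = [
        [
            joint[i][j] / math.sqrt(px[i] * py[j]) - math.sqrt(px[i] * py[j])
            for j in range(ny)
        ]
        for i in range(nx)
    ]

    # Power iteration on b^T b with a fixed, non-constant start vector.
    # A random start would be more robust for unusual inputs.
    v = [1.0 + 0.1 * j for j in range(ny)]
    for _ in range(iters):
        u = [sum(b[i][j] * v[j] for j in range(ny)) for i in range(nx)]
        w = [sum(b[i][j] * u[i] for i in range(nx)) for j in range(ny)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w < 1e-15:
            # The deflated matrix is numerically zero: X and Y are independent.
            return 0.0
        v = [x / norm_w for x in w]

    # With v normalized, the singular value is the norm of b v.
    u = [sum(b[i][j] * v[j] for j in range(ny)) for i in range(nx)]
    return math.sqrt(sum(x * x for x in u))


# Example: uniform binary input through a binary symmetric channel with
# crossover probability 0.1. The maximal correlation is 1 - 2 * 0.1.
bsc = [[0.45, 0.05], [0.05, 0.45]]
print(round(hgr_maximal_correlation(bsc), 6))  # approximately 0.8
```

In this picture, the singular vectors associated with the returned value correspond to the pair of feature functions of X and Y that are most correlated, which is one way to read "informative features" geometrically.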

The key advantage of the proposed approach is its generality. It can be applied to any type of data, incorporate prior knowledge and side information, connect multiple platforms, respect computation and storage constraints, adapt to time variations, and so on, all based on the same theoretical principle. It is envisioned that such universality would lead to architectural changes in the area of data analysis, with a universal interface that separates the task of a data scientist, in information extraction, from the task of a specialist with domain knowledge, in collecting the data, providing the models, and interpreting the results.