Dr. Kalyan Veeramachaneni

Principal Research Scientist

Primary DLC

Laboratory for Information and Decision Systems

MIT Room: 32-D780

Areas of Interest and Expertise

Big Data
Machine Learning
Data-Driven Healthcare
Data-Driven Education

Research Summary

During the past three years, Dr. Veeramachaneni has set out to answer a seemingly simple question: Why does it take so much time to process, analyze, and derive insights from data? He has ventured into a number of domains and identified critical issues at the foundation of the way we interact with data, work around barriers, and materialize insights from it. Consequently, he has founded multiple long-term projects with the vision of making human interaction with data easier so that insights can be derived faster. Simply scaling machine learning approaches was not enough; novel approaches and systems were required. These novel methods include scaling processes that have a "human in the loop," identifying intermediate pre-processed data structures for reuse, and creating interfaces to exploit such intermediate structures. Ultimately, this has led Veeramachaneni to design approaches and methods for automating much of the data science endeavor.

Dr. Veeramachaneni leads a group called Data to AI. The group focuses on big data science and machine learning and comprises 20 members: postdoctoral fellows, graduate students, and undergraduate students. Current projects include:

(*) Deep Mining: A large-scale, self-learning data science platform

(*) Gigabeats: A large-scale machine learning platform for physiological data mining

(*) MOOCdb: Advancing MOOC data science through collaborative frameworks

(*) Feature factory: Crowdsourcing feature discovery

(*) Bring your own learner (BYOL): Democratizing cloud use for machine learning

(*) MOINC: Machine learning on BOINC

(summary updated 7/2017)

Recent Work

  • Video

    2020 Digital Transformation - Kalyan Veeramachaneni

    April 27, 2020 | Conference Video | Duration: 61:14
    When my entire team transitioned to working from home for the first time, we were able to figure out most things — like how to hold virtual meetings and get people the technical resources they needed — fairly quickly. But one thing was more difficult: maintaining access to data. For privacy reasons, much of the data we use lives in a secure machine. We can’t put it on the cloud or distribute it publicly. What happens if this machine crashes, and we’re not on campus to fix it? Although this particular situation is unique, it highlights something we’ve been thinking about for years: the importance of synthetic data. Synthetic data is generated from a statistical model in such a way as to emulate important properties of the real data. Over the past 5 years, our group has been working on creating synthetic data generators. In this webinar, we will present our current progress in this regard, explore some use cases in which synthetic data is helpful, and explain how you can use synthetic data to scale your digital operations.
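    The idea of generating data from a statistical model so that it emulates properties of the real data can be illustrated with a minimal sketch. This is not the group's generator: it assumes a purely numeric table and swaps in the simplest possible model, a multivariate Gaussian fitted with NumPy. The function names and the toy "real" table are hypothetical, and real tables with categorical columns, skewed distributions, or privacy constraints would need a richer model.

```python
import numpy as np

def fit_gaussian(real):
    """Estimate the column means and covariance matrix of the real rows."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov

def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw new rows from the fitted model; no real row is copied."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

# Hypothetical stand-in for a private two-column numeric table.
toy_rng = np.random.default_rng(42)
real = toy_rng.multivariate_normal(
    [50.0, 120.0], [[25.0, 15.0], [15.0, 100.0]], size=5000
)

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000)

# The synthetic table should roughly preserve means and correlation.
print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
print("real corr:      ", np.corrcoef(real, rowvar=False)[0, 1])
print("synthetic corr: ", np.corrcoef(synthetic, rowvar=False)[0, 1])
```

    Only the fitted parameters (here, `mean` and `cov`) are needed to produce new rows, which is why a generator of this kind can be shared or run elsewhere when the original data cannot leave a secure machine.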

    Kalyan Veeramachaneni - RD2017

    November 22, 2017 | Conference Video | Duration: 31:48

    Build AI products faster, cheaper

    Artificial intelligence is being embedded into products to save people time and money. Experts in many domains, from medicine to education to navigation, have already begun to see the results. But these products are built using an army of data scientists and machine learning experts, and the rate at which these human experts can deliver results is far lower than the current demand. My lab at MIT, called Data to AI, wanted to change this. Recognizing the human bottleneck in creating these systems, a few years ago we launched an ambitious project: we decided “to teach a computer how to be a data scientist.” Our goal was to create automated systems that can ask questions of data, come up with analytic queries that could answer those questions, and use machine learning to solve them; in other words, all the things that human data scientists do. After much research and experimentation, the systems we have developed now allow us to build end-to-end AI products that can solve a new problem in one day. In this talk, I will cover what these new technologies are, how we are using them to accelerate the design and development of AI products, and how you can take advantage of them to build AI products faster and cheaper.

    2017 MIT Research and Development Conference

    Kalyan Veeramachaneni - 2016-Digital-Health_Conf-videos

    September 14, 2016 | Conference Video | Duration: 36:36

    Rapid Discovery of Predictive Models from Large Repositories of Signals Data

    This talk focuses on the methods and technologies that answer the question ‘Why does it take a long time to process, analyze, and derive insights from data?’ Dr. Veeramachaneni leads the ‘Human Data Interaction’ project, which develops methods at the intersection of data science, machine learning, and large-scale interactive systems. Given the significant achievements in storage, processing, retrieval, and analytics, the answer to this question now lies in developing technologies based on an intricate understanding of how scientists, researchers, and analysts interact with data to analyze it, interpret it, and derive models from it. In this talk, Dr. Veeramachaneni will present how his team is building systems to transform this interaction for the signals domain, using physiological signals as an example. Prediction studies on physiological signals are time-consuming: a typical study, even with a modest number of patients, usually takes 6 to 12 months.

    In this talk, he will describe BeatDB, a large-scale machine learning and analytics framework that scales and speeds up the mining of predictive models from these waveforms. BeatDB radically shrinks the time an investigation takes by (a) supporting fast, flexible investigations through multi-level parameterization; (b) allowing the user to define the condition to predict, the features, and many other investigation parameters; and (c) pre-computing beat-level features that are likely to be used frequently, while computing less frequently used features and statistical aggregates on the fly.
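    The trade-off in point (c), paying once for frequently used beat-level features and computing the rest on demand, can be sketched as follows. This is an illustration only and not BeatDB's actual API or design: the class `BeatFeatureStore`, its methods, the representation of a beat as a short list of samples, and the toy feature extractors are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Beat = Sequence[float]  # hypothetical: one beat as a list of samples

class BeatFeatureStore:
    """Precompute chosen beat-level features; compute others on request."""

    def __init__(self, beats: List[Beat],
                 extractors: Dict[str, Callable[[Beat], float]],
                 precompute: List[str]):
        self.beats = beats
        self.extractors = extractors
        # Features expected to be requested often are computed once, up front.
        self.precomputed = {
            name: [extractors[name](b) for b in beats] for name in precompute
        }
        self.on_the_fly_calls = 0

    def feature(self, name: str) -> List[float]:
        """Return one value per beat, from the store if available."""
        if name in self.precomputed:
            return self.precomputed[name]
        # Less frequently used features are computed only when asked for.
        self.on_the_fly_calls += 1
        return [self.extractors[name](b) for b in self.beats]

    def aggregate(self, name: str, window: int,
                  stat: Callable[[List[float]], float]) -> List[float]:
        """Apply a statistic over non-overlapping windows of beats."""
        values = self.feature(name)
        return [stat(values[i:i + window])
                for i in range(0, len(values) - window + 1, window)]

# Toy beats and extractors, invented for this example.
beats = [[0.1, 0.9, 0.2], [0.0, 1.1, 0.3], [0.2, 0.8, 0.1], [0.1, 1.0, 0.2]]
extractors = {
    "peak": max,
    "mean": lambda b: sum(b) / len(b),
    "range": lambda b: max(b) - min(b),
}

store = BeatFeatureStore(beats, extractors, precompute=["peak", "mean"])
peak_windows = store.aggregate("peak", window=2, stat=max)  # served from the store
range_values = store.feature("range")                        # computed on the fly
print(peak_windows, store.on_the_fly_calls)
```

    In this sketch the user-facing parameters (which feature, which window size, which statistic) can change between investigations without recomputing the precomputed features, which is the kind of reuse the abstract describes.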

    2016 MIT Digital Health Conference

    Kalyan Veeramachaneni - 2016-ICT-Conference

    April 27, 2016 | Conference Video | Duration: 38:38

    Teaching a Computer to be a Data Scientist

    In recent years, great strides have been made to scale and automate Big Data collection, storage, and processing, but deriving real insight through relational and semantic data analysis still requires time-consuming guesswork and human intuition. Now, novel approaches designed across domains (education, medicine, energy, and others) have helped identify foundational issues in general data analysis, providing the basis for developing a “Data Science Machine,” an automated system for generating predictive models from raw data.

    2016 MIT Information and Communication Technologies Conference