Entry Date:
July 8, 2008

Technology Forecasting Using Data Mining and Semantics Systems


The planning and management of research and development activities is a challenging and data-intensive task. Research managers often rely on intuition and domain knowledge to reach management decisions. For example, funding decisions for NSF and NIH grant schemes are made via peer review. While the role of expert opinion in decision making cannot be discounted, relying on it has a number of problems:

(*) Expert decisions are subjective.
(*) Suitably qualified experts are difficult to identify and costly to engage.
(*) The reasons for decisions, and the contexts in which they are made, are difficult to record.

This project uses data mining and context mediation tools to enrich this process with empirical information.

Approach:

(*) Apply technology-mining methods to study, visualize and ultimately predict the evolution of technological growth.
(*) Use a novel combination of data-mining and context-mediation techniques for improved performance.
(*) Conduct a detailed case study in renewable energy technologies.

The focus so far has been on the development of tools which will support the execution of the rest of the project. In particular, two main issues need to be addressed:

(*) The identification of databases suitable for conducting preliminary investigations. While the long-term aim is to study data extracted from a variety of sources, the initial focus will be on academic databases and blog search engines. We have reviewed and created tools for working with a number of online sites, such as Google Scholar, Scirus, SpringerLink, ScienceDirect, ACM Portal, IEEE Xplore and Google Blogs.
(*) The development of two classes of computer programs/tools. Both are being actively pursued, and working versions are already being used to analyze real data from the web (though development will continue throughout the project):

(1) Tools for automating the extraction of publication information from online databases. A series of small programs has been developed in the Python programming language to extract publication counts automatically from the databases listed above.
(2) Tools for analyzing the extracted data in terms of widely used growth models. The objective is to allow the rate of growth of technologies to be reliably measured, so that growth potential can be estimated.
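
To make the growth-model step in item (2) concrete, the following is a minimal sketch rather than the project's actual tooling. It assumes a logistic curve as the growth model (the text does not name the models used) and assumes the input is a series of cumulative yearly publication counts, all greater than zero. The function names, the candidate ceilings and the synthetic data are hypothetical and serve only as illustration. For a fixed ceiling K the logistic curve becomes a straight line after a log transform, so the growth rate r and midpoint year t0 follow from ordinary least squares; the sketch simply tries several ceilings and keeps the one with the smallest squared error.

```python
import math


def logistic(t, K, r, t0):
    """Logistic growth curve: K / (1 + exp(-r * (t - t0)))."""
    return K / (1.0 + math.exp(-r * (t - t0)))


def fit_logistic(years, cumulative_counts, k_candidates):
    """Estimate (K, r, t0) by a grid search over the ceiling K.

    For a fixed K, ln(K / y - 1) = -r * (t - t0) is linear in t,
    so r and t0 are obtained by ordinary least squares.
    All counts must be strictly positive.
    """
    n = len(years)
    mean_t = sum(years) / n
    sxx = sum((t - mean_t) ** 2 for t in years)
    best = None
    for K in k_candidates:
        if K <= max(cumulative_counts):
            continue  # the ceiling must lie above every observation
        z = [math.log(K / y - 1.0) for y in cumulative_counts]
        mean_z = sum(z) / n
        sxz = sum((t - mean_t) * (zi - mean_z) for t, zi in zip(years, z))
        slope = sxz / sxx
        intercept = mean_z - slope * mean_t
        r = -slope
        if r == 0:
            continue  # a flat line carries no growth information
        t0 = intercept / r  # z = -r*t + r*t0, so the intercept equals r*t0
        sse = sum((logistic(t, K, r, t0) - y) ** 2
                  for t, y in zip(years, cumulative_counts))
        if best is None or sse < best[0]:
            best = (sse, K, r, t0)
    if best is None:
        raise ValueError("no candidate ceiling exceeds the observed counts")
    return best[1], best[2], best[3]


# Synthetic, noise-free data (hypothetical, not real publication counts).
years = list(range(1990, 2009))
counts = [logistic(t, 1000.0, 0.5, 2000.0) for t in years]

K_hat, r_hat, t0_hat = fit_logistic(years, counts, range(600, 2001, 100))
print(K_hat, round(r_hat, 3), round(t0_hat, 1))
```

In this reading, r measures how fast a topic is growing, and the gap between the estimated ceiling K and the latest count gives a rough indication of remaining growth potential. Real counts are noisy, so a fit like this would need a finer grid or a proper nonlinear least-squares routine, together with some check on how well the model describes the data.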

As an initial pilot study, our current experiments are based on the following ten energy-related topics: combustion, coal, battery, petroleum, fuel cells, wastewater, heat pump, engine, solar cells and power systems.

Initial results indicate:

(1) Data from online databases are noisy and can yield inconsistent findings, even though the sources indexed by these sites may overlap significantly.
(2) In particular, it is very difficult to compare results from academic and blog databases, as information from these sources appears over very different time-scales.

Directions for the immediate future will revolve around improving the model estimation stages, studying the relationship between model parameters and growth potential, and developing methods for combining the results from multiple sites.