Entry Date:
February 19, 1999

Digitizing Cities: MIT City Scanning Project -- Fully Automated Model Acquisition in Urban Areas

Principal Investigator: Fredo Durand


The "computer vision" or "machine vision" problem is a long-standing, difficult problem. Fundamentally, it is: how can a computer algorithm, alone or in concert with a human, produce a useful computational representation of a scene, given one or more images of that scene? Fully hundreds of researchers have developed machine vision algorithms over several decades. However, they have employed widely different definitions of the problem, leading to an enormous variety of problem statements and proposed solutions, and even degrees of automation of proposed solutions.

Researchers have developed algorithms to recognize common patterns (e.g., road networks) or objects (e.g., vehicles) in images, or identify spatial or other relationships among such entities. Others have sought to deduce physical structure from images; for example, the topography of some imaged land area. Still others have used machine vision techniques to monitor automated processes (e.g., robotic manufacturing systems, assembly lines) and human activities (e.g., immigration areas, mall parking lots). Machine vision systems have been incorporated into autonomous and semi-autonomous vehicles, such as cars and helicopters.

In all of these contexts, the machine vision problem has proved difficult from both theoretical and practical standpoints. Images are fundamentally ambiguous: they flatten a fully 3D, frameless, temporally continuous world into two dimensions, freeze it in time, and clip it to fit within a frame. Cameras, whether film or digital, have low resolution, a small field of view, and limited dynamic range and lightness and color discrimination, compared to the complexity of surface detail, lighting variation, and dynamic range of reflected illumination found in the physical world. Lighting conditions in the real world are enormously complex, as are the surface appearances of even common objects. The usual difficulties of implementing robust numerical and geometric algorithms (ill-conditioned and/or degenerate data) only compound the challenge of realizing working machine vision algorithms.
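This ambiguity can be made concrete with a minimal sketch, assuming an ideal pinhole camera (the code and its names are illustrative, not part of any system described here): every 3D point along a ray through the optical center projects to the same pixel, so a single image cannot recover depth.

    # Depth ambiguity under an ideal pinhole camera: distinct 3D points on
    # one viewing ray project to the same image location.
    def project(point, f=1.0):
        """Perspective projection of (x, y, z) onto the image plane z = f."""
        x, y, z = point
        return (f * x / z, f * y / z)

    near = (1.0, 2.0, 4.0)
    far = (2.0, 4.0, 8.0)   # same viewing ray, twice the depth

    assert project(near) == project(far)   # depth is lost in projection
    print(project(near))                   # (0.25, 0.5)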

To combat noise and overcome ambiguities in the input, machine vision systems often include humans in the loop, for example to initialize optimization routines or to supply crucial identifications of, and matches between, image features. This inclusion of human capabilities can produce visually impressive geometric models. However, these models are useful only from a restricted set of viewpoints! The most obvious example of this is viewing distance: one cannot synthesize a (faithful) close-up view of an object unless one has observed it at close range. Supporting an unbiased choice of viewpoint therefore requires viewing every scene surface at close range. Similar arguments can be made for view direction (to capture non-diffuse reflectance) and occlusion (to capture occluded portions of the surface). We seek to produce view-independent models, and so must acquire many thousands of images; this is far too many for a human operator to manipulate.
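The scale of the image budget this implies can be estimated with a hedged back-of-envelope calculation; every constant below is an illustrative assumption, not a measured project parameter.

    # Rough estimate of the images a view-independent model demands.
    facade_area = 1_000_000.0   # m^2 of building surface over ~1 km^2 (guess)
    area_per_image = 100.0      # m^2 of facade covered per close-range image (guess)
    directions_per_patch = 3    # view directions needed per surface patch (guess)
    overlap_factor = 2.0        # extra images for overlap and occlusions (guess)

    images = facade_area / area_per_image * directions_per_patch * overlap_factor
    print(f"~{images:,.0f} images")   # ~60,000 -- far beyond manual handling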

Another consequence of incorporating a human in the loop is that it becomes harder to draw conclusions about what the algorithmic components of the system are capable of, in contrast to the system considered as a whole. Thus there is difficulty inherent not only in defining the problem, but also in solving it and in evaluating any proposed solution!

The state of the art is this. Automated 3D reconstruction systems have been demonstrated at small (sub-meter) scales, under controlled lighting conditions (e.g., indoors on a lab bench), for two or more images and one or a small number of objects. Until now, for example, no researcher has demonstrated an algorithm that registers, by tracking or any other means, images acquired from positions extending completely around one or more buildings. When studied closely, algorithmic systems exhibit undesirable asymptotic behavior; for example, many proposed algorithms check all pairs of images for correlation, expending O(n^2) computation time in the number of images. This imposes a strict scaling limit on the data complexity of the inputs that can be processed. Human-assisted reconstruction systems have been used to recover one or a few buildings (or portions of buildings), but the human is called upon for significant manual intervention: initializing camera positions and block models of the scene, supplying low-level feature or block correspondences, and indicating figure and ground (subject and occlusion). Because a human operator's capacity for work is limited, this too imposes a strict scaling limit on the complexity of the inputs (and outputs) that a human-assisted system can handle.
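The pairwise-correlation bottleneck is easy to make concrete. The sketch below contrasts exhaustive O(n^2) matching with a grid-based prune that compares only physically nearby cameras; the helper names are hypothetical, and the grid is a stand-in for whatever pose-indexed culling a scalable system would actually use.

    from collections import defaultdict

    def all_pairs(n):
        """Exhaustive matching: every image pair is a candidate -- O(n^2)."""
        return n * (n - 1) // 2

    def nearby_pairs(poses, radius=25.0):
        """Compare each camera only with cameras in its own or an adjacent
        grid cell; with bounded camera density this is O(n * k), small k."""
        grid = defaultdict(list)
        for idx, (x, y) in enumerate(poses):
            grid[(int(x // radius), int(y // radius))].append(idx)
        pairs = set()
        for (cx, cy), members in grid.items():
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    for i in members:
                        for j in grid.get((cx + dx, cy + dy), []):
                            xi, yi = poses[i]
                            xj, yj = poses[j]
                            if i < j and (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                                pairs.add((i, j))
        return pairs

    poses = [(10.0 * k, 0.0) for k in range(1000)]   # 1000 cameras along one street
    print(all_pairs(len(poses)))     # 499500 candidate pairs, exhaustive
    print(len(nearby_pairs(poses)))  # 1997 pairs within 25 m of each other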

Until now, whether automated or human-assisted, no acquisition system has been demonstrated to scale to spatially extended, complex environments, under uncontrolled lighting conditions (i.e., outdoors) and in the presence of severe clutter and occlusion. The goal of our research and engineering efforts has been to design and demonstrate such a system by combining theoretical advances (new insights and techniques) with solid systems engineering (new sensors and plenty of visualization and validation) and assumptions about the environment (in our case, that urban scenes exhibit some regular geometric structure). We have developed a working, fully-automated prototype system and are now demonstrating it at various scales -- from a few buildings to several hundred, over a square kilometer -- on the MIT campus.

We chose to state our problem as simply as possible. Given images of an urban environment, we wish to produce a textured geometric CAD model of the structures in that environment. This model should include geometric representations of each structure and structural feature present, and radiometric information about the components of each such entity (e.g., a texture map or bidirectional reflectance distribution function for each surface). One might think of each image as a 2D "observation" of the world; we wish to produce a single, textured 3D CAD model that is consistent with the totality of our observations. Moreover, we wish the model to be accurate to a few centimeters over thousands of square meters of ground area.
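One standard way to make "consistent with the totality of our observations" precise (a common formulation, not necessarily the one our system uses internally) is as a reprojection criterion: every model point, reprojected through each camera that observed it, should land near the pixel where it actually appeared. The sketch below assumes simplified calibrated pinhole cameras; all names are illustrative.

    import math

    def reproject(camera, point):
        """Map a 3D world point into a camera's image (pinhole model).
        camera = (R, t, f): 3x3 rotation rows, camera center, focal length."""
        R, t, f = camera
        xc = [sum(R[r][c] * (point[c] - t[c]) for c in range(3)) for r in range(3)]
        return (f * xc[0] / xc[2], f * xc[1] / xc[2])

    def total_reprojection_error(cameras, points, observations):
        """observations[(i, j)] = pixel at which camera i saw model point j."""
        err = 0.0
        for (i, j), (u, v) in observations.items():
            pu, pv = reproject(cameras[i], points[j])
            err += math.hypot(pu - u, pv - v)
        return err

    # A perfectly consistent model reprojects with zero error.
    I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    cams = [(I3, (0.0, 0.0, 0.0), 1.0)]
    pts = [(0.5, 0.5, 2.0)]
    obs = {(0, 0): (0.25, 0.25)}
    print(total_reprojection_error(cams, pts, obs))   # 0.0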