Entry Date: September 13, 2016

Drinking from the Visual Firehose: High-Frame-Rate, High-Resolution Computer Vision for Autonomous and Assisted Driving

Principal Investigator: Saman Amarasinghe

Co-investigators: Fredo Durand, John Leonard


Perception is a critical component of autonomous and assisted driving. Enormous amounts of data from cameras and other optical sensors (e.g., Lidar) must be processed in real time and with as low a latency as possible, so that cars can not only localize themselves but also react in a fraction of a second to unexpected situations. Reaching this level of performance has traditionally required a large implementation effort: taking full advantage of modern processing units means manually extracting parallelism (multicore, vectorization, GPU threads) and exploiting the cache hierarchy. At best this consumes substantial implementation time and resources from robotics researchers and practitioners; at worst it leads to significantly sub-optimal performance. The two orders of magnitude between a direct implementation and a highly optimized one can mean the difference between a perception system that runs at 10 Hz and one that runs at 1,000 Hz. In robotics in general, and in high-stakes settings such as intelligent autonomous or assisted driving in particular, it is well accepted that the speed and latency of the control loop are critical to robustness. This work can dramatically simplify the development of high-performance perception code, with huge potential payoffs for research in robotic vehicles.

We plan to introduce the Halide programming language as the base language for developing all of the image processing and most of the computer vision algorithms needed in autonomous and assisted driving. Image processing and computer vision pipelines combine the challenges of stencil computations and stream programs: they are composed of large graphs of different stencil stages, as well as complex reductions, sampling, feature recognition, and stages with global or data-dependent access patterns. Halide enables practitioners to write much simpler and more modular programs. Halide decouples the algorithm, which describes the computation, from the schedule, which describes when, where, and how to perform that computation. Thus, programmers can write the algorithm once and easily iterate over different complex schedules until they find the best optimization. Halide has been shown to deliver performance often an order of magnitude faster than the best prior hand-tuned C, assembly, and CUDA implementations of image processing pipelines. Halide programs are also portable across radically different architectures, from ARM mobile processors to massively parallel GPUs, by making changes only to the schedule, whereas traditionally optimized implementations are highly specific to a single target architecture. They are also modular and composable, where traditional implementations must fuse many operations into a monolithic whole for performance. We therefore believe that moving an optimized image-vision pipeline from a vehicle simulator to an actual vehicle for testing or production will be fast and simple.
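To make the algorithm/schedule decoupling concrete, the sketch below shows the well-known separable 3x3 blur example from the Halide literature, written in C++ (Halide's host language). The variable names and the particular CPU schedule are illustrative; retargeting the pipeline (e.g., to a GPU) would replace only the scheduling statements, leaving the algorithm untouched.

#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);    // a 16-bit grayscale frame
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    // Algorithm: what is computed (a separable 3x3 box blur).
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: when, where, and how it is computed (tiled, vectorized,
    // multicore). Only these lines change when retargeting the pipeline.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);

    // Compile the pipeline; realize() on a bound input would execute it.
    blur_y.compile_jit();
    return 0;
}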