VISTA: VIrtual STereo based Augmentation for Depth Estimation in Automated Driving

Abstract

Depth estimation is a primary task for automated vehicles to perceive the 3D environment. The classical approach leverages stereo cameras mounted on the vehicle, which provide accurate and robust depth but require a more expensive setup and careful calibration. Recent work therefore focuses on learning depth from monocular videos. These approaches need only a simple setup but can be vulnerable to occlusions or lighting changes in the scene. In this work, we propose a novel idea that exploits the fact that data collected by large fleets naturally contains scenarios where vehicles with monocular cameras drive close to each other and observe the same scene. Our approach combines the monocular views of the ego vehicle and a neighboring vehicle to form a virtual stereo pair during training, while still requiring only a monocular image at inference. With such a virtual stereo view, we train self-supervised depth estimation with two sources of constraints: 1) the spatial and temporal constraints between sequential monocular frames, and 2) the geometric constraints between the frames of the two cameras that form the virtual stereo pair. No public dataset contains multiple vehicles sharing a common view that could form virtual stereo pairs, so we also create a synthetic dataset using the CARLA simulator in which multiple vehicles observe the same scene at the same time. Our evaluation shows that the virtual stereo approach improves the ego vehicle's depth estimation accuracy by 8% compared to approaches that use monocular frames only.
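The training objective can be pictured as the sum of the two constraint families described above. Below is a minimal PyTorch-style sketch of that idea; names such as `photometric_loss`, `warp_temporal`, and `warp_virtual_stereo` are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
import torch

# Hypothetical sketch of the two self-supervised constraints described
# in the abstract. All function names here are illustrative placeholders.

def photometric_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Simple L1 photometric error; self-supervised pipelines commonly
    # combine this with an SSIM term, omitted here for brevity.
    return (pred - target).abs().mean()

def virtual_stereo_loss(depth, ego_prev, ego_curr, ego_next, neighbor_frame,
                        warp_temporal, warp_virtual_stereo, stereo_weight=1.0):
    # 1) Spatial/temporal constraint: adjacent monocular frames are warped
    #    into the current view using the predicted depth and ego-motion.
    loss_temporal = 0.5 * (
        photometric_loss(warp_temporal(ego_prev, depth), ego_curr)
        + photometric_loss(warp_temporal(ego_next, depth), ego_curr)
    )
    # 2) Geometric constraint from the virtual stereo pair: the neighboring
    #    vehicle's frame is warped into the ego view using the predicted
    #    depth and the relative pose between the two cameras.
    loss_stereo = photometric_loss(
        warp_virtual_stereo(neighbor_frame, depth), ego_curr
    )
    return loss_temporal + stereo_weight * loss_stereo
```

The neighbor's frame and both warping functions are needed only during training; at inference the depth network sees a single monocular image.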

Publication
In Machine Learning for Autonomous Driving, NeurIPS 2022 Workshop
Kshitiz Bansal