Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera

Introduction

Monocular depth estimation is crucial in applications like autonomous driving, robotics, and AR/VR. However, achieving accurate metric depth estimation across diverse camera types, such as fisheye and 360° cameras with large fields of view (FoVs), presents significant challenges. Existing methods trained on perspective images often fail to generalize effectively to large FoV cameras due to distortions and differences in camera parameters. Addressing these limitations is essential for enhancing the generalization and applicability of depth estimation systems.

Objective

The primary goal of the Depth Any Camera (DAC) framework is to enable zero-shot metric depth estimation across diverse camera types, including fisheye and 360° cameras, using a model trained exclusively on perspective images. This approach aims to provide a unified solution that mitigates the challenges posed by varying FoVs, distortions, and resolution mismatches between training and testing.

Methodology

Equi-Rectangular Projection (ERP): A unified image representation that maps images from different camera types into a shared space. This allows consistent processing across varying FoVs.
Image-to-ERP Conversion: Efficient conversion of input images to ERP patches using grid sampling and gnomonic projection. This process incorporates camera pitch awareness and enables online ERP augmentations.
FoV Alignment: A preprocessing step to adjust diverse FoV training samples to a predefined ERP patch size, ensuring efficient training while preserving essential content.
Multi-Resolution Training: Incorporating multiple resolutions in training to address resolution mismatches between ERP patches during training and large FoV testing images.
Depth Model and Loss Function: The DAC framework uses a simplified version of the iDisc architecture, with cross-attention and self-attention mechanisms, and trains the model with the scale-invariant logarithmic (SILog) loss to optimize metric depth predictions.
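
The image-to-ERP conversion and the loss can be illustrated with a minimal sketch. It is not the DAC implementation: the function names, the pinhole intrinsics (fx, fy, cx, cy), the assumption that the camera looks at latitude equal to its pitch, and the SILog weight lam = 0.85 (a value common in the depth-estimation literature, not confirmed for DAC) are all assumptions made for illustration. The first part uses the standard gnomonic projection to find, for each ERP patch pixel, where to sample the perspective image; the second part is a common formulation of the SILog loss.

```python
import math


def gnomonic(lat, lon, lat0=0.0, lon0=0.0):
    """Project a sphere point (radians) onto the plane tangent at (lat0, lon0).

    Returns (x, y) on the unit-distance tangent plane, or None when the
    point is 90 degrees or more away from the tangent point.
    """
    dlon = lon - lon0
    cos_c = (math.sin(lat0) * math.sin(lat)
             + math.cos(lat0) * math.cos(lat) * math.cos(dlon))
    if cos_c <= 1e-9:
        return None
    x = math.cos(lat) * math.sin(dlon) / cos_c
    y = (math.cos(lat0) * math.sin(lat)
         - math.sin(lat0) * math.cos(lat) * math.cos(dlon)) / cos_c
    return x, y


def erp_sampling_grid(patch_h, patch_w, fov_lat, fov_lon,
                      fx, fy, cx, cy, pitch=0.0):
    """For every ERP patch pixel, find the perspective-image location to sample.

    Assumption: the patch spans fov_lat x fov_lon radians centred on
    (lat=pitch, lon=0), which is also where the pinhole camera looks.
    """
    grid = []
    for i in range(patch_h):
        # Row 0 is the top of the patch, i.e. the highest latitude.
        lat = pitch + fov_lat * (0.5 - (i + 0.5) / patch_h)
        row = []
        for j in range(patch_w):
            lon = fov_lon * ((j + 0.5) / patch_w - 0.5)
            xy = gnomonic(lat, lon, lat0=pitch, lon0=0.0)
            if xy is None:
                row.append(None)
            else:
                # Image v grows downward while tangent-plane y grows upward.
                row.append((cx + fx * xy[0], cy - fy * xy[1]))
        grid.append(row)
    return grid


def silog_loss(pred, gt, lam=0.85, eps=1e-6):
    """Scale-invariant log loss over paired positive depth values.

    lam=0.85 is an assumed, commonly used weight, not a confirmed DAC setting.
    """
    d = [math.log(max(p, eps)) - math.log(max(g, eps))
         for p, g in zip(pred, gt)]
    mean_d = sum(d) / len(d)
    mean_d2 = sum(v * v for v in d) / len(d)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_d2 - lam * mean_d * mean_d, 0.0))
```

In practice the sampling grid would be passed to an interpolating sampler (for example a bilinear grid-sampling routine), with locations outside the source image masked out, and the loss would be computed only over pixels that have valid ground-truth depth.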

Results

The DAC framework was tested on four large FoV datasets: the 360° datasets Matterport3D and Pano3D-GV2, and the fisheye datasets ScanNet++ and KITTI-360. The results demonstrated the following:

Improved Accuracy: DAC significantly outperformed state-of-the-art methods like Metric3Dv2 and UniDepth, with improvements of up to 50% in δ1 accuracy.
Robust Generalization: The proposed framework achieved robust generalization to both indoor and outdoor scenes, surpassing existing baselines in metric depth estimation accuracy.
Ablation Studies: Ablation studies highlighted the critical role of FoV alignment and multi-resolution training in enhancing the generalization of DAC.
Qualitative Evaluations: DAC produced consistent and accurate depth maps for diverse camera types without requiring camera-specific fine-tuning.
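
For readers unfamiliar with the δ1 metric cited in these results, the sketch below shows the usual definition: the fraction of valid pixels whose predicted depth is within a factor of 1.25 of the ground truth. The function name and the flat-list input format are assumptions for illustration, not the evaluation code used in this project.

```python
def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below threshold."""
    hits = 0
    valid = 0
    for p, g in zip(pred, gt):
        if p <= 0 or g <= 0:
            continue  # skip pixels without a valid depth value
        valid += 1
        if max(p / g, g / p) < threshold:
            hits += 1
    # Return 0.0 when no pixel is valid, to avoid dividing by zero.
    return hits / valid if valid else 0.0
```

Because the ratio is taken in both directions, over-estimating and under-estimating depth by the same factor are penalized equally.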

Conclusion

The DAC framework provides a comprehensive solution for zero-shot metric depth estimation across diverse camera types. By leveraging Equi-Rectangular Projection, FoV alignment, and multi-resolution training, DAC bridges the gap between perspective-trained models and large FoV cameras. Its success paves the way for improved depth estimation in challenging real-world applications, offering a versatile and robust tool for downstream tasks in autonomous systems and beyond.

Contribution

This work was carried out in collaboration with Yuliang Guo, Lead Research Scientist at Bosch. I was primarily responsible for conducting all experiments, curating and cleaning datasets, and developing the model. My contributions included training and evaluating the baseline models and the DAC framework, as well as generating visualizations, such as point clouds derived from depth maps predicted from monocular RGB images. Yuliang Guo provided invaluable guidance, insights, and feedback throughout the project, and played a pivotal role in designing the DAC pipeline. This research was conducted at Bosch Research, Sunnyvale, CA, USA.