Teaser Image

Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and other non-Lambertian surfaces. Monocular depth estimation models trained at scale provide strong structural priors that can mitigate these errors, but their predictions can be skewed or mis-scaled in metric units, limiting their direct use in robotics. In this work, we propose a training-free depth grounding framework that anchors the priors of a monocular depth foundation model in raw sensor depth through factor graph optimization. Our method performs a patch-wise affine alignment, locally grounding monocular predictions in metric real-world depth while preserving fine-grained geometric structure and depth discontinuities. To facilitate evaluation in challenging real-world conditions, we introduce a benchmark dataset with dense scene-wide ground-truth depth in the presence of non-Lambertian objects. Ground truth is obtained via matte reflection spray and multi-camera fusion, overcoming the reliance on object-only CAD-based annotations in prior datasets. Extensive evaluations across diverse sensors and domains demonstrate consistent improvements in depth accuracy without any (re-)training.
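To illustrate the core idea of patch-wise affine grounding, the sketch below fits a per-patch scale and shift between a monocular depth prediction and valid sensor depth via least squares, then applies it to metrically anchor the prediction. This is a simplified, hypothetical illustration: it fits each patch independently and omits the factor-graph optimization that couples patches in our actual method; the function names and patch size are assumptions, not the released implementation.

```python
import numpy as np

def align_patch(mono, sensor, valid):
    """Fit scale a and shift b minimizing ||a * mono + b - sensor|| on valid pixels."""
    x, y = mono[valid], sensor[valid]
    if x.size < 2:
        return 1.0, 0.0  # too few sensor returns; leave the patch unscaled
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

def patchwise_affine_alignment(mono_depth, sensor_depth, patch=32):
    """Ground a monocular depth map in sensor depth with one affine fit per patch."""
    H, W = mono_depth.shape
    out = mono_depth.copy()
    # Sensor depth is typically incomplete on non-Lambertian surfaces;
    # only finite, positive returns are used for the fit.
    valid = np.isfinite(sensor_depth) & (sensor_depth > 0)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            sl = np.s_[i:i + patch, j:j + patch]
            a, b = align_patch(mono_depth[sl], sensor_depth[sl], valid[sl])
            out[sl] = a * mono_depth[sl] + b
    return out
```

Because each patch is aligned with only two parameters, local relative structure and discontinuities in the monocular prediction are preserved while the absolute metric scale comes from the sensor; the full method additionally smooths the per-patch affine parameters through factor graph optimization.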

Code

This work is released under the GPLv3 license. For commercial use, please contact the authors. A software implementation of this project will be made available on GitHub upon publication.

Publications

If you find our work useful, please consider citing our paper(s):

Simon Dorer, Martin Büchner, Nick Heppert, Abhinav Valada
AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs
Under review, 2026.

(PDF) (BibTeX)

Authors

Simon Dorer

University of Freiburg, Zuse School ELIZA

Martin Büchner

University of Freiburg

Nick Heppert

University of Freiburg, Zuse School ELIZA

Abhinav Valada

University of Freiburg

Acknowledgment

This work was partially funded by the Carl Zeiss Foundation through the ReScaLe project.

Simon Dorer and Nick Heppert are supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research.