Scalable Autoregressive Monocular Depth Estimation

1ZJU   2Ant Group   3University of Notre Dame   4HKUST (Guangzhou)
CVPR 2025

Corresponding Authors

We introduce the first autoregressive model for monocular depth estimation (MDE), called DAR — a simple, effective, and scalable framework. Our key insight lies in transforming two ordered properties of MDE, depth map resolution and granularity, into autoregressive objectives. By recasting MDE as a coarse-to-fine autoregressive process, DAR can be easily scaled up to larger sizes to obtain better performance and generalization.

Abstract

This paper shows that the autoregressive model is an effective and scalable monocular depth estimator. Our idea is simple: we tackle the monocular depth estimation (MDE) task with an autoregressive prediction paradigm, based on two core designs. First, our depth autoregressive model (DAR) treats the depth maps of different resolutions as a set of tokens, and pursues the low-to-high-resolution autoregressive objective with a patch-wise causal mask. Second, DAR recursively discretizes the entire depth range into more compact intervals, and attains the coarse-to-fine granularity autoregressive objective in an ordinal-regression manner. By coupling these two autoregressive objectives, DAR establishes a new state of the art (SOTA) on KITTI and NYU Depth v2 by clear margins. Further, our scalable approach allows us to scale the model up to 2.0B parameters and achieve the best RMSE of 1.799 on the KITTI dataset (a 5% improvement) compared to 1.896 by the current SOTA (Depth Anything). DAR further showcases zero-shot generalization on unseen datasets. These results suggest that DAR yields superior performance with an autoregressive prediction paradigm, providing a promising approach to equip modern autoregressive large models (e.g., GPT-4o) with depth estimation capabilities.

Motivation: Where Are the Autoregressive Objectives?

We exploit two “order” properties of the MDE task that can be transformed into two autoregressive objectives. (a) Resolution autoregressive objective: The generation of depth maps can follow a resolution order from low to high. At each step of the resolution autoregressive process, the Transformer predicts the next higher-resolution token map conditioned on all the previous ones. (b) Granularity autoregressive objective: The range of depth values is ordered, from 0 up to a specific maximum. At each step of the granularity autoregressive process, we exponentially increase the number of bins (e.g., doubling it), and use the previous predictions to predict depth at a smaller, more refined granularity.
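The granularity objective above can be pictured with a minimal sketch: at each step the bin count doubles, so the bin width halves. This toy version uses uniform bins over a fixed depth range (`granularity_schedule` is a hypothetical helper; DAR's MTBin additionally re-centers the refined bins around the previous prediction, which is not modeled here).

```python
import numpy as np

def granularity_schedule(d_min, d_max, num_steps, base_bins=4):
    """Coarse-to-fine binning sketch: at each step k the number of
    bins doubles (base_bins * 2**k), so depth granularity is refined
    exponentially. Returns the bin centers for every step."""
    schedules = []
    for k in range(num_steps):
        n = base_bins * 2 ** k                     # e.g. 4, 8, 16, 32
        edges = np.linspace(d_min, d_max, n + 1)   # uniform bin edges
        centers = 0.5 * (edges[:-1] + edges[1:])   # candidate depths
        schedules.append(centers)
    return schedules

# Four granularity steps over a 0-80m range (KITTI-like depth range)
scheds = granularity_schedule(0.0, 80.0, 4)
print([len(c) for c in scheds])  # [4, 8, 16, 32]
```

Each step's centers act as the candidate depth values the model chooses among; halving the bin width at every step is what makes the prediction progressively finer.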

Motivated by this, we propose DAR, which casts MDE into an autoregressive framework that performs these two autoregressive processes simultaneously.
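The resolution objective relies on a patch-wise causal mask: every token in resolution stage k may attend to all tokens in stages ≤ k but to none in later stages. Below is a minimal sketch of such a block-causal mask (the stage sizes and the handling of image-feature tokens are simplified assumptions, not DAR's exact layout).

```python
import numpy as np

def patchwise_causal_mask(tokens_per_stage):
    """Build a block-causal boolean attention mask: True means
    'may attend'. Tokens within a stage see each other and all
    earlier stages, but never later (higher-resolution) stages."""
    total = sum(tokens_per_stage)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in tokens_per_stage:
        end = start + n
        mask[start:end, :end] = True  # own stage + all prefix stages
        start = end
    return mask

# Three stages with 1, 4, and 16 tokens (e.g. 1x1, 2x2, 4x4 token maps)
m = patchwise_causal_mask([1, 4, 16])
print(m.shape)  # (21, 21)
```

Unlike the per-token causal mask of language models, the block structure lets all tokens of one resolution be predicted in parallel while keeping the low-to-high autoregressive order across resolutions.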

Motivation: Two 'order' properties that naturally become autoregressive objectives.

Overview: Our method


An overview of DAR. We begin by encoding the input RGB image into image tokens that serve as the context condition. At each step k, the DAR Transformer with the patch-wise causal mask performs autoregressive prediction: the input token map (upsampled from the previous-resolution token map r_{k−1}) interacts only with the prefix tokens and the global image feature tokens to model the next-resolution token map. The output latent tokens are then sent to a ConvGRU module, which injects prompts from the newly refined bin candidates c_k (generated by MTBin from the previous prediction D̃_{k−1}) for further granularity guidance and produces the next-resolution token map r_k. The new depth map D̃_k is generated by a linear combination of the next-granularity bin candidates c_k and the softmax values p_k of the next-resolution token map, achieving a concurrent resolution and granularity autoregressive evolution.
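The final readout described above — a linear combination of bin candidates weighted by softmax probabilities — is the standard ordinal-regression-style depth decoding. A minimal sketch (function and variable names here are illustrative, not DAR's actual API):

```python
import numpy as np

def depth_from_bins(logits, bin_centers):
    """Depth readout D = sum_i p_i * c_i: softmax the per-pixel bin
    logits p, then take the expectation over bin centers c."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)               # softmax weights
    return (p * bin_centers).sum(axis=-1)            # expected depth

centers = np.array([1.0, 3.0, 5.0, 7.0])             # bin candidates c_k
logits = np.array([[10.0, 0.0, 0.0, 0.0],            # confident in bin 0
                   [0.0, 0.0, 0.0, 10.0]])           # confident in bin 3
print(depth_from_bins(logits, centers))              # close to [1., 7.]
```

Because the output is a soft expectation rather than a hard argmax, depth varies continuously between bin centers, which is what allows the coarse prediction at step k−1 to be smoothly refined at step k.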

Scalable MDE Model

Our DAR shows strong scalability and achieves better performance-efficiency trade-off among cutting-edge methods.

RMSE performance (↓) vs. model size on the KITTI dataset.

Results

We visualize DAR's predictions on the MDE task. First, our model estimates depth more accurately at object boundaries, producing more coherent and smooth results (e.g., the back of the chair and distant objects). This is helped by our progressive autoregressive paradigm, which conditions each more refined prediction on the previous ones and thereby maintains coherent, smooth depth estimates. Second, DAR is markedly more accurate on small and thin objects, as well as on distant, visually small objects, such as the poles under the chair. DAR also preserves fine-grained boundary details and generates more continuous depth values, further demonstrating the effectiveness of our AR-based framework.


BibTeX

@article{wang2024scalable,
  title={Scalable Autoregressive Monocular Depth Estimation},
  author={Wang, Jinhong and Liu, Jian and Tang, Dongqi and Wang, Weiqiang and Li, Wentong and Chen, Danny and Chen, Jintai and Wu, Jian},
  journal={arXiv preprint arXiv:2411.11361},
  year={2024}
}