Scalable Autoregressive Monocular Depth Estimation

1ZJU   2Ant Group   3University of Notre Dame   4HKUST (Guangzhou)
CVPR 2025


We introduce the first autoregressive model for monocular depth estimation (MDE), called DAR — a simple, effective, and scalable framework. Our key insight is to transform two ordered properties of MDE, depth map resolution and granularity, into autoregressive objectives. By recasting MDE as a coarse-to-fine autoregressive process, DAR can easily be scaled up to larger sizes for better performance and generalization.

Abstract

This paper shows that the autoregressive model is an effective and scalable monocular depth estimator. Our idea is simple: we tackle the monocular depth estimation (MDE) task with an autoregressive prediction paradigm, based on two core designs. First, our depth autoregressive model (DAR) treats depth maps of different resolutions as sets of tokens and conducts a low-to-high resolution autoregressive objective with a patch-wise causal mask. Second, DAR recursively discretizes the entire depth range into more compact intervals, attaining a coarse-to-fine granularity autoregressive objective in an ordinal-regression manner. By coupling these two autoregressive objectives, DAR establishes a new state of the art (SOTA) on KITTI and NYU Depth v2 by clear margins. Further, our scalable approach allows us to scale the model up to 2.0B parameters and achieve the best RMSE of 1.799 on the KITTI dataset (a 5% improvement) compared to 1.896 by the current SOTA (Depth Anything). DAR further showcases zero-shot generalization on unseen datasets. These results suggest that DAR yields superior performance with an autoregressive prediction paradigm, providing a promising approach to equip modern autoregressive large models (e.g., GPT-4o) with depth estimation capabilities.

Motivation: Where Are the Autoregressive Objectives?

We exploit two "order" properties of the MDE task that can be transformed into two autoregressive objectives. (a) Resolution autoregressive objective: the generation of depth maps can follow a resolution order from low to high. At each step of the resolution autoregressive process, the Transformer predicts the next higher-resolution token map conditioned on all the previous ones. (b) Granularity autoregressive objective: the range of depth values is ordered, from 0 to a dataset-specific maximum. At each step of the granularity autoregressive process, we exponentially increase the number of bins (e.g., doubling the bin count) and use the previous predictions to predict depth at a smaller, more refined granularity.
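To make the granularity schedule concrete, here is a minimal Python sketch (not the authors' code; the 80 m maximum depth and four steps are illustrative assumptions in the spirit of KITTI-style setups). It computes the sequence of coarse-to-fine bin indices a ground-truth depth induces when the bin count doubles at every step:

```python
def granularity_targets(depth, d_min=0.0, d_max=80.0, steps=4):
    """Coarse-to-fine bin indices for one depth value.

    At step s, the range [d_min, d_max] is split into 2**(s + 1)
    uniform bins (the count doubles each step); the target is the
    index of the bin containing `depth`. Names and the uniform
    binning are illustrative assumptions, not the paper's exact setup.
    """
    targets = []
    for s in range(steps):
        n_bins = 2 ** (s + 1)                 # 2, 4, 8, 16, ...
        width = (d_max - d_min) / n_bins
        idx = min(int((depth - d_min) / width), n_bins - 1)
        targets.append(idx)
    return targets

# A 10 m depth lands in progressively narrower bins:
# step 0: 2 bins of 40 m -> bin 0; step 3: 16 bins of 5 m -> bin 2.
print(granularity_targets(10.0))  # [0, 0, 1, 2]
```

Each successive index refines the previous one, which is what lets the model predict depth ordinally, one granularity level at a time.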

Motivated by this, we propose DAR, which casts MDE into an autoregressive framework that performs these two autoregressive processes simultaneously.
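To illustrate the "conditioned on all previous token maps" idea, the sketch below builds a block-wise (patch-wise) causal attention mask in plain Python. It assumes tokens are ordered by resolution stage; the helper name and stage sizes are our own illustrative choices, not the paper's implementation:

```python
def patchwise_causal_mask(stage_sizes):
    """True where attention is allowed: query token i may attend to
    key token j iff j belongs to the same or an earlier resolution
    stage, so each stage conditions on all previously generated maps
    while staying causal across stages.
    """
    # Expand per-stage token counts into one stage id per token.
    stages = [s for s, n in enumerate(stage_sizes) for _ in range(n)]
    n_tok = len(stages)
    return [[stages[j] <= stages[i] for j in range(n_tok)]
            for i in range(n_tok)]

# e.g. a 1x1 token map followed by a 2x2 one: 1 + 4 = 5 tokens total.
mask = patchwise_causal_mask([1, 4])
```

Unlike a strict token-by-token causal mask, tokens within one resolution stage attend to each other freely, so an entire higher-resolution map can be predicted in a single step.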

Motivation: two "order" properties that naturally become autoregressive objectives.


BibTeX

@article{wang2024scalable,
  title={Scalable Autoregressive Monocular Depth Estimation},
  author={Wang, Jinhong and Liu, Jian and Tang, Dongqi and Wang, Weiqiang and Li, Wentong and Chen, Danny and Chen, Jintai and Wu, Jian},
  journal={arXiv preprint arXiv:2411.11361},
  year={2024}
}