One Scene, Two Depths: Exploring Geometric Ambiguity in Monocular Foundation Models

Abstract

A faithful 3D world representation should account for layered geometry, where a single camera ray may contain multiple visible and geometrically valid surfaces. Monocular depth estimation, however, reduces this structure to one scalar depth per pixel. Transparent scenes make this ambiguity measurable: the same ray can pass through foreground glass and observe the background, turning the supervised target into a convention of annotation, data, and training rather than a scene-intrinsic truth. A learned predictor exposes this convention as its depth-layer preference. We introduce MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for measuring depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA). On MD-3k, leading depth foundation models exhibit diverse layer preferences under standard RGB input, showing that the same layered geometry can be resolved differently across models. We further find that Laplacian Visual Prompting (LVP), a training-free spectral input transformation, can substantially change the reported layer for certain frozen models. The strongest RGB/LVP pair, DAv2-L, reaches 75.5% ML-SRA. These results suggest that depth foundation models may express complementary geometric hypotheses that standard RGB inference leaves unexpressed. We invite the community to rethink depth supervision and evaluation through an ambiguity-aware lens, where multiple valid 3D interpretations are treated as geometric structure to be measured, preserved, and expressed.
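To make the two measurable ingredients above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a Laplacian-style input transformation can be approximated by a 4-neighbour discrete Laplacian applied per channel, and that layer preference can be read off sparse point pairs carrying one ordinal label per layer. The function names `laplacian_prompt` and `layer_agreement`, the pair layout, and the "smaller value means closer" depth convention are all hypothetical choices made for illustration.

```python
import numpy as np


def laplacian_prompt(image: np.ndarray) -> np.ndarray:
    """Apply a 4-neighbour discrete Laplacian to each channel of an HxWxC image.

    Assumption: this kernel stands in for the spectral input transformation;
    the exact transformation used for LVP is not specified in the abstract.
    """
    img = image.astype(np.float64)
    # Replicate border pixels so the output keeps the input's spatial size.
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    return (
        padded[:-2, 1:-1] + padded[2:, 1:-1]     # up + down neighbours
        + padded[1:-1, :-2] + padded[1:-1, 2:]   # left + right neighbours
        - 4.0 * img
    )


def layer_agreement(depth: np.ndarray, pairs: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of point pairs whose predicted order matches one layer's labels.

    depth:  HxW predicted depth map (assumption: smaller value = closer).
    pairs:  (N, 4) integer array of (row_a, col_a, row_b, col_b).
    labels: (N,) boolean array, True where point a is closer than point b
            in the layer being evaluated.
    """
    depth_a = depth[pairs[:, 0], pairs[:, 1]]
    depth_b = depth[pairs[:, 2], pairs[:, 3]]
    return float(np.mean((depth_a < depth_b) == labels))


# Toy data: a 2x2 depth map and two point pairs with per-layer labels.
toy_depth = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_pairs = np.array([[0, 0, 1, 1], [1, 1, 0, 0]])
layer1_labels = np.array([True, True])    # hypothetical foreground-layer ordering
layer2_labels = np.array([True, False])   # hypothetical background-layer ordering

agreement_layer1 = layer_agreement(toy_depth, toy_pairs, layer1_labels)
agreement_layer2 = layer_agreement(toy_depth, toy_pairs, layer2_labels)
```

Under these assumptions, a model whose agreement with one layer's labels is much higher than with the other's can be said to prefer that layer. Running the same frozen model on `laplacian_prompt(image)` and recomputing both scores would show whether the transformed input shifts that preference. Models that output inverse depth or disparity would need the comparison in `layer_agreement` flipped.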