RaysUp: Ultra-Lightweight Universal Feature Upsampling with Geometry-Aware Ray Representation

Abstract

Pre-trained Vision Foundation Models (VFMs) have become central to modern computer vision thanks to their powerful semantic representations and strong generalization ability. However, their patchified or pooled outputs are inherently low-resolution, which limits their effectiveness in tasks that require fine-grained, pixel-level reasoning. Existing feature upsampling approaches either degrade semantic fidelity or rely on VFM-specific retraining and heavy architectures, hindering efficiency and scalability. To address these challenges, we propose RaysUp, an ultra-lightweight, task-agnostic, and VFM-agnostic feature upsampling framework that reconstructs high-resolution feature maps at arbitrary resolutions. Unlike conventional 2D interpolation or attention-based schemes, RaysUp lifts feature reconstruction into a geometry-aware ray domain. Specifically, we introduce a Spatially Decoupled Guidance Encoder for direction-aware guidance encoding, an Any-Resolution Cross-Attention mechanism for resolution-flexible reconstruction, and a novel Ray Positional Encoding (RayPE) that injects implicit 3D geometric priors via 6D Plücker ray coordinates. In addition, a Geometry-Aware Neighborhood Attention module provides content-adaptive bilateral aggregation while preserving geometric consistency. Extensive experiments across diverse dense prediction tasks demonstrate that RaysUp achieves state-of-the-art performance while using only 16% of the parameters of AnyUp and delivering approximately 7x faster inference. These results highlight a substantially improved accuracy-efficiency trade-off and establish RaysUp as a practical and scalable solution for universal feature upsampling. Code is available at https://github.com/MAP-RaysUp/RaysUp.
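To make the notion of 6D Plücker ray coordinates concrete, the following is a minimal NumPy sketch of how a per-pixel ray map could be computed at any target resolution. It is not the RayPE implementation: the abstract does not specify the camera model or how the coordinates are encoded, so the pinhole camera, the centered principal point, and the names `plucker_rays`, `focal`, and `origin` are all assumptions made for illustration. Each ray with origin o and unit direction d is represented as the pair (d, o × d).

```python
import numpy as np


def plucker_rays(height, width, focal, origin=(0.0, 0.0, 0.0)):
    """Return a (height, width, 6) map of Plücker coordinates (d, o x d).

    Hypothetical helper: assumes a pinhole camera whose principal point
    is the image center and whose focal length is given in pixels.
    """
    # Pixel-center coordinates on a (height, width) grid.
    ys, xs = np.meshgrid(
        np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij"
    )
    # Back-project each pixel to a direction in camera space (z = 1 plane).
    dirs = np.stack(
        [(xs - width / 2) / focal, (ys - height / 2) / focal, np.ones_like(xs)],
        axis=-1,
    )
    # Normalize so every direction d is a unit vector.
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # The moment vector m = o x d fixes the ray's position in space.
    o = np.broadcast_to(np.asarray(origin, dtype=dirs.dtype), dirs.shape)
    moment = np.cross(o, dirs)
    return np.concatenate([dirs, moment], axis=-1)


# The same function yields a ray map for any output size.
low_res = plucker_rays(16, 16, focal=16.0, origin=(0.1, 0.0, 0.0))
high_res = plucker_rays(224, 224, focal=224.0, origin=(0.1, 0.0, 0.0))
```

Because the ray map is defined analytically for any grid size, it can be evaluated at both the low-resolution feature grid and an arbitrary output grid, which is one plausible reason such coordinates suit resolution-flexible reconstruction. Note that with the origin at zero the moment vector vanishes and only the direction carries information.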