ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions

Abstract

Existing hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. Reconstructing 4D human-articulated-object interactions from a single monocular RGB video remains an unexplored but significant challenge. Fortunately, recent advances in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical implausibility of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method that optimizes the object's metric scale and pose to ground its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method that uses contact reasoning information as constraints for hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of ArtHOI across diverse objects and interactions. Project: https://arthoi-reconstruction.github.io
