F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
July 17, 2024
Authors: Jie Yang, Xuesong Niu, Nan Jiang, Ruimao Zhang, Siyuan Huang
cs.AI
Abstract
Existing 3D human-object interaction (HOI) datasets and models simply align
global descriptions with the long HOI sequence, while lacking a detailed
understanding of intermediate states and the transitions between states. In
this paper, we argue that fine-grained semantic alignment, which utilizes
state-level descriptions, offers a promising paradigm for learning semantically
rich HOI representations. To achieve this, we introduce Semantic-HOI, a new
dataset comprising over 20K paired HOI states with fine-grained descriptions
for each HOI state and the body movements that happen between two consecutive
states. Leveraging the proposed dataset, we design three state-level HOI tasks
to accomplish fine-grained semantic alignment within the HOI sequence.
Additionally, we propose a unified model called F-HOI, designed to leverage
multimodal instructions and empower a Multimodal Large Language Model (MLLM) to
efficiently handle diverse HOI tasks. F-HOI offers multiple advantages: (1) It
employs a unified task formulation that supports the use of versatile
multimodal inputs. (2) It maintains consistency in HOI across 2D, 3D, and
linguistic spaces. (3) It utilizes fine-grained textual supervision for direct
optimization, avoiding intricate modeling of HOI states. Extensive experiments
reveal that F-HOI effectively aligns HOI states with fine-grained semantic
descriptions, adeptly tackling understanding, reasoning, generation, and
reconstruction tasks.