F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
July 17, 2024
Authors: Jie Yang, Xuesong Niu, Nan Jiang, Ruimao Zhang, Siyuan Huang
cs.AI
Abstract
Existing 3D human-object interaction (HOI) datasets and models simply align
global descriptions with the long HOI sequence, while lacking a detailed
understanding of intermediate states and the transitions between states. In
this paper, we argue that fine-grained semantic alignment, which utilizes
state-level descriptions, offers a promising paradigm for learning semantically
rich HOI representations. To achieve this, we introduce Semantic-HOI, a new
dataset comprising over 20K paired HOI states with fine-grained descriptions
for each HOI state and the body movements that happen between two consecutive
states. Leveraging the proposed dataset, we design three state-level HOI tasks
to accomplish fine-grained semantic alignment within the HOI sequence.
Additionally, we propose a unified model called F-HOI, designed to leverage
multimodal instructions and empower the Multi-modal Large Language Model to
efficiently handle diverse HOI tasks. F-HOI offers multiple advantages: (1) It
employs a unified task formulation that supports the use of versatile
multimodal inputs. (2) It maintains consistency in HOI across 2D, 3D, and
linguistic spaces. (3) It utilizes fine-grained textual supervision for direct
optimization, avoiding intricate modeling of HOI states. Extensive experiments
reveal that F-HOI effectively aligns HOI states with fine-grained semantic
descriptions, adeptly tackling understanding, reasoning, generation, and
reconstruction tasks.
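
To make the dataset description above concrete, here is a minimal sketch of how paired HOI states with per-state descriptions and between-state body movements might be represented. The abstract does not specify the Semantic-HOI file format, so every class name, field name, and the helper `pair_consecutive` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HOIState:
    """One HOI state with its fine-grained description (hypothetical schema)."""
    state_id: int
    description: str


@dataclass
class StatePair:
    """Two consecutive states plus the body movement between them (hypothetical)."""
    current: HOIState
    next: HOIState
    movement: str


def pair_consecutive(states: List[HOIState], movements: List[str]) -> List[StatePair]:
    """Build one StatePair per adjacent pair of states in a sequence."""
    expected = max(len(states) - 1, 0)
    if len(movements) != expected:
        raise ValueError("expected exactly one movement per consecutive state pair")
    # zip(states, states[1:]) walks the sequence as (state_i, state_{i+1}).
    return [StatePair(a, b, m) for a, b, m in zip(states, states[1:], movements)]
```

Under this assumed layout, a sequence of N states yields N - 1 pairs, each carrying the two state descriptions and the movement that links them. That is the granularity at which the abstract says the state-level tasks operate.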