F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

Abstract

Existing 3D human-object interaction (HOI) datasets and models simply align global descriptions with long HOI sequences, and lack a detailed understanding of intermediate states and the transitions between them. In this paper, we argue that fine-grained semantic alignment, which uses state-level descriptions, offers a promising paradigm for learning semantically rich HOI representations. To achieve this, we introduce Semantic-HOI, a new dataset comprising over 20K paired HOI states, with a fine-grained description of each HOI state and of the body movements that occur between two consecutive states. Leveraging the proposed dataset, we design three state-level HOI tasks to accomplish fine-grained semantic alignment within an HOI sequence. Additionally, we propose a unified model called F-HOI, designed to leverage multimodal instructions and empower a multimodal large language model (MLLM) to efficiently handle diverse HOI tasks. F-HOI offers multiple advantages: (1) it employs a unified task formulation that supports versatile multimodal inputs; (2) it maintains HOI consistency across 2D, 3D, and linguistic spaces; (3) it uses fine-grained textual supervision for direct optimization, avoiding intricate modeling of HOI states. Extensive experiments show that F-HOI effectively aligns HOI states with fine-grained semantic descriptions and adeptly handles understanding, reasoning, generation, and reconstruction tasks.
