OneHOI: Unifying Human-Object Interaction Generation and Editing
April 15, 2026
Authors: Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan, Chee Seng Chan
cs.AI
Abstract
Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions such as HOI and object-only entities; HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and to scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K dataset, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.
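The modality dropout mentioned in the abstract, which lets a single model serve layout-guided, layout-free, and mixed-condition control, can be illustrated with a minimal sketch. The condition names, dropout probability, and null-replacement scheme below are illustrative assumptions, not details from the paper:

```python
import random

# Hypothetical condition modalities; the actual set used by OneHOI may differ.
CONDITION_KEYS = ["hoi_tokens", "layout", "text"]

def modality_dropout(conditions, p_drop=0.3, rng=random):
    """Randomly null out conditioning modalities during training so the
    denoiser learns to operate under any subset of conditions
    (e.g. layout-guided when layout is kept, layout-free when dropped)."""
    out = dict(conditions)
    for key in CONDITION_KEYS:
        if key in out and rng.random() < p_drop:
            # In a real model this would be a learned null embedding,
            # not a Python None.
            out[key] = None
    return out
```

At inference time, the same mechanism allows the user to supply only the conditions they care about, with the dropped slots filled by the null embedding.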