Compass Control: Multi Object Orientation Control for Text-to-Image Generation

Abstract

Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models, which enables the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model on a set of orientation-aware compass tokens, one per object, along with the text tokens. A lightweight encoder network takes each object's orientation as input and predicts these compass tokens. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, directly training this framework results in poor orientation control and leads to entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention maps of each compass token to its corresponding object regions. The trained model achieves precise orientation control for a) complex objects not seen during training and b) multi-object scenes with more than two objects, indicating strong generalization. Further, when combined with personalization methods, our method precisely controls the orientation of the new object in diverse contexts. Our method achieves state-of-the-art orientation control and text alignment, as quantified by extensive evaluations and a user study.
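
The abstract states that each compass token's cross-attention map is constrained to its object's region, but it does not specify the mechanism. As an illustration only, the sketch below shows one possible realization: hard-masking the pre-softmax attention scores so that a compass token receives zero weight outside its region. The function name `constrain_compass_attention`, the flattened-pixel layout, and the boolean region masks are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def constrain_compass_attention(logits, compass_regions):
    """Softmax over tokens after hiding each compass token outside its region.

    logits: (num_pixels, num_tokens) pre-softmax cross-attention scores.
    compass_regions: dict mapping a compass-token index to a boolean
        (num_pixels,) mask that is True inside that object's region.
    Hypothetical helper for illustration; not the authors' implementation.
    """
    masked = logits.astype(float).copy()
    for token_idx, region in compass_regions.items():
        # Outside its region the compass token gets -inf, i.e. zero attention.
        masked[~region, token_idx] = -np.inf
    # Numerically stable softmax over the token axis. Text tokens are never
    # masked here, so every row keeps at least one finite score.
    masked -= masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)


# Toy example: 4 pixels, 3 tokens (token 0 is a text token,
# tokens 1 and 2 are the compass tokens of two objects).
logits = np.zeros((4, 3))
regions = {
    1: np.array([True, True, False, False]),  # object A covers pixels 0-1
    2: np.array([False, False, True, True]),  # object B covers pixels 2-3
}
attn = constrain_compass_attention(logits, regions)
```

In this toy setup, pixels inside object A split their attention between the text token and compass token 1, and give none to compass token 2; the reverse holds for object B. A soft penalty on attention mass outside the region would be an alternative design, and the abstract leaves open which variant is used.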

