ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
March 28, 2025
Authors: Yunhong Min, Daehyeon Choi, Kyeongmin Yeo, Jihyun Lee, Minhyuk Sung
cs.AI
Abstract
We introduce ORIGEN, the first zero-shot method for 3D orientation grounding
in text-to-image generation across multiple objects and diverse categories.
While previous work on spatial grounding in image generation has mainly focused
on 2D positioning, it lacks control over 3D orientation. To address this, we
propose a reward-guided sampling approach using a pretrained discriminative
model for 3D orientation estimation and a one-step text-to-image generative
flow model. While gradient-ascent-based optimization is a natural choice for
reward-based guidance, it struggles to maintain image realism. Instead, we
adopt a sampling-based approach using Langevin dynamics, which extends gradient
ascent by simply injecting random noise, requiring just a single additional
line of code. Additionally, we introduce adaptive time rescaling based on the
reward function to accelerate convergence. Our experiments show that ORIGEN
outperforms both training-based and test-time guidance methods across
quantitative metrics and user studies.
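The Langevin-dynamics update the abstract describes is standard: a gradient-ascent step on the reward with Gaussian noise injected at each step. Below is a minimal PyTorch sketch of one such step, not the paper's actual implementation; `reward_fn` (standing in for the pretrained 3D orientation estimator's score) and the step size `eta` are hypothetical names.

```python
import torch

def langevin_step(x, reward_fn, eta=0.01):
    """One reward-guided Langevin update on a latent x.

    Plain gradient ascent would stop after the `eta * grad` step.
    Langevin dynamics adds exactly one more line: Gaussian noise
    scaled by sqrt(2 * eta), which keeps samples spread over the
    generator's distribution instead of collapsing onto a reward
    maximum and losing image realism.
    """
    x = x.detach().requires_grad_(True)
    reward = reward_fn(x)                    # scalar reward (assumed differentiable)
    grad, = torch.autograd.grad(reward, x)   # ascent direction on the reward
    x = x + eta * grad                                # gradient-ascent step
    x = x + (2 * eta) ** 0.5 * torch.randn_like(x)    # the single extra line: injected noise
    return x.detach()
```

Iterating this step on the latent of a one-step text-to-image flow model, then decoding, is the shape of the sampling loop the abstract sketches; the paper's adaptive time rescaling further adjusts the schedule based on the reward, which is not reflected in this fixed-`eta` sketch.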