ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
March 28, 2025
Authors: Yunhong Min, Daehyeon Choi, Kyeongmin Yeo, Jihyun Lee, Minhyuk Sung
cs.AI
Abstract
We introduce ORIGEN, the first zero-shot method for 3D orientation grounding
in text-to-image generation across multiple objects and diverse categories.
While previous work on spatial grounding in image generation has mainly focused
on 2D positioning, it lacks control over 3D orientation. To address this, we
propose a reward-guided sampling approach using a pretrained discriminative
model for 3D orientation estimation and a one-step text-to-image generative
flow model. While gradient-ascent-based optimization is a natural choice for
reward-based guidance, it struggles to maintain image realism. Instead, we
adopt a sampling-based approach using Langevin dynamics, which extends gradient
ascent by simply injecting random noise, requiring just a single additional
line of code. Additionally, we introduce adaptive time rescaling based on the
reward function to accelerate convergence. Our experiments show that ORIGEN
outperforms both training-based and test-time guidance methods across
quantitative metrics and user studies.
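The Langevin-dynamics update the abstract describes is standard: a gradient-ascent step on the reward with Gaussian noise injected at each step. Below is a minimal PyTorch sketch of one such step, not the paper's actual implementation; `reward_fn` (standing in for the pretrained 3D orientation estimator's score) and the step size `eta` are hypothetical names.

```python
import torch

def langevin_step(x, reward_fn, eta=0.01):
    """One reward-guided Langevin update on a latent x.

    Plain gradient ascent would stop after the `eta * grad` step.
    Langevin dynamics adds exactly one more line: Gaussian noise
    scaled by sqrt(2 * eta), which keeps samples spread over the
    generator's distribution instead of collapsing onto a reward
    maximum and losing image realism.
    """
    x = x.detach().requires_grad_(True)
    reward = reward_fn(x)                    # scalar reward (assumed differentiable)
    grad, = torch.autograd.grad(reward, x)   # ascent direction on the reward
    x = x + eta * grad                                # gradient-ascent step
    x = x + (2 * eta) ** 0.5 * torch.randn_like(x)    # the single extra line: injected noise
    return x.detach()
```

Iterating this step on the latent of a one-step text-to-image flow model, then decoding, is the shape of the sampling loop the abstract sketches; the paper's adaptive time rescaling further adjusts the schedule based on the reward, which is not reflected in this fixed-`eta` sketch.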