ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation

Abstract

We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. Previous work on spatial grounding in image generation has focused mainly on 2D positioning and offers no control over 3D orientation. To address this, we propose a reward-guided sampling approach that uses a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. Gradient-ascent-based optimization is a natural choice for reward-based guidance, but it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise and requires just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.
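The "single additional line" relation between gradient ascent and Langevin dynamics can be illustrated with a minimal sketch. This is not ORIGEN's implementation: the standard Gaussian prior, the quadratic toy reward, and every name below (`grad_log_target`, `update`, `run_chains`, `reward_weight`) are assumptions chosen for illustration. In the paper the reward comes from a pretrained orientation estimator applied to generated images. The sketch also omits the adaptive time rescaling.

```python
import numpy as np


def grad_log_target(x, target, reward_weight):
    """Gradient of a toy log-density (assumed for illustration only).

    log p(x) = -0.5 * ||x||^2  -  0.5 * reward_weight * ||x - target||^2
               (Gaussian prior)   (quadratic stand-in for a reward)
    """
    return -x - reward_weight * (x - target)


def update(x, step, rng, target, reward_weight, inject_noise):
    """One update: plain gradient ascent, optionally turned into Langevin."""
    x = x + step * grad_log_target(x, target, reward_weight)
    if inject_noise:
        # The one extra line: Gaussian noise scaled by sqrt(2 * step)
        # turns gradient ascent into an unadjusted Langevin step.
        x = x + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x


def run_chains(inject_noise, n_chains=2000, n_steps=500, step=0.01, seed=0):
    """Run many independent chains and return their final positions."""
    rng = np.random.default_rng(seed)
    target = np.array([2.0, 0.0])   # hypothetical reward optimum
    reward_weight = 3.0             # hypothetical reward strength
    x = rng.standard_normal((n_chains, 2))
    for _ in range(n_steps):
        x = update(x, step, rng, target, reward_weight, inject_noise)
    return x


if __name__ == "__main__":
    ascent = run_chains(inject_noise=False)
    langevin = run_chains(inject_noise=True)
    print("gradient ascent variance:", ascent[:, 0].var())
    print("Langevin variance:       ", langevin[:, 0].var())
```

In this toy setting, gradient ascent drives every chain to the same mode, so the spread of the samples collapses. The Langevin variant keeps samples distributed around the mode, which is the qualitative property the abstract associates with preserving realism.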
