FlowBender: Feedback-Aware Learning for Self-Correcting Conditional Flows

Abstract

Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator (the depth predictor that defines the constraint) is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample. We argue that the fundamental gap in both paradigms is that the model is never trained to use its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this deviation to produce a corrected velocity. We propose several variants of FlowBender, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other. Project page: https://flow-bender.github.io/
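
The three-stage sampling step described in the abstract (look-ahead, deviation, refinement) can be sketched on a toy problem. This is only an illustration of the data flow, not the authors' implementation. Everything named here is an assumption: `forward_operator` is a pair-averaging stand-in for a real operator such as a depth predictor, `toy_velocity` is a hand-written placeholder for the trained network (so it does not represent a learned correction policy), and the flow convention (time running from 0 to 1, clean estimate `x + (1 - t) * v`, Euler integration) is chosen for simplicity.

```python
from typing import List, Optional

Vector = List[float]


def forward_operator(x: Vector) -> Vector:
    # Toy forward operator A (assumption): average each adjacent pair.
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


def upsample(y: Vector) -> Vector:
    # Nearest-neighbour upsampling, so forward_operator(upsample(y)) == y.
    return [v for v in y for _ in range(2)]


def toy_velocity(x_t: Vector, t: float, cond: Vector,
                 feedback: Optional[Vector] = None) -> Vector:
    # Hand-written placeholder for the trained network (NOT a learned policy).
    # Its clean-signal guess is deliberately biased, so the constraint
    # A(x) == cond is violated unless feedback is supplied.
    guess = [c + (0.3 if i % 2 == 0 else 0.1)
             for i, c in enumerate(upsample(cond))]
    if feedback is not None:
        # Refinement: the deviation enters as an extra input.
        guess = [g - d for g, d in zip(guess, upsample(feedback))]
    # Velocity that moves x_t toward the guess by time t = 1.
    return [(g - x) / (1.0 - t) for g, x in zip(guess, x_t)]


def sample(cond: Vector, x0: Vector, steps: int = 8,
           closed_loop: bool = True) -> Vector:
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # 1) Unguided look-ahead pass.
        v = toy_velocity(x, t, cond)
        if closed_loop:
            # Clean-signal estimate implied by the look-ahead velocity.
            x1_hat = [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]
            # 2) Task-specific deviation via the forward operator.
            deviation = [a - c
                         for a, c in zip(forward_operator(x1_hat), cond)]
            # 3) Refinement pass produces the corrected velocity.
            v = toy_velocity(x, t, cond, feedback=deviation)
        # Euler step along the (possibly corrected) velocity.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def constraint_error(x: Vector, cond: Vector) -> float:
    # Largest absolute mismatch between A(x) and the condition.
    return max(abs(a - c) for a, c in zip(forward_operator(x), cond))


if __name__ == "__main__":
    cond = [1.0, -2.0]
    x0 = [0.5, -0.3, 0.8, 0.1]
    for flag in (False, True):
        err = constraint_error(sample(cond, x0, closed_loop=flag), cond)
        print("closed_loop =", flag, "constraint error =", err)
```

In this toy setting the open-loop sampler ends with a constraint error of about 0.2 (the pooled bias of the placeholder network), while the closed-loop sampler drives it to roughly zero and keeps the within-pair detail of the sample. The closed-loop branch calls the network twice per step; the prior-step shortcut mentioned in the abstract is aimed at that extra cost and is not modelled here.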