Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Abstract

Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations that are difficult to articulate in words. Given a triplet {a, a', b}, the goal is to generate b' such that a : a' :: b : b'. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: capturing the diverse space of visual transformations within one fixed adaptation module constrains generalization. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives (informally, choosing a point in a "space of LoRAs"). We introduce two key components: (1) a learnable basis of LoRA modules that spans the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weights these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate that our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are available at https://research.nvidia.com/labs/par/lorweb
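To make the idea of choosing a point in a "space of LoRAs" concrete, the following is a minimal sketch of mixing a basis of LoRA modules with per-task coefficients. It is an illustration under stated assumptions, not the released implementation: the abstract does not specify how the encoder's output is turned into weights, so the softmax mixing, the function name `mix_lora_basis`, and the toy rank-1 basis are all hypothetical, and the encoder is replaced by a hand-written list of logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def mix_lora_basis(basis, logits):
    """Return the update sum_k w_k * (B_k @ A_k) with w = softmax(logits).

    basis:  list of (B_k, A_k) pairs; B_k is d_out x r, A_k is r x d_in.
    logits: one score per basis element. In this sketch they stand in for
            the output of the analogy encoder (an assumption, see lead-in).
    """
    weights = softmax(logits)
    d_out = len(basis[0][0])
    d_in = len(basis[0][1][0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for w, (B, A) in zip(weights, basis):
        update = matmul(B, A)  # low-rank update contributed by one basis LoRA
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += w * update[i][j]
    return delta, weights


# Toy basis: two rank-1 LoRAs acting on a 2x2 weight matrix.
basis = [
    ([[1.0], [0.0]], [[1.0, 0.0]]),  # touches entry (0, 0) only
    ([[0.0], [1.0]], [[0.0, 1.0]]),  # touches entry (1, 1) only
]
delta, weights = mix_lora_basis(basis, logits=[0.0, 0.0])
```

With equal logits the two basis elements contribute equally, so the mixed update sits halfway between them. Changing the logits moves the result continuously between the basis LoRAs, which is the sense in which a basis can cover transformations that no single fixed module represents.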

