Spanning the Visual Analogy Space with a Weight Basis of LoRAs
February 17, 2026
Authors: Hila Manor, Rinon Gal, Haggai Maron, Tomer Michaeli, Gal Chechik
cs.AI
Abstract
Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations that are difficult to articulate in words. Given a triplet {a, a', b}, the goal is to generate b' such that a : a' :: b : b'. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives (informally, choosing a point in a "space of LoRAs"). We introduce two key components: (1) a learnable basis of LoRA modules that spans the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weights these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate that our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are available at https://research.nvidia.com/labs/par/lorweb
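
To make the two components concrete, below is a minimal, hypothetical PyTorch sketch of one way a weighted LoRA basis could be wired: a layer holding K low-rank factor pairs whose combined update is weighted by coefficients predicted from an embedding of the analogy pair. The class names, shapes, MLP encoder, and softmax weighting are illustrative assumptions for exposition, not the authors' actual LoRWeB implementation.

```python
import torch
import torch.nn as nn


class LoRABasisLayer(nn.Module):
    """Linear layer augmented with a learnable basis of K LoRA modules.

    The effective weight update is a weighted sum of rank-r factors:
        delta_W = sum_k alpha_k * (B_k @ A_k)
    where alpha is predicted per analogy pair by an external encoder.
    """

    def __init__(self, in_features: int, out_features: int, rank: int = 4, num_basis: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        # Basis of LoRA factors: K pairs of low-rank matrices (A down-projects, B up-projects).
        self.lora_A = nn.Parameter(torch.randn(num_basis, rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_basis, out_features, rank))

    def forward(self, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # alpha: (num_basis,) coefficients selecting/weighting the basis LoRAs.
        delta = torch.einsum("k,kor,kri->oi", alpha, self.lora_B, self.lora_A)
        return self.base(x) + x @ delta.T


class AnalogyEncoder(nn.Module):
    """Lightweight encoder mapping an analogy-pair embedding to basis coefficients."""

    def __init__(self, embed_dim: int, num_basis: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, num_basis))

    def forward(self, pair_embedding: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(pair_embedding), dim=-1)


if __name__ == "__main__":
    layer = LoRABasisLayer(64, 64)
    encoder = AnalogyEncoder(embed_dim=32)
    pair_emb = torch.randn(32)      # stand-in for features of the demonstration pair (a, a')
    alpha = encoder(pair_emb)       # per-analogy weights over the LoRA basis
    out = layer(torch.randn(5, 64), alpha)
    print(out.shape)                # torch.Size([5, 64])
```

In this sketch the base weights stay frozen at inference while the per-analogy coefficients recombine the shared basis, which is the sense in which each analogy task "chooses a point" in a space of LoRAs.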