Spanning the Visual Analogy Space with a Weight Basis of LoRAs
February 17, 2026
Authors: Hila Manor, Rinon Gal, Haggai Maron, Tomer Michaeli, Gal Chechik
cs.AI
Abstract
Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations that are difficult to articulate in words. Given a triplet {a, a', b}, the goal is to generate b' such that a : a' :: b : b'. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives (informally, choosing a point in a "space of LoRAs"). We introduce two key components: (1) a learnable basis of LoRA modules that spans the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weights these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate that our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are available at https://research.nvidia.com/labs/par/lorweb
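
To make the two components concrete, below is a minimal, hypothetical PyTorch sketch of one way a weighted LoRA basis could be wired: a layer holding K low-rank factor pairs whose combined update is weighted by coefficients predicted from an embedding of the analogy pair. The class names, shapes, MLP encoder, and softmax weighting are illustrative assumptions for exposition, not the authors' actual LoRWeB implementation.

```python
import torch
import torch.nn as nn


class LoRABasisLayer(nn.Module):
    """Linear layer augmented with a learnable basis of K LoRA modules.

    The effective weight update is a weighted sum of rank-r factors:
        delta_W = sum_k alpha_k * (B_k @ A_k)
    where alpha is predicted per analogy pair by an external encoder.
    """

    def __init__(self, in_features: int, out_features: int, rank: int = 4, num_basis: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        # Basis of LoRA factors: K pairs of low-rank matrices (A down-projects, B up-projects).
        self.lora_A = nn.Parameter(torch.randn(num_basis, rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_basis, out_features, rank))

    def forward(self, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # alpha: (num_basis,) coefficients selecting/weighting the basis LoRAs.
        delta = torch.einsum("k,kor,kri->oi", alpha, self.lora_B, self.lora_A)
        return self.base(x) + x @ delta.T


class AnalogyEncoder(nn.Module):
    """Lightweight encoder mapping an analogy-pair embedding to basis coefficients."""

    def __init__(self, embed_dim: int, num_basis: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, num_basis))

    def forward(self, pair_embedding: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(pair_embedding), dim=-1)


if __name__ == "__main__":
    layer = LoRABasisLayer(64, 64)
    encoder = AnalogyEncoder(embed_dim=32)
    pair_emb = torch.randn(32)      # stand-in for features of the demonstration pair (a, a')
    alpha = encoder(pair_emb)       # per-analogy weights over the LoRA basis
    out = layer(torch.randn(5, 64), alpha)
    print(out.shape)                # torch.Size([5, 64])
```

In this sketch the base weights stay frozen at inference while the per-analogy coefficients recombine the shared basis, which is the sense in which each analogy task "chooses a point" in a space of LoRAs.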