Spanning the Visual Analogy Space with a Weight Basis of LoRAs
February 17, 2026
Authors: Hila Manor, Rinon Gal, Haggai Maron, Tomer Michaeli, Gal Chechik
cs.AI
Abstract
Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations that are difficult to articulate in words. Given a triplet {a, a', b}, the goal is to generate b' such that a : a' :: b : b'. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives: informally, choosing a point in a "space of LoRAs". We introduce two key components: (1) a learnable basis of LoRA modules that spans the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weights these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate that our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are available at https://research.nvidia.com/labs/par/lorweb
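The two components described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the general idea, not the authors' implementation: `LoRABasisLinear`, the shapes, the number of basis modules, and the softmax-normalized coefficients are all assumptions made for the example. A bank of K low-rank updates W_k = B_k A_k is mixed by per-example coefficients that, in the full method, would come from the lightweight encoder applied to the analogy pair.

```python
import torch
import torch.nn as nn

class LoRABasisLinear(nn.Module):
    """A linear layer augmented with a learnable basis of K LoRA modules.

    Hypothetical sketch: the combination coefficients are supplied at
    forward time (in the paper's setting, predicted by a lightweight
    encoder from the input analogy pair {a, a'}).
    """

    def __init__(self, in_dim, out_dim, rank=4, num_basis=8):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        # K pairs of low-rank factors; W_k = B[k] @ A[k] has shape (out, in).
        self.A = nn.Parameter(torch.randn(num_basis, rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_basis, out_dim, rank))

    def forward(self, x, coeffs):
        # coeffs: (num_basis,) mixing weights selecting a point in the
        # "space of LoRAs"; delta sums the weighted low-rank updates.
        delta = torch.einsum("k,kor,kri->oi", coeffs, self.B, self.A)
        return self.base(x) + x @ delta.t()

layer = LoRABasisLinear(in_dim=16, out_dim=16)
# Stand-in for the encoder's output over an analogy pair.
coeffs = torch.softmax(torch.randn(8), dim=0)
y = layer(torch.randn(2, 16), coeffs)
```

Because the `B` factors start at zero, the combined update is initially the identity perturbation, a common LoRA initialization choice that leaves the pretrained model's behavior unchanged before training.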