Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Abstract

Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations that are difficult to articulate in words. Given a triplet {a, a', b}, the goal is to generate b' such that a : a' :: b : b'. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: capturing the diverse space of visual transformations within one fixed adaptation module constrains generalization. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives. Informally, it chooses a point in a "space of LoRAs". We introduce two key components: (1) a learnable basis of LoRA modules that spans the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weights these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate that our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are available at https://research.nvidia.com/labs/par/lorweb
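The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes, the softmax normalization of the mixing weights, the single linear layer standing in for the lightweight encoder, and all names (`mix_lora_basis`, `encoder_proj`, `pair_embedding`) are assumptions made for illustration only. It shows one way a weighted combination of K basis LoRAs could produce a per-task weight update for a single layer.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector of logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mix_lora_basis(A_basis, B_basis, weights):
    """Weighted sum of K low-rank updates: delta_W = sum_k w_k * (B_k @ A_k).

    A_basis: (K, r, d_in), B_basis: (K, d_out, r), weights: (K,)
    Returns an array of shape (d_out, d_in).
    """
    return np.einsum("k,kor,kri->oi", weights, B_basis, A_basis)

rng = np.random.default_rng(0)
K, r, d_in, d_out, d_emb = 4, 2, 8, 6, 5   # hypothetical sizes

# Component (1): a basis of K LoRA modules (random here; learned in practice).
A_basis = rng.normal(size=(K, r, d_in))
B_basis = rng.normal(size=(K, d_out, r))

# Component (2): a stand-in for the lightweight encoder. A single linear map
# turns an (assumed) embedding of the input analogy pair into K mixing weights.
pair_embedding = rng.normal(size=d_emb)
encoder_proj = rng.normal(size=(K, d_emb))
weights = softmax(encoder_proj @ pair_embedding)  # softmax is an assumption

# Specialize one frozen layer for this analogy task.
W_base = rng.normal(size=(d_out, d_in))
delta_W = mix_lora_basis(A_basis, B_basis, weights)
W_adapted = W_base + delta_W
```

In this sketch, changing `pair_embedding` changes `weights` and therefore the point selected in the span of the basis, while the basis and the base weights stay fixed at inference time.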
