FashionLens: Versatile Fashion Image Retrieval via Task-Adaptive Learning

Abstract

Fashion image retrieval is a cornerstone of modern e-commerce systems. In practice, a unified framework that supports diverse query formats and search intentions is highly desirable. However, existing approaches focus on narrow retrieval tasks and do not fully capture this diversity. In this work, we develop a unified framework that handles diverse, realistic fashion retrieval scenarios, with the goal of truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection and is supplemented by two manually curated datasets for testing generalization. Building on this benchmark, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. To mitigate the optimization imbalance caused by varying task complexities and data scales, we further develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on real-time learning difficulty and a data-scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly available at https://github.com/haokunwen/FashionLens.
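For readers unfamiliar with spherical linear interpolation, the operation the calibrator relies on can be sketched as follows. This is a minimal, generic illustration and not the FashionLens implementation: the function names `normalize` and `slerp`, the fixed interpolation weight `t`, and the treatment of the second vector as a task "proposal" embedding are all assumptions made for the example. The abstract states only that the interpolation is adaptive, so in the actual method the weight is presumably predicted rather than fixed.

```python
import math


def normalize(v):
    """Scale a vector to unit length (hypothetical helper for this sketch)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(query, proposal, t, eps=1e-7):
    """Spherical linear interpolation between two embeddings.

    Both inputs are projected onto the unit sphere first. t=0 returns the
    normalized query, t=1 returns the normalized proposal, and values in
    between move along the great-circle arc connecting them.
    """
    q, p = normalize(query), normalize(proposal)
    # Clamp the cosine to [-1, 1] to guard against floating-point drift.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(q, p))))
    omega = math.acos(dot)  # angle between the two unit vectors
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        if dot < 0.0:
            # Antipodal vectors: the great-circle path is not unique.
            raise ValueError("slerp is undefined for antipodal vectors")
        # Nearly parallel vectors: linear interpolation is a safe fallback.
        return normalize([(1.0 - t) * a + t * b for a, b in zip(q, p)])
    w_q = math.sin((1.0 - t) * omega) / sin_omega
    w_p = math.sin(t * omega) / sin_omega
    return [w_q * a + w_p * b for a, b in zip(q, p)]


# Assumed toy inputs: a 2-D query embedding and a 2-D task proposal embedding.
calibrated = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
print(calibrated)
```

Unlike plain linear interpolation, the result stays on the unit sphere for every `t`, which matters when retrieval scores are cosine similarities.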