FashionLens: Task-Adaptive Learning for Versatile Fashion Image Retrieval

Abstract

Fashion image retrieval is a cornerstone of modern e-commerce systems. In practice, a unified framework that supports diverse query formats and search intentions is highly desirable. However, existing approaches focus on narrow retrieval tasks and do not fully capture this diversity. In this work, we therefore aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on real-time learning difficulty and a data-scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at https://github.com/haokunwen/FashionLens.
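The abstract names spherical linear interpolation as the operation behind the query calibrator but does not specify how it is applied. As a rough illustration only, the sketch below shows generic spherical linear interpolation between two vectors on the unit sphere. The function names `normalize` and `slerp`, the `anchor` vector, and the fixed weight `t` are all hypothetical; in particular, how FashionLens chooses the interpolation weight adaptively per task is not described here and is not modeled.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(query, anchor, t, eps=1e-6):
    """Generic spherical linear interpolation (not the paper's calibrator).

    t = 0 returns the normalized query, t = 1 returns the normalized anchor.
    `anchor` and `t` are hypothetical stand-ins for a task-specific target
    direction and an interpolation weight.
    """
    q = normalize(query)
    a = normalize(anchor)
    # Clamp the cosine to guard against rounding outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(q, a))))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel or antipodal inputs: the slerp weights are
        # ill-conditioned, so fall back to normalized linear interpolation.
        # Exactly antipodal inputs with t = 0.5 have no unique path and
        # raise ValueError inside normalize().
        mixed = [(1.0 - t) * x + t * y for x, y in zip(q, a)]
        return normalize(mixed)
    w_q = math.sin((1.0 - t) * omega) / sin_omega
    w_a = math.sin(t * omega) / sin_omega
    return [w_q * x + w_a * y for x, y in zip(q, a)]


if __name__ == "__main__":
    # Halfway between two orthogonal unit vectors.
    print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

Unlike plain linear interpolation, the result stays on the unit sphere for every `t`, which is the property that matters when retrieval scores are computed with cosine similarity.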