FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

Abstract

Fashion image retrieval is a cornerstone of modern e-commerce systems. In practice, a unified framework that supports diverse query formats and search intentions is highly desirable. However, existing approaches focus on narrowly defined retrieval tasks and do not fully cover this diversity. In this work, we therefore aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building on this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on real-time learning difficulty and a data-scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at https://github.com/haokunwen/FashionLens.
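For readers unfamiliar with spherical linear interpolation, the following is a minimal sketch of the generic operation the calibrator builds on. It is not the FashionLens implementation: the names `slerp`, `query`, and `task_proposal`, and the scalar step `t`, are assumptions made for illustration, and the abstract does not specify how the adaptive interpolation weight is computed.

```python
import math


def normalize(v):
    """Scale a vector to unit length so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(query, task_proposal, t, eps=1e-7):
    """Generic spherical linear interpolation between two directions.

    `query` and `task_proposal` are hypothetical stand-ins for a query
    embedding and a task-specific target direction. `t` in [0, 1] is the
    interpolation weight: 0 returns the query direction, 1 the proposal.
    """
    q = normalize(query)
    p = normalize(task_proposal)
    # Clamp the cosine to guard against floating-point drift outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(q, p))))
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # Degenerate angle: fall back to linear interpolation, renormalized.
        mixed = [(1.0 - t) * a + t * b for a, b in zip(q, p)]
        return normalize(mixed)
    w_q = math.sin((1.0 - t) * theta) / sin_theta
    w_p = math.sin(t * theta) / sin_theta
    return [w_q * a + w_p * b for a, b in zip(q, p)]


# Moving halfway along the arc between two orthogonal directions.
halfway = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike plain linear interpolation, the result stays on the unit sphere for every `t`, which matters when retrieval scores are cosine similarities. In an adaptive variant, `t` would be predicted per query rather than fixed.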