Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework
November 17, 2025
Authors: Diego Ortego, Marlon Rodríguez, Mario Almagro, Kunal Dahiya, David Jiménez, Juan C. SanMiguel
cs.AI
Abstract
Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects of XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that each plays a critical role in XMC on its own, and that the two can be combined for improved performance. We show that a decoder with a few billion parameters can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image, which limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoders in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets that exploits visual metadata, and we make it available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image-enhanced versions validate our proposals' effectiveness, surpassing the previous state of the art by up to +8.21% in P@1 on the largest dataset. ViXML's code is available at https://github.com/DiegoOrtego/vixml.
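To make the retrieval formulation in the abstract concrete, below is a minimal sketch (not the authors' implementation) of the pipeline it describes: a query text embedding, a single pooled image embedding, a simple fusion of the two, and prediction as a maximum inner product search over label embeddings. All function names, the mean pooling and fusion-by-addition choices, and the random tensors standing in for encoder outputs are illustrative assumptions.

```python
# Hypothetical sketch of XMC as maximum inner product search with one pooled
# image embedding per query, in the spirit of the abstract. Not the ViXML code.
import torch
import torch.nn.functional as F


def pool_image_embedding(patch_features: torch.Tensor) -> torch.Tensor:
    """Collapse per-patch vision features (num_patches, dim) into one embedding."""
    return patch_features.mean(dim=0)


def fuse(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Fuse modalities by summing L2-normalized embeddings (one of many options)."""
    return F.normalize(text_emb, dim=-1) + F.normalize(image_emb, dim=-1)


def top_k_labels(query_emb: torch.Tensor, label_embs: torch.Tensor, k: int = 5):
    """Maximum inner product search over the label embedding matrix (num_labels, dim)."""
    scores = label_embs @ query_emb  # (num_labels,)
    return torch.topk(scores, k=k)


# Toy usage with random tensors standing in for encoder outputs.
dim, num_labels = 768, 10_000
text_emb = torch.randn(dim)          # query text embedding from a text encoder
patches = torch.randn(196, dim)      # per-patch features from a vision model
label_embs = F.normalize(torch.randn(num_labels, dim), dim=-1)

query = F.normalize(fuse(text_emb, pool_image_embedding(patches)), dim=-1)
scores, label_ids = top_k_labels(query, label_embs)
```

In a real system, the text embedding would come from a small encoder-only model or a decoder-only language model, and the patch features from a frozen vision foundation model; reducing each image to a single vector before fusion is what keeps the multi-modal overhead small.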