
ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning

October 21, 2025
作者: Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, Junchi Yan
cs.AI

Abstract

Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction owing to its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) they require training or accessing an additional reference model, and (2) they rely solely on loss information for token selection, which cannot adequately preserve semantically important tokens that loss-based metrics do not favor. To address these challenges, we propose ssToken, a Self-modulated and Semantic-aware Token Selection approach. ssToken leverages readily accessible history models to compute the per-token loss difference with the current model, which serves as a self-modulated signal that enables the model to adaptively select tokens along its optimization trajectory, rather than relying on excess loss from an offline-trained reference model as in prior work. We further introduce a semantic-aware, attention-based token importance estimation metric, orthogonal to loss-based selection, that provides complementary semantic information for more effective filtering. Extensive experiments across different model families and scales demonstrate that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning, while their integration, ssToken, achieves synergistic gains and further surpasses prior token-level selection methods, delivering performance improvements while maintaining training efficiency.
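
The abstract does not give the exact scoring or fusion rule, so the sketch below is only a minimal illustration of the two signals it describes: a per-token loss difference against a history model (the self-modulated signal) and an attention-based importance score (the semantic-aware signal). The function names, the z-score fusion with weight `alpha`, and the `keep_ratio` budget are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def per_token_loss(logits, labels):
    """Unreduced cross-entropy per token.
    logits: (seq, vocab); labels: (seq,) next-token targets
    (the usual shift-by-one is assumed to be handled upstream)."""
    return F.cross_entropy(logits, labels, reduction="none")

def select_tokens(curr_logits, hist_logits, attn, labels,
                  keep_ratio=0.6, alpha=0.5):
    """Return a boolean mask over token positions to keep in the SFT loss.
    attn: (heads, seq, seq) attention weights from one layer of the
    current model. keep_ratio, alpha, and the z-score fusion are
    illustrative choices, not taken from the paper."""
    # Self-modulated signal: how much higher the current model's loss is
    # than the history model's on each token.
    loss_diff = (per_token_loss(curr_logits, labels)
                 - per_token_loss(hist_logits, labels))

    # Semantic-aware signal: average attention each token *receives*,
    # averaged over heads and query positions.
    importance = attn.mean(dim=0).mean(dim=0)  # -> (seq,)

    # Normalize both signals to a comparable scale, then mix.
    def z(x):
        return (x - x.mean()) / (x.std() + 1e-6)
    score = alpha * z(loss_diff) + (1.0 - alpha) * z(importance)

    # Keep the top keep_ratio fraction of tokens.
    k = max(1, int(keep_ratio * labels.numel()))
    mask = torch.zeros_like(labels, dtype=torch.bool)
    mask[torch.topk(score, k).indices] = True
    return mask

# Toy usage with random tensors standing in for real model outputs.
seq, vocab, heads = 16, 100, 4
curr_logits = torch.randn(seq, vocab)
hist_logits = torch.randn(seq, vocab)
attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
labels = torch.randint(0, vocab, (seq,))
mask = select_tokens(curr_logits, hist_logits, attn, labels)
# Masked SFT loss over the selected tokens only:
sft_loss = (per_token_loss(curr_logits, labels) * mask).sum() / mask.sum()
```

Note that in the method as described, the second set of logits comes from an earlier checkpoint of the fine-tuning model itself (the "history model"), which is what removes the need for a separately trained reference model.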