ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning

October 21, 2025
Authors: Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, Junchi Yan
cs.AI

Abstract

Data quality plays a critical role in supervised fine-tuning (SFT) of large language models (LLMs), and token-level data selection has emerged as a promising direction because of its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) they require training or accessing an additional reference model, and (2) they rely solely on loss information for token selection, which fails to preserve semantically important tokens that loss-based metrics do not favor. To address these challenges, we propose ssToken, a Self-modulated and Semantic-aware Token Selection approach. ssToken uses readily accessible history models to compute the per-token loss difference with the current model. This difference serves as a self-modulated signal that lets the model adaptively select tokens along its optimization trajectory, rather than relying on excess loss from an offline-trained reference model as in prior work. We further introduce a semantic-aware, attention-based token importance estimation metric that is orthogonal to loss-based selection and provides complementary semantic information for more effective filtering. Extensive experiments across different model families and scales demonstrate that self-modulated selection and semantic-aware selection each outperform full-data fine-tuning on their own, while their integration, ssToken, achieves synergistic gains and further surpasses prior token-level selection methods, improving performance while maintaining training efficiency.
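The selection idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so everything below is an assumption made for illustration. This covers the sign of the loss difference (history loss minus current loss), the min-max normalization, the mixing weight `gamma`, the `keep_ratio` top-k rule, and the function names. Per-token losses and attention-based scores are assumed to be precomputed elsewhere.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_tokens(
    current_loss: List[float],
    history_loss: List[float],
    attention_score: List[float],
    gamma: float = 0.5,       # hypothetical weight on the loss-based signal
    keep_ratio: float = 0.5,  # hypothetical fraction of tokens to keep
) -> List[bool]:
    """Return a keep/drop mask over tokens (illustrative rule only)."""
    n = len(current_loss)
    if not (n == len(history_loss) == len(attention_score)):
        raise ValueError("all per-token lists must have the same length")
    if n == 0:
        return []

    # Assumed sign convention: loss under the history model minus loss
    # under the current model. The abstract only says "loss difference".
    loss_diff = [h - c for c, h in zip(current_loss, history_loss)]

    # Put both signals on a common [0, 1] scale before mixing them.
    loss_part = min_max_normalize(loss_diff)
    attn_part = min_max_normalize(attention_score)
    combined = [gamma * l + (1.0 - gamma) * a
                for l, a in zip(loss_part, attn_part)]

    # Keep the top-scoring fraction of tokens (at least one token).
    k = max(1, int(keep_ratio * n))
    ranked = sorted(range(n), key=lambda i: combined[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(n)]


def masked_mean_loss(current_loss: List[float], mask: List[bool]) -> float:
    """Average the current-model loss over the selected tokens only."""
    kept = [l for l, m in zip(current_loss, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Toy numbers, made up for illustration.
    cur = [2.0, 0.5, 1.5, 0.1]
    hist = [2.5, 0.6, 3.0, 0.1]
    attn = [0.1, 0.4, 0.3, 0.2]
    mask = select_tokens(cur, hist, attn)
    print(mask, masked_mean_loss(cur, mask))
```

Setting `gamma=1.0` in this sketch reduces it to loss-only selection, and `gamma=0.0` to attention-only selection, which mirrors the two components the abstract describes as complementary.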