How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning
February 11, 2026
Authors: Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Yang Chen, Xiaotong Lin, Wuliang Huang, Ziyi Gao, Xing Fu, Yu Cheng, Weiqiang Wang
cs.AI
Abstract
Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind-GGSM.
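To make the causal-to-bidirectional transition concrete, below is a minimal sketch of a linearly scheduled soft attention mask in PyTorch. The function name soft_causal_bias, the max_penalty knob, and the step-based schedule are illustrative assumptions; the abstract does not specify how the gradient-guided pre-warmup weights positions, so only the linear scheduler that gradually opens future attention is sketched here.

```python
import torch


def soft_causal_bias(seq_len: int, step: int, total_steps: int,
                     max_penalty: float = 10.0) -> torch.Tensor:
    """Additive attention bias that linearly relaxes a causal mask into a
    bidirectional one over `total_steps` optimization steps.

    Future positions start with a large negative bias (near-causal) and
    end at 0 (fully bidirectional). `max_penalty` is an illustrative
    assumption; a very large value recovers a hard causal mask at step 0.
    """
    # Fraction of "future attention" opened so far, clipped to [0, 1].
    alpha = min(max(step / max(total_steps, 1), 0.0), 1.0)

    # Strictly upper-triangular entries correspond to future positions.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

    # Penalty on future positions shrinks linearly toward 0 over training.
    bias = torch.zeros(seq_len, seq_len)
    bias[future] = -(1.0 - alpha) * max_penalty
    return bias  # added to the attention logits before softmax


# Example: the bias halfway through the schedule for a length-4 sequence.
print(soft_causal_bias(seq_len=4, step=500, total_steps=1000))
```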