How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

February 11, 2026
Authors: Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Yang Chen, Xiaotong Lin, Wuliang Huang, Ziyi Gao, Xing Fu, Yu Cheng, Weiqiang Wang
cs.AI

Abstract

Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind-GGSM.
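
As a rough illustration of the masking transition described above, the following minimal PyTorch-style sketch shows how a linear scheduler could gradually open future attention by shrinking an additive penalty on above-diagonal positions. It is an assumption-laden sketch, not the authors' released implementation: the gradient-guided pre-warmup is reduced to a fixed step count, and the function names, signatures, and max_penalty value are all hypothetical.

import torch

def soft_future_bias(seq_len: int, openness: float, max_penalty: float = 10.0) -> torch.Tensor:
    # Additive attention bias interpolating between causal and bidirectional attention.
    # openness = 0.0 -> future positions get -max_penalty (effectively causal);
    # openness = 1.0 -> all-zero bias (fully bidirectional).
    # max_penalty is a hyperparameter of this sketch, not taken from the paper.
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)  # 1 strictly above the diagonal
    return future * (-(1.0 - openness) * max_penalty)

def linear_openness(step: int, prewarm_steps: int, open_steps: int) -> float:
    # Keep the mask causal during a pre-warmup phase, then open it linearly.
    # The paper's pre-warmup is gradient-guided; a fixed step count stands in for it here.
    if step < prewarm_steps:
        return 0.0
    return min(1.0, (step - prewarm_steps) / max(1, open_steps))

# Example: halfway through the opening phase, future tokens are partially visible.
bias = soft_future_bias(seq_len=6, openness=linear_openness(step=2000, prewarm_steps=1000, open_steps=2000))
# The bias would be added to the attention logits before the softmax.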