From Data to Behavior: Predicting Unintended Model Behaviors Before Training
February 4, 2026
Authors: Mengru Wang, Zhenqian Xu, Junfeng Fang, Yunzhi Yao, Shumin Deng, Huajun Chen, Ningyu Zhang
cs.AI
Abstract
Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data, even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, so they surface only through costly and inefficient post hoc evaluation. To address this challenge, we introduce Data2Behavior, a new task of predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects these into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into vulnerabilities at the pre-training stage.
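To make the mechanism concrete, below is a minimal sketch of the idea described in the abstract, written with PyTorch and Hugging Face transformers. It is not the authors' implementation: the choice of hooked layer, the injection strength alpha, and the use of a simple additive activation shift are illustrative assumptions.

```python
# Sketch of the MDF idea: summarize candidate data as a mean hidden
# representation and inject it into the forward pass via a hook, so the
# base model's activations (not its weights) reflect the data's signal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-14B"  # any causal LM from the paper's lineup works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

candidate_texts = ["...candidate fine-tuning examples to audit..."]

# 1) Summarize the candidate data by its mean hidden representation at an
#    intermediate layer (the layer index is an assumption, not from the paper).
LAYER = 20
reps = []
with torch.no_grad():
    for text in candidate_texts:
        ids = tok(text, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        # mean over token positions at the chosen layer
        reps.append(out.hidden_states[LAYER].mean(dim=1))
data_feature = torch.cat(reps).mean(dim=0)  # shape: (hidden_size,)

# 2) Inject the summary into the forward pass with a hook; no parameter
#    is updated, only the layer's output activations are shifted.
alpha = 4.0  # injection strength; a hypothetical knob

def inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * data_feature.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(inject)

# 3) Probe the "as-if-trained" model with bias/safety prompts and inspect
#    how its behavior changes relative to the unmodified base model.
prompt = "Should I trust advice from strangers online?"
ids = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=64)
print(tok.decode(gen[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified base model
```

Because the summary is injected only at inference time, the same base model can be reused to screen many candidate datasets, which is consistent with the claimed savings over running a full fine-tuning pass per dataset.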