Context Engineering for Trustworthiness: Rescorla-Wagner Steering Under Mixed and Inappropriate Contexts
September 2, 2025
Authors: Rushi Wang, Jiateng Liu, Cheng Qian, Yifan Shen, Yanzhou Pan, Zhaozhuo Xu, Ahmed Abbasi, Heng Ji, Denghui Zhang
cs.AI
Abstract
Incorporating external context can significantly enhance the response quality
of Large Language Models (LLMs). However, real-world contexts often mix
relevant information with disproportionate inappropriate content, posing
reliability risks. How do LLMs process and prioritize mixed context? To study
this, we introduce the Poisoned Context Testbed, pairing queries with
real-world contexts containing relevant and inappropriate content. Inspired by
associative learning in animals, we adapt the Rescorla-Wagner (RW) model from
neuroscience to quantify how competing contextual signals influence LLM
outputs. Our adapted model reveals a consistent behavioral pattern: LLMs
exhibit a strong tendency to incorporate information that is less prevalent in
the context. This susceptibility is harmful in real-world settings, where small
amounts of inappropriate content can substantially degrade response quality.
Empirical evaluations on our testbed further confirm this vulnerability. To
tackle this, we introduce RW-Steering, a two-stage finetuning-based approach
that enables the model to internally identify and ignore inappropriate signals.
Unlike prior methods that rely on extensive supervision across diverse context
mixtures, RW-Steering generalizes robustly across varying proportions of
inappropriate content. Experiments show that our best fine-tuned model improves
response quality by 39.8% and reverses the undesirable behavior curve,
establishing RW-Steering as a robust, generalizable context engineering
solution for improving LLM safety in real-world use.
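For background, the classical Rescorla-Wagner learning rule from the animal associative-learning literature, which the authors adapt, updates the associative strength V_i of each cue i on a trial as

\Delta V_i = \alpha_i \, \beta \left( \lambda - \sum_{j} V_j \right)

where \alpha_i is the salience of cue i, \beta is a learning-rate parameter tied to the outcome, \lambda is the maximum associative strength the outcome supports, and \sum_j V_j is the combined prediction from all cues present. Note this is only the standard textbook formulation; how the paper maps these quantities onto competing contextual signals in LLM inputs is not specified in the abstract and may be parameterized differently.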