Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts

Abstract

Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with a disproportionate amount of inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed context? To study this, we introduce the Poisoned Context Testbed, which pairs queries with real-world contexts containing both relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where even small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To address it, we introduce RW-Steering, a two-stage fine-tuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable context engineering solution for improving LLM safety in real-world use.
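For readers unfamiliar with the Rescorla-Wagner model, the following is a minimal sketch of the classical update rule from associative learning, in which every cue present on a trial is updated by a shared prediction error: ΔV_i = α_i · β · (λ − Σ_j V_j). It is not the adapted model described in the abstract, whose exact form is not given here. The function names, the cue labels `relevant` and `inappropriate`, and all parameter values are hypothetical and chosen only for illustration.

```python
def rescorla_wagner_step(strengths, present, alpha, beta, lam):
    """Apply one classical Rescorla-Wagner trial and return new strengths.

    strengths: dict mapping cue name -> current associative strength V
    present:   cues that appear on this trial
    alpha:     dict mapping cue name -> salience of that cue
    beta:      learning rate associated with the outcome
    lam:       maximum associative strength the outcome can support
    """
    present = list(present)
    # Prediction error is shared by all cues present on the trial.
    error = lam - sum(strengths[cue] for cue in present)
    updated = dict(strengths)
    for cue in present:
        updated[cue] = strengths[cue] + alpha[cue] * beta * error
    return updated


def simulate(trials, alpha, beta=1.0, lam=1.0):
    """Run a sequence of trials starting from zero associative strength."""
    strengths = {cue: 0.0 for cue in alpha}
    for present in trials:
        strengths = rescorla_wagner_step(strengths, present, alpha, beta, lam)
    return strengths


if __name__ == "__main__":
    # Hypothetical saliences: two cues always presented together.
    alpha = {"relevant": 0.3, "inappropriate": 0.1}
    trials = [("relevant", "inappropriate")] * 50
    print(simulate(trials, alpha))
```

Because the two cues compete for the same prediction error, their combined strength approaches `lam`, and each cue's share is proportional to its salience. This competition between co-occurring signals is the classical property that makes the model a natural starting point for studying mixed contexts.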

