

Steering When Necessary: Flexible Steering Large Language Models with Backtracking

August 25, 2025
作者: Jinwei Gan, Zifeng Cheng, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu
cs.AI

Abstract

Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors while avoiding the high cost of fine-tuning. Existing methods typically either intervene indiscriminately in all generations or rely solely on the question to decide when to intervene, which limits accurate assessment of the required intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention by tracking the internal states of the LLM during generation, considering both the question and the generated content. Since intervening only after detecting a deviation from the desired behavior is often too late, we further propose a backtracking mechanism that corrects the deviated tokens and steers the LLM toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at https://github.com/gjw185/FASB.
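The core idea described in the abstract, steering only when a deviation is detected and backtracking to re-correct the tokens already emitted, can be sketched in a few lines. The probe, steering vector, threshold, and strength below are all hypothetical toy stand-ins for the trained components the paper describes; in FASB these would operate on real LLM hidden states.

```python
# Toy sketch of conditional (flexible) activation steering with backtracking.
# STEER_VEC, THRESHOLD, and ALPHA are assumed placeholders, not values from the paper.

STEER_VEC = [1.0, 0.0, 0.0, 0.0]   # assumed "desired behavior" direction
THRESHOLD = 0.5                    # intervene when the probe score drops below this
ALPHA = 1.0                        # steering strength

def probe(h):
    """Score a hidden state by its projection onto the steering direction."""
    return sum(a * b for a, b in zip(h, STEER_VEC))

def steer(hidden_states, backtrack=1):
    """Add the steering vector only at positions where the probe flags a
    deviation, also re-correcting up to `backtrack` earlier positions."""
    out = [list(h) for h in hidden_states]
    for t in range(len(out)):
        if probe(out[t]) < THRESHOLD:            # deviation detected at step t
            for i in range(max(0, t - backtrack), t + 1):
                out[i] = [a + ALPHA * b for a, b in zip(out[i], STEER_VEC)]
    return out

states = [[0.9, 0.1, 0.0, 0.0],    # aligned: left untouched by its own check
          [0.1, 0.9, 0.0, 0.0]]    # deviated: steered, predecessor re-corrected
steered = steer(states)
```

The key contrast with indiscriminate steering is the probe-gated condition: aligned states pass through unmodified, and when a deviation does appear, the backtracking loop also revisits the preceding positions rather than only patching the current one.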