

Steering When Necessary: Flexible Steering Large Language Models with Backtracking

August 25, 2025
作者: Jinwei Gan, Zifeng Cheng, Zhiwei Jiang, Cong Wang, Yafeng Yin, Xiang Luo, Yuchen Fu, Qing Gu
cs.AI

Abstract
Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors while avoiding the high cost of fine-tuning. Existing methods typically either intervene indiscriminately in all generations or rely solely on the question to determine intervention, which limits accurate assessment of the intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and the strength of intervention by tracking the internal states of the LLM during generation, considering both the question and the generated content. Since intervening only after detecting a deviation from the desired behavior is often too late, we further introduce a backtracking mechanism that corrects the deviated tokens and steers the LLM back toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at https://github.com/gjw185/FASB.
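The core "steering when necessary" idea, intervening only when an internal probe flags a deviation, with strength scaled to how large the deviation is, can be sketched with toy stand-ins. Everything below (the linear probe `w`, the steering vector `v`, the threshold, and the scaling `alpha`) is an illustrative assumption, not the paper's actual components or released code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a hidden state per generation step, a linear
# probe w whose dot product scores alignment with the desired behavior,
# and a steering vector v (here, the probe direction itself).
dim = 8
w = rng.normal(size=dim)      # probe: higher score = more aligned
v = w / np.linalg.norm(w)     # steer along the probe direction

def probe_score(h: np.ndarray) -> float:
    """Alignment score of a hidden state under the toy probe."""
    return float(h @ w)

def steer_if_needed(h: np.ndarray, threshold: float = 0.0,
                    alpha: float = 2.0):
    """Intervene only when the probe flags a deviation; intervention
    strength scales with how far the state falls below the threshold
    (the 'flexible' part). Returns (possibly steered state, strength).
    The paper additionally backtracks to regenerate already-emitted
    deviated tokens, which this single-state sketch omits."""
    s = probe_score(h)
    if s >= threshold:
        return h, 0.0                      # no intervention needed
    strength = alpha * (threshold - s)     # dynamic strength
    return h + strength * v, strength

hidden = rng.normal(size=dim)
steered, strength = steer_if_needed(hidden)
# Steering along the probe direction can only raise the probe score.
assert probe_score(steered) >= probe_score(hidden)
```

In a real LLM this check would run on intermediate-layer activations at each decoding step (e.g., via a forward hook), so well-aligned generations pass through untouched while deviating ones receive a proportionate correction.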
PDF · August 27, 2025