Steering When Necessary: Flexible Steering Large Language Models with Backtracking

Abstract

Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of an LLM at inference time, aligning its responses with the desired behaviors while avoiding the high cost of fine-tuning. Existing methods typically either intervene indiscriminately in every generation or rely solely on the question to decide whether to intervene, which limits how accurately the intervention strength can be assessed. To address this, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which tracks the internal states of the LLM during generation and dynamically determines both the necessity and the strength of intervention, taking into account both the question and the content generated so far. Because intervening only after a deviation from the desired behavior has been detected is often too late, we further propose a backtracking mechanism that corrects the deviated tokens and steers the LLM back toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms the baselines. Our code will be released at https://github.com/gjw185/FASB.
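The control flow described in the abstract (track the internal state, intervene only when a deviation is detected, then backtrack and regenerate under steering) can be illustrated with a toy sketch. This is not the released FASB code. Everything named here is a hypothetical stand-in: `toy_step` replaces a real LLM forward pass, `probe_w` replaces a trained probe on the activations, and the strength rule `alpha * (threshold - score)`, the fixed backtracking length, and the choice to keep steering on once triggered are assumptions made purely for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_step(tokens, offset):
    """Hypothetical stand-in for one decoding step of an LLM.

    Returns (hidden_state, next_token). The first hidden coordinate drifts
    negative as the sequence grows; `offset` is the steering shift that is
    added to the activation.
    """
    hidden = [1.0 - 0.5 * len(tokens) + offset[0], offset[1]]
    token = "good" if hidden[0] >= 0.0 else "bad"
    return hidden, token


def generate_with_backtracking(step_fn, probe_w, steer_vec, prompt,
                               max_new=4, threshold=0.5, backtrack=1, alpha=8.0):
    """Generate tokens, steering only after the probe flags a deviation.

    Assumed rules (not taken from the paper): the strength is proportional to
    how far the probe score falls below `threshold`, the last `backtrack`
    generated tokens are discarded when a deviation is first detected, and
    steering then stays active for the rest of the generation.
    """
    tokens = list(prompt)
    strength = 0.0  # no intervention until the probe flags a deviation
    while len(tokens) - len(prompt) < max_new:
        # Shift the activation along the steering direction (zero at first).
        offset = [strength * v for v in steer_vec]
        hidden, token = step_fn(tokens, offset)
        # Linear probe: estimated probability that the state is on track.
        score = sigmoid(dot(probe_w, hidden))
        if strength == 0.0 and score < threshold:
            # Deviation detected: choose a strength, drop the most recent
            # generated tokens (never the prompt), and regenerate them.
            strength = alpha * (threshold - score)
            keep = max(len(prompt), len(tokens) - backtrack)
            tokens = tokens[:keep]
            continue
        tokens.append(token)
    return tokens[len(prompt):], strength


if __name__ == "__main__":
    out, used = generate_with_backtracking(
        toy_step, probe_w=[2.0, 0.0], steer_vec=[1.0, 0.0], prompt=["q"])
    print(out, used)
```

In this toy setup, setting `threshold=0.0` disables the intervention, so the sequence drifts into "bad" tokens after the second step. With the default threshold, the probe flags the drift at the third step, one token is discarded, and the remaining tokens are regenerated under steering. In a real model the steering shift would typically be applied to a chosen layer's activations during the forward pass rather than computed in a standalone function.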
