Internal Flow Signatures for Self-Checking and Refinement in LLMs

Abstract

Large language models can generate fluent answers that are unfaithful to the provided context, while many safeguards rely on external verification or a separate judge after generation. We introduce internal flow signatures that audit decision formation from depthwise dynamics at a fixed inter-block monitoring boundary. The method stabilizes token-wise motion via bias-centered monitoring, then summarizes trajectories in compact moving readout-aligned subspaces constructed from the top token and its close competitors within each depth window. Neighboring window frames are aligned by an orthogonal transport, yielding depth-comparable transported step lengths, turning angles, and subspace drift summaries that are invariant to within-window basis choices. A lightweight GRU validator trained on these signatures performs self-checking without modifying the base model. Beyond detection, the validator localizes a culprit depth event and enables a targeted refinement: the model rolls back to the culprit token and clamps an abnormal transported step at the identified block while preserving the orthogonal residual. The resulting pipeline provides actionable localization and low-overhead self-checking from internal decision dynamics. Code is available at github.com/EavnJeong/Internal-Flow-Signatures-for-Self-Checking-and-Refinement-in-LLMs.
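The abstract does not spell out how the orthogonal transport or the clamp are computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each depth window is represented by an orthonormal basis, that neighboring frames are aligned with an orthogonal Procrustes solution, that subspace drift is summarized by the norm of the principal angles, and that clamping rescales only the in-subspace part of a step. All function names and parameters (`procrustes_transport`, `flow_summaries`, `clamp_step`, `max_len`) are hypothetical.

```python
import numpy as np


def orthonormal_basis(vectors):
    """Return an orthonormal basis (d x k) for the span of the columns."""
    q, _ = np.linalg.qr(vectors)
    return q


def procrustes_transport(u_a, u_b):
    """Orthogonal R (k x k) minimizing ||u_a @ R - u_b||_F (assumed transport).

    Also returns the principal angles between the two subspaces.
    """
    w, s, vt = np.linalg.svd(u_a.T @ u_b)
    angles = np.arccos(np.clip(s, -1.0, 1.0))
    return w @ vt, angles


def flow_summaries(u_a, u_b, step_a, step_b):
    """Step length, turning angle, and subspace drift for two neighboring windows.

    step_a and step_b are depthwise hidden-state differences (length d).
    """
    r, angles = procrustes_transport(u_a, u_b)
    c_prev = r.T @ (u_a.T @ step_a)  # previous step transported into frame b
    c_curr = u_b.T @ step_b          # current step in frame b coordinates
    length = np.linalg.norm(c_curr)
    denom = np.linalg.norm(c_prev) * np.linalg.norm(c_curr) + 1e-12
    turn = np.arccos(np.clip(c_prev @ c_curr / denom, -1.0, 1.0))
    drift = np.linalg.norm(angles)   # assumed drift summary
    return length, turn, drift


def clamp_step(u, step, max_len):
    """Shrink the in-subspace part of a step to at most max_len.

    The component orthogonal to the subspace is left unchanged.
    """
    coords = u.T @ step
    inside = u @ coords
    residual = step - inside
    scale = min(1.0, max_len / (np.linalg.norm(coords) + 1e-12))
    return residual + scale * inside


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 16, 3  # illustrative sizes only
    u_a = orthonormal_basis(rng.normal(size=(d, k)))
    u_b = orthonormal_basis(rng.normal(size=(d, k)))
    step_a, step_b = rng.normal(size=d), rng.normal(size=d)
    print(flow_summaries(u_a, u_b, step_a, step_b))
    print(np.linalg.norm(u_b.T @ clamp_step(u_b, step_b, max_len=0.1)))
```

Under these assumptions the three summaries depend only on the subspaces and not on the particular basis chosen inside each window, which is the kind of invariance the abstract describes: rotating either basis by an orthogonal matrix rotates the Procrustes solution accordingly and leaves lengths, angles, and principal angles unchanged.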

