Internal Flow Signatures for Self-Checking and Refinement in LLMs

Abstract

Large language models can generate fluent answers that are unfaithful to the provided context, while many safeguards rely on external verification or a separate judge after generation. We introduce internal flow signatures that audit decision formation from depthwise dynamics at a fixed inter-block monitoring boundary. The method stabilizes token-wise motion via bias-centered monitoring, then summarizes trajectories in compact, moving, readout-aligned subspaces constructed from the top token and its close competitors within each depth window. Neighboring window frames are aligned by an orthogonal transport, yielding depth-comparable transported step lengths, turning angles, and subspace drift summaries that are invariant to within-window basis choices. A lightweight GRU validator trained on these signatures performs self-checking without modifying the base model. Beyond detection, the validator localizes a culprit depth event and enables a targeted refinement: the model rolls back to the culprit token and clamps an abnormal transported step at the identified block while preserving the orthogonal residual. The resulting pipeline provides actionable localization and low-overhead self-checking from internal decision dynamics. Code is available at github.com/EavnJeong/Internal-Flow-Signatures-for-Self-Checking-and-Refinement-in-LLMs.
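The orthogonal transport between neighboring window frames can be illustrated with a minimal NumPy sketch. It assumes the alignment is solved as an orthogonal Procrustes problem (the polar factor of the frame overlap) and that drift is summarized by principal angles; the abstract does not specify either choice. All function names are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def orthonormal_frame(directions):
    """Return a (d, k) orthonormal basis spanning k readout directions given as rows."""
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    return q


def orthogonal_transport(frame_prev, frame_next):
    """Orthogonal (k, k) map from next-frame coordinates to previous-frame coordinates.

    Assumption: the alignment is the polar factor of frame_prev.T @ frame_next,
    i.e. the orthogonal Procrustes solution. Also returns the singular values,
    which are the cosines of the principal angles between the two subspaces.
    """
    overlap = frame_prev.T @ frame_next
    u, cosines, vt = np.linalg.svd(overlap)
    return u @ vt, cosines


def subspace_drift(frame_prev, frame_next):
    """Root-sum-square of the principal angles between the two window subspaces."""
    _, cosines = orthogonal_transport(frame_prev, frame_next)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.sqrt(np.sum(angles ** 2)))


def transported_step(h_before, h_after, frame_prev, frame_next):
    """Step between two hidden states, expressed in previous-frame coordinates."""
    rotation, _ = orthogonal_transport(frame_prev, frame_next)
    coords_before = frame_prev.T @ h_before
    coords_after = rotation @ (frame_next.T @ h_after)
    return coords_after - coords_before


def turning_angle(step_a, step_b, eps=1e-12):
    """Angle in radians between two steps expressed in the same frame."""
    denom = np.linalg.norm(step_a) * np.linalg.norm(step_b) + eps
    return float(np.arccos(np.clip(step_a @ step_b / denom, -1.0, 1.0)))
```

Under these assumptions, replacing `frame_next` with `frame_next @ Q` for any orthogonal `Q` leaves the transported step and the drift unchanged, provided the frame overlap is nonsingular. This is one way to obtain the invariance to within-window basis choices that the abstract describes.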
