Formalizing Latent Thought: Four Axioms of Thought Representation in LLMs

Abstract

We introduce an axiomatic evaluation framework for latent thought representations in large language models (LLMs). The framework comprises metrics that are independent of downstream benchmark scores and that reveal representational failures which benchmark accuracy masks. Because existing evaluations conflate representation quality with model capacity, they cannot attribute a failure to the representation rather than to the model that processes it. We formalize four functional axioms (Causality, Minimality, Separability, and Stability) and define for each a quantitative measure computed directly on the representation, independently of downstream accuracy. We audit open-weight LLMs across 23 reasoning tasks (e.g., spatial reasoning, factual QA). We find that (i) no candidate representation satisfies all four axioms simultaneously, (ii) the representations reliably distinguish task type but cannot distinguish between two questions within the same task, and (iii) the representations encode little information beyond what is already present in the input embedding. These failures are consistent across dense, reasoning-distilled, and RL-trained model families, indicating that the gap is structural rather than a property of model size or training procedure.