The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

Abstract

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause as the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind in which finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model simultaneously achieves low distortion, high mutual information, and global coherence.
