

The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

April 5, 2026
Author: Prashant C. Raju
cs.AI

Abstract

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.
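The core ablation contrasts a continuous output head with a discrete categorical bottleneck on the same underlying data. A minimal toy sketch of that contrast (not the paper's actual protocol: the manifold, codebook construction, and distortion metric here are illustrative stand-ins) quantizes points on a circle into K codebook bins and measures how pairwise geometry degrades. Note that with a fixed uniform codebook, as below, finer quantization monotonically improves geometry; the paper's non-monotonic "double bind" concerns codebooks that are *learned* under a reconstruction objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points on a 1-D continuous manifold (a circle in R^2).
theta = rng.uniform(0, 2 * np.pi, size=200)
points = np.stack([np.cos(theta), np.sin(theta)], axis=1)

def pairwise(x):
    # Full Euclidean distance matrix between rows of x.
    d = x[:, None, :] - x[None, :, :]
    return np.sqrt((d ** 2).sum(-1))

def distortion(true_d, rep_d):
    # Mean absolute deviation between distance matrices:
    # a crude stand-in for a geometric-distortion metric.
    return np.abs(true_d - rep_d).mean()

true_d = pairwise(points)

# "Continuous head": the representation keeps the raw coordinates,
# so pairwise geometry is preserved exactly (distortion is zero).
cont_distortion = distortion(true_d, pairwise(points))

# "Discrete tokenization": snap each angle to the center of one of
# K uniform codebook bins, as a categorical bottleneck would.
results = {}
for K in (4, 16, 64):
    bins = (np.floor(theta / (2 * np.pi) * K) + 0.5) * (2 * np.pi / K)
    quant = np.stack([np.cos(bins), np.sin(bins)], axis=1)
    results[K] = distortion(true_d, pairwise(quant))
```

With this fixed codebook, `results[4] > results[16] > results[64] > 0` while the continuous representation incurs no distortion at all, illustrating the "tax" that any finite categorical bottleneck imposes on continuous geometry.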
April 8, 2026