The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

Abstract

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind in which finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model simultaneously achieves low distortion, high mutual information, and global coherence.
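To make the idea of a categorical bottleneck concrete, the following is a minimal toy sketch and not the paper's experimental setup or distortion metric. It assumes points sampled on a unit circle as a stand-in for a continuous manifold, and compares how well two hypothetical representations preserve arc-length distances: a continuous (cos, sin) embedding and one-hot token codes obtained by uniform binning. All function names, the bin count, and the use of Pearson correlation as a proxy for geometric fidelity are illustrative assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of floats."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def geodesic(a, b):
    """Arc-length distance between two angles on the unit circle."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def continuous_distance(a, b):
    """Chord distance between the (cos, sin) embeddings of two angles."""
    return math.hypot(math.cos(a) - math.cos(b), math.sin(a) - math.sin(b))


def token_distance(a, b, num_bins):
    """Euclidean distance between one-hot codes: 0 if same bin, else sqrt(2)."""
    width = 2 * math.pi / num_bins
    bin_a = min(int(a // width), num_bins - 1)
    bin_b = min(int(b // width), num_bins - 1)
    return 0.0 if bin_a == bin_b else math.sqrt(2.0)


def distance_correlations(num_points=200, num_bins=16, seed=0):
    """Return (continuous, tokenized) correlations with true arc-length distance."""
    rng = random.Random(seed)
    angles = [rng.uniform(0.0, 2 * math.pi) for _ in range(num_points)]
    pairs = [
        (angles[i], angles[j])
        for i in range(num_points)
        for j in range(i + 1, num_points)
    ]
    true_d = [geodesic(a, b) for a, b in pairs]
    cont_d = [continuous_distance(a, b) for a, b in pairs]
    tok_d = [token_distance(a, b, num_bins) for a, b in pairs]
    return pearson(true_d, cont_d), pearson(true_d, tok_d)


if __name__ == "__main__":
    cont, tok = distance_correlations()
    print(f"continuous embedding correlation: {cont:.3f}")
    print(f"one-hot token correlation:        {tok:.3f}")
```

In this toy setting the one-hot codes only record whether two points share a bin, so every pair of distinct bins is equally far apart and most of the metric structure is discarded. It does not model learned codebooks or embeddings, and so says nothing about the non-monotonic behavior reported in the abstract.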
