The Data Manifold under the Microscope

Abstract

A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are often derived for simplified models or are too loose to be informative. Many of these bounds rely on the manifold hypothesis and on geometric regularity properties such as intrinsic dimension, curvature, and reach. Progress requires insight into data-manifold geometry and suitable benchmarks, yet existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets whose geometry can only be coarsely estimated. We introduce a benchmarking framework for studying data geometry. We repurpose and extend dSprites and COIL-20 with additional transformation dimensions and dense, axis-aligned sampling, and pair them with finite-difference estimators that recover curvature, reach, and volume at near-ground-truth accuracy in a regime where general-purpose estimators are unreliable or difficult to deploy. The framework is intended as a controlled testbed, useful both as a calibration environment for geometric estimators and as a sandbox for probing theoretical assumptions. To illustrate its use, we present two application studies: assessing the scaling behavior of the bounds of Genovese et al. and Fefferman et al., and tracking the layer-wise geometry of a β-VAE. Both highlight the behavior of current bounds and the value of controlled benchmarks for guiding and validating future theory. A reference implementation is available at https://github.com/koulakis/manifold-microscope.
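
To give a rough sense of why dense, axis-aligned sampling makes finite-difference estimation possible, the sketch below estimates the curvature of a curve sampled on a uniform parameter grid using central differences. It is an illustrative toy under our own assumptions, not the reference implementation: the function name curve_curvature_fd, the step size h, and the circle used as a test curve are all hypothetical choices made for this example.

```python
import numpy as np


def curve_curvature_fd(points: np.ndarray, h: float) -> np.ndarray:
    """Finite-difference curvature of a curve sampled on a uniform grid.

    points has shape (n, d), with points[i] = gamma(t0 + i * h).
    Returns curvature estimates at the n - 2 interior samples.
    (Hypothetical helper for illustration only.)
    """
    # Central differences for the first and second derivatives.
    d1 = (points[2:] - points[:-2]) / (2.0 * h)
    d2 = (points[2:] - 2.0 * points[1:-1] + points[:-2]) / h**2

    # Squared speed |gamma'|^2 at each interior sample.
    speed2 = np.sum(d1 * d1, axis=1)

    # Remove the tangential part of gamma'' to keep only the normal part.
    tangential = (np.sum(d2 * d1, axis=1) / speed2)[:, None] * d1
    normal = d2 - tangential

    # kappa = |normal part of gamma''| / |gamma'|^2 for any regular parametrization.
    return np.linalg.norm(normal, axis=1) / speed2


# Toy check: a circle of radius 2 embedded in R^5 has curvature 1/2 everywhere.
n = 2000
h = 2.0 * np.pi / n
t = np.arange(n) * h
circle = np.zeros((n, 5))
circle[:, 0] = 2.0 * np.cos(t)
circle[:, 1] = 2.0 * np.sin(t)

kappa = curve_curvature_fd(circle, h)
print(kappa.min(), kappa.max())
```

On this toy curve the estimates should agree with the analytic value 1/2 up to a discretization error of order h^2. The same idea needs a known, regularly spaced parameter grid, which is what axis-aligned sampling along each transformation dimension provides; scattered samples would not support these stencils directly.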