TideGS: 아웃오브코어 최적화를 통한 10억 개 이상의 3D 가우시안 스플래팅 프리미티브의 확장 가능한 학습

초록

10억 개 프리미티브(primitive) 규모에서 3D 가우시안 스플래팅(3DGS)을 학습하는 것은 근본적으로 메모리 제약적이다. 각 가우시안 프리미티브는 큰 속성 벡터를 가지며, 전체 파라미터 테이블은 빠르게 GPU 용량을 초과하여, 기존 시스템은 일반적인 단일 GPU 하드웨어에서 수천만 개의 가우시안으로 제한된다. 우리는 3DGS 학습이 본질적으로 희소하고 궤적 조건적(trajectory-conditioned)임을 관찰했다. 즉, 각 반복(iteration)에서 현재 카메라 배치에 보이는 가우시안만 활성화되므로, GPU 메모리는 영구적인 파라미터 저장소가 아닌 작업 세트 캐시(working-set cache) 역할을 할 수 있다. 이러한 통찰을 바탕으로, 우리는 SSD-CPU-GPU 계층 전반에 걸쳐 파라미터를 관리하는 아웃오브코어(out-of-core) 학습 프레임워크인 TideGS를 도입한다. 이 프레임워크는 세 가지 시너지 기술, 즉 SSD 정렬 공간 지역성을 위한 블록 가상화 지오메트리, I/O와 계산을 중첩시키는 계층적 비동기 파이프라인, 그리고 반복 간 증분 작업 세트 델타(incremental working-set deltas)만 전송하는 궤적 적응 차등 스트리밍(trajectory-adaptive differential streaming)을 활용한다. 실험 결과, TideGS는 단일 24GB GPU에서 10억 개 이상의 가우시안으로 학습을 가능하게 하면서, 대규모 장면에서 평가된 단일 GPU 기준선 중 최고의 재구성 품질을 달성하며, 기존 아웃오브코어 기준선(예: 약 1억 개의 가우시안) 및 표준 인메모리 학습(예: 약 1100만 개의 가우시안)을 능가하는 확장성을 보여준다.

English

Training 3D Gaussian Splatting (3DGS) at billion-primitive scale is fundamentally memory-bound: each Gaussian primitive carries a large attribute vector, and the aggregate parameter table quickly exceeds GPU capacity, limiting prior systems to tens of millions of Gaussians on commodity single-GPU hardware. We observe that 3DGS training is inherently sparse and trajectory-conditioned: each iteration activates only the Gaussians visible from the current camera batch, so GPU memory can serve as a working-set cache rather than a persistent parameter store. Building on this insight, we introduce TideGS, an out-of-core training framework that manages parameters across an SSD-CPU-GPU hierarchy via three synergistic techniques: block-virtualized geometry for SSD-aligned spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations. Experiments show that TideGS enables training with over one billion Gaussians on a single 24 GB GPU while achieving the best reconstruction quality among evaluated single-GPU baselines on large-scale scenes, scaling beyond prior out-of-core baselines (e.g., approximately 100M Gaussians) and standard in-memory training (e.g., approximately 11M Gaussians).