3D Scene Understanding Through Local Random Access Sequence Modeling

Abstract

3D scene understanding from single images is a pivotal problem in computer vision with numerous downstream applications in graphics, augmented reality, and robotics. While diffusion-based modeling approaches have shown promise, they often struggle to maintain object and scene consistency, especially in complex real-world scenarios. To address these limitations, we propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling, which uses local patch quantization and randomly ordered sequence generation. By utilizing optical flow as an intermediate representation for 3D scene editing, our experiments demonstrate that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation capabilities. Furthermore, we show that our framework naturally extends to self-supervised depth estimation through a simple modification of the sequence design. By achieving strong performance on multiple 3D scene understanding tasks, LRAS provides a unified and effective framework for building the next generation of 3D vision models.

