Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Abstract

Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
