Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Abstract

Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models on both camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
