Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

October 9, 2025
作者: Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy
cs.AI

Abstract

Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats the camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
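The abstract mentions conditioning on both global camera parameters and pixel-wise camera maps, but does not define the maps. The sketch below shows one common way such a map can be derived: a per-pixel latitude map computed from a pinhole focal length plus camera roll and pitch. The function name `latitude_map`, its parameters, and the rotation convention are illustrative assumptions, not Puffin's actual formulation.

```python
# Hypothetical sketch of a "pixel-wise camera map": for each pixel, encode a
# camera-dependent geometric quantity (here, the latitude angle between the
# pixel's viewing ray and the horizon) from global camera parameters.
# This is one common formulation from the calibration literature, not
# necessarily the map used by Puffin.
import numpy as np

def latitude_map(h, w, focal_px, roll_rad, pitch_rad):
    """Per-pixel latitude (radians) under a pinhole model with gravity."""
    # Pixel grid centered at the principal point (assumed at the image center).
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = xs - (w - 1) / 2.0
    y = ys - (h - 1) / 2.0
    # Unit viewing ray per pixel in camera coordinates (z forward, y down).
    rays = np.stack([x, y, np.full_like(x, focal_px)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Gravity ("down") direction in camera coordinates after applying roll
    # and pitch (assumed rotation convention; roll about z, pitch about x).
    cr, sr = np.cos(roll_rad), np.sin(roll_rad)
    cp, sp = np.cos(pitch_rad), np.sin(pitch_rad)
    down = np.array([sr * cp, cr * cp, -sp])
    # Latitude = signed angle between the viewing ray and the horizon plane.
    return np.arcsin(np.clip(rays @ down, -1.0, 1.0))

# Usage: a 256x256 map for a camera with ~300 px focal length, slight tilt.
lat = latitude_map(256, 256, focal_px=300.0, roll_rad=0.1, pitch_rad=0.2)
print(lat.shape)  # (256, 256); values lie in [-pi/2, pi/2]
```

Encoding the camera as a dense map like this aligns it spatially with image features, which is one plausible reason the abstract pairs pixel-wise maps with global parameters for generation.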