
Pippo: High-Resolution Multi-View Humans from a Single Image

February 11, 2025
作者: Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, Timur Bagautdinov
cs.AI

Abstract

We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs - e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low-resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high-resolution and use pixel-aligned controls (e.g., Spatial anchor and Plucker rays) to enable 3D consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate greater than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.
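The abstract's pixel-aligned camera controls include Plücker rays, a standard per-pixel ray parameterization. Below is a minimal sketch of how such ray embeddings are typically computed for a pinhole camera; the function name `plucker_rays` and the world-to-camera conventions are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def plucker_rays(K, R, t, H, W):
    """Per-pixel Plücker ray embeddings (direction, moment) for a pinhole
    camera. K: 3x3 intrinsics; R, t: world-to-camera rotation and
    translation. Returns an (H, W, 6) array."""
    # Camera center in world coordinates.
    c = -R.T @ t
    # Homogeneous pixel coordinates at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project to world-frame ray directions and normalize.
    d = pix @ np.linalg.inv(K).T @ R  # row-vector form of R^T K^{-1} p
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Moment m = c x d; (d, m) uniquely identifies each ray in 3D.
    m = np.cross(np.broadcast_to(c, d.shape), d)
    return np.concatenate([d, m], axis=-1)  # (H, W, 6)
```

The resulting 6-channel map is spatially aligned with the target image, which is what makes it usable as a pixel-aligned conditioning signal for a diffusion transformer.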
