Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

Abstract

We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop a framework that generates multi-view consistent edited views without per-scene training. It consists of two novel components: (1) a referring multi-view editor, which enables precise, reference-driven edits that remain coherent across all viewpoints; and (2) an any-view-to-video synthesizer, which leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Extensive experiments show that Tinker significantly lowers the barrier to generalizable 3D content creation and achieves state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker
