Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

Abstract

The explosive growth of generative video models has amplified the demand for reliable copyright protection of AI-generated content. Although invisible generative watermarking is popular in image synthesis, it remains largely unexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and the cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches; each patch is assigned to the most visually similar video frame and then localized to the optimal spatial region within that frame for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, which effectively models long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in video quality, watermark fidelity, and robustness, which we attribute largely to the proposed matching mechanism and architecture. We will release our code upon publication.
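
To make the coarse-to-fine matching idea concrete, the following is a minimal sketch in plain Python. It is not the Safe-Sora implementation: the abstract does not specify how visual similarity is measured, so this example assumes mean intensity for the frame-level (coarse) stage and sum of squared differences for the position-level (fine) stage. All function names are hypothetical, and images are small grayscale grids stored as lists of rows.

```python
# Illustrative sketch only. The similarity measures (mean intensity, sum of
# squared differences) are assumptions, not Safe-Sora's actual criteria.

def split_into_patches(image, patch):
    """Split a 2D grayscale image into non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles

def mean(block):
    """Average pixel value of a 2D block."""
    values = [v for row in block for v in row]
    return sum(values) / len(values)

def ssd(block_a, block_b):
    """Sum of squared differences between two equally sized blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def coarse_match(tile, frames):
    """Coarse stage: index of the frame whose mean intensity is closest to the tile's."""
    target = mean(tile)
    return min(range(len(frames)), key=lambda i: abs(mean(frames[i]) - target))

def fine_match(tile, frame):
    """Fine stage: (top, left) of the window in the frame with the lowest SSD to the tile."""
    patch = len(tile)
    height, width = len(frame), len(frame[0])
    best_pos, best_cost = None, None
    for top in range(height - patch + 1):
        for left in range(width - patch + 1):
            window = [row[left:left + patch] for row in frame[top:top + patch]]
            cost = ssd(tile, window)
            if best_cost is None or cost < best_cost:
                best_pos, best_cost = (top, left), cost
    return best_pos

def match_watermark(watermark, frames, patch):
    """Return one (frame_index, top, left) placement per watermark tile."""
    placements = []
    for tile in split_into_patches(watermark, patch):
        frame_index = coarse_match(tile, frames)
        top, left = fine_match(tile, frames[frame_index])
        placements.append((frame_index, top, left))
    return placements

# Toy data: a 2x4 watermark with a dark half and a bright half, and two 4x4 frames.
watermark = [[0, 0, 9, 9],
             [0, 0, 9, 9]]
dark_frame = [[1, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 1]]
bright_frame = [[8, 8, 8, 8],
                [8, 8, 9, 9],
                [8, 8, 9, 9],
                [8, 8, 8, 8]]

print(match_watermark(watermark, [dark_frame, bright_frame], patch=2))
```

In this toy setup the dark tile is routed to the dark frame and the bright tile to the bright frame, and each lands on the window that matches it exactly. A learned feature distance would presumably replace both hand-picked measures in a real system; the two-stage structure (pick a frame, then pick a region) is the only part taken from the abstract.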