Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

Abstract

The explosive growth of generative video models has amplified the demand for reliable copyright protection of AI-generated content. Although invisible generative watermarking is popular in image synthesis, it remains largely unexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and the cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches; each patch is assigned to the most visually similar video frame and then localized to the optimal spatial region within that frame for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, which effectively models long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in video quality, watermark fidelity, and robustness, which we attribute largely to the proposed adaptive matching mechanism and the wavelet-enhanced Mamba architecture. We will release our code upon publication.
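The coarse-to-fine matching described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, and mean-colour distance (coarse step) and mean squared error over grid-aligned regions (fine step) are assumed stand-ins for whatever similarity measure the framework actually uses. Inputs are assumed to be float arrays, with the video shaped (T, H, W, C) and the watermark shaped (h, w, C).

```python
import numpy as np

def split_into_patches(watermark, patch):
    """Split an (h, w, C) watermark into non-overlapping (patch, patch, C) tiles."""
    h, w, _ = watermark.shape
    return [watermark[y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, patch)
            for x in range(0, w - patch + 1, patch)]

def coarse_to_fine_match(watermark, video, patch):
    """Return one (frame_index, top, left) placement per watermark patch.

    Coarse step (assumed measure): choose the frame whose mean colour is
    closest to the patch's mean colour.
    Fine step (assumed measure): inside that frame, choose the grid-aligned
    region with the lowest mean squared error to the patch.
    """
    frame_means = video.mean(axis=(1, 2))              # shape (T, C)
    placements = []
    for tile in split_into_patches(watermark, patch):
        tile_mean = tile.mean(axis=(0, 1))             # shape (C,)
        dist = ((frame_means - tile_mean) ** 2).sum(axis=1)
        frame_idx = int(np.argmin(dist))
        frame = video[frame_idx]
        best, best_err = (0, 0), np.inf
        for top in range(0, frame.shape[0] - patch + 1, patch):
            for left in range(0, frame.shape[1] - patch + 1, patch):
                region = frame[top:top + patch, left:left + patch]
                err = float(((region - tile) ** 2).mean())
                if err < best_err:
                    best, best_err = (top, left), err
        placements.append((frame_idx, best[0], best[1]))
    return placements
```

Calling `coarse_to_fine_match(watermark, video, patch=16)` would yield a list of placements that an embedding network could then consume; the learned embedding and retrieval stages themselves are outside the scope of this sketch.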