On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices

Abstract

We present On-device Sora, the first solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. Building on Open-Sora, On-device Sora applies three novel techniques to address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices. First, Linear Proportional Leap (LPL) reduces the excessive number of denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) reduces the intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, addressing the limited memory of the device. We implement On-device Sora on the iPhone 15 Pro, and experimental evaluations demonstrate that it generates high-quality videos on the device, comparable to those produced by Open-Sora running on high-end GPUs. These results show that On-device Sora enables efficient, high-quality video generation on resource-constrained mobile devices, expanding accessibility, protecting user privacy, reducing dependence on cloud infrastructure, and lowering associated costs. We envision On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies by bringing video generation to commodity mobile and embedded devices. The code is publicly available in a GitHub repository: https://github.com/eai-lab/On-device-Sora.
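
To make the token-merging idea concrete, the following is a minimal sketch of merging consecutive tokens along the temporal dimension. It is not taken from the On-device Sora code base. The function names `merge_temporal` and `unmerge_temporal`, the nested-list token layout, the choice of averaging adjacent frame pairs, and the restore-by-repetition step are all assumptions made for illustration.

```python
from typing import List

Token = List[float]   # one token embedding of dimension D (hypothetical layout)
Frame = List[Token]   # N spatial tokens belonging to one latent frame


def merge_temporal(frames: List[Frame]) -> List[Frame]:
    """Average each pair of consecutive frames, roughly halving the frame count.

    Assumption: merging is a plain element-wise mean of adjacent frames.
    """
    merged: List[Frame] = []
    for t in range(0, len(frames) - 1, 2):
        first, second = frames[t], frames[t + 1]
        merged.append([
            [(x + y) / 2.0 for x, y in zip(tok_a, tok_b)]
            for tok_a, tok_b in zip(first, second)
        ])
    if len(frames) % 2 == 1:
        # Odd frame count: the last frame has no partner, so keep a copy of it.
        merged.append([list(tok) for tok in frames[-1]])
    return merged


def unmerge_temporal(merged: List[Frame], original_len: int) -> List[Frame]:
    """Restore the original frame count by repeating each merged frame.

    Assumption: restoration is simple duplication, which is lossy.
    """
    restored: List[Frame] = []
    for frame in merged:
        restored.append([list(tok) for tok in frame])
        if len(restored) < original_len:
            restored.append([list(tok) for tok in frame])
    return restored[:original_len]


# Three frames (T=3), one spatial token each (N=1), embedding size two (D=2).
frames = [[[0.0, 2.0]], [[2.0, 4.0]], [[10.0, 10.0]]]
merged = merge_temporal(frames)
restored = unmerge_temporal(merged, len(frames))
```

In this sketch an attention layer placed between the two calls would process two frames instead of three. Because attention cost grows with the number of tokens, shortening the temporal axis is where the saving would come from under these assumptions.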
