LTX-2: Efficient Joint Audio-Visual Foundation Model

January 6, 2026
Authors: Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman
cs.AI

Abstract

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
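
The architectural description above (two streams of different width, bidirectional audio-video cross-attention, and AdaLN driven by a shared timestep embedding) can be sketched in a few lines of PyTorch. The sketch below is an illustrative guess at the block structure, not the released LTX-2 code: every dimension, module name, and the ordering of attention and normalization are assumptions, and temporal positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """LayerNorm whose scale/shift come from a shared timestep embedding."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, t_emb):
        # t_emb: (B, cond_dim) -> per-channel scale/shift, broadcast over tokens.
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class DualStreamBlock(nn.Module):
    """One asymmetric block: a wider video stream and a narrower audio stream,
    coupled by bidirectional cross-attention. All sizes are hypothetical."""
    def __init__(self, video_dim=2048, audio_dim=1024, cond_dim=512, heads=16):
        super().__init__()
        self.v_norm = AdaLN(video_dim, cond_dim)
        self.a_norm = AdaLN(audio_dim, cond_dim)
        self.v_self = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.a_self = nn.MultiheadAttention(audio_dim, heads // 2, batch_first=True)
        # Bidirectional coupling: video queries attend to audio keys/values
        # and vice versa, even though the two streams have different widths.
        self.v_from_a = nn.MultiheadAttention(video_dim, heads, kdim=audio_dim,
                                              vdim=audio_dim, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(audio_dim, heads // 2, kdim=video_dim,
                                              vdim=video_dim, batch_first=True)

    def forward(self, v, a, t_emb):
        # The same timestep embedding modulates both streams (shared conditioning).
        v_n, a_n = self.v_norm(v, t_emb), self.a_norm(a, t_emb)
        v = v + self.v_self(v_n, v_n, v_n)[0]
        a = a + self.a_self(a_n, a_n, a_n)[0]
        # Cross-modal exchange in both directions.
        v = v + self.v_from_a(v, a, a)[0]
        a = a + self.a_from_v(a, v, v)[0]
        return v, a

# Example: couple 16 video tokens with 32 audio tokens for a batch of 2.
block = DualStreamBlock()
v, a = torch.randn(2, 16, 2048), torch.randn(2, 32, 1024)
t_emb = torch.randn(2, 512)
v, a = block(v, a, t_emb)
```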
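
Similarly, one plausible reading of modality-aware classifier-free guidance is that each modality's prediction gets its own guidance scale against a shared unconditional pass; the paper may instead drop conditioning per modality or use another formulation. The `model` signature, scale values, and guidance formula below are all assumptions for illustration.

```python
import torch

@torch.no_grad()
def modality_cfg_step(model, v_t, a_t, t, text_cond, null_cond,
                      scale_video=7.0, scale_audio=5.0):
    """One guided denoising step with a separate CFG scale per modality.

    `model(v_t, a_t, t, cond)` is a hypothetical denoiser returning a
    (video, audio) prediction pair; it is not the released LTX-2 API.
    """
    # Conditional and unconditional passes share the same noisy latents.
    v_c, a_c = model(v_t, a_t, t, text_cond)
    v_u, a_u = model(v_t, a_t, t, null_cond)
    # Standard CFG extrapolation, but each modality gets its own weight.
    v_out = v_u + scale_video * (v_c - v_u)
    a_out = a_u + scale_audio * (a_c - a_u)
    return v_out, a_out
```

Decoupling the two scales would let prompt adherence be tuned for video and audio independently, which is consistent with the abstract's claim of improved audiovisual alignment and controllability.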