LTX-2: Efficient Joint Audio-Visual Foundation Model
January 6, 2026
作者: Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman
cs.AI
Abstract
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
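To make the coupling described in the abstract concrete, the following is a minimal PyTorch sketch of one asymmetric dual-stream block: a wider video stream and a narrower audio stream each run self-attention, exchange information through bidirectional audio-video cross-attention, and receive shared timestep conditioning through AdaLN. All names, dimensions, and layer choices here (DualStreamBlock, d_video, d_audio, the use of nn.MultiheadAttention) are illustrative assumptions for exposition, not the released LTX-2 implementation.

# Minimal sketch of the asymmetric dual-stream coupling described above.
# Module and parameter names are hypothetical, not from the released code.
import torch
import torch.nn as nn


class AdaLN(nn.Module):
    """LayerNorm whose scale and shift are predicted from a conditioning vector
    (here, the shared timestep embedding)."""

    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class DualStreamBlock(nn.Module):
    """One asymmetric block: per-stream self-attention plus bidirectional
    audio<->video cross-attention, with shared timestep conditioning."""

    def __init__(self, d_video=2048, d_audio=1024, n_heads=16, cond_dim=512):
        super().__init__()
        self.v_norm = AdaLN(d_video, cond_dim)
        self.a_norm = AdaLN(d_audio, cond_dim)
        self.v_self = nn.MultiheadAttention(d_video, n_heads, batch_first=True)
        self.a_self = nn.MultiheadAttention(d_audio, n_heads, batch_first=True)
        # kdim/vdim let the narrower audio stream attend into the wider video
        # stream and vice versa without projecting both to one shared width.
        self.v_from_a = nn.MultiheadAttention(d_video, n_heads, kdim=d_audio,
                                              vdim=d_audio, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(d_audio, n_heads, kdim=d_video,
                                              vdim=d_video, batch_first=True)

    def forward(self, video, audio, t_emb):
        # Shared timestep conditioning applied to both modalities via AdaLN.
        v, a = self.v_norm(video, t_emb), self.a_norm(audio, t_emb)
        video = video + self.v_self(v, v, v, need_weights=False)[0]
        audio = audio + self.a_self(a, a, a, need_weights=False)[0]
        # Bidirectional cross-modal exchange; in the paper's description,
        # temporal positional embeddings would be added to the tokens first.
        # A full block would also include cross-attention pre-norms and
        # per-stream MLPs, omitted here for brevity.
        video = video + self.v_from_a(video, audio, audio, need_weights=False)[0]
        audio = audio + self.a_from_v(audio, video, video, need_weights=False)[0]
        return video, audio

Under this reading, the asymmetry is carried by the widths and depths of the two streams (14B vs. 5B parameters in the full model), while the bidirectional cross-attention layers, together with temporal positional embeddings, are what keep the audio and video token sequences synchronized in time.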