LumosX:关联任意身份特征与属性实现个性化视频生成
LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
March 20, 2026
作者: Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu
cs.AI
摘要
扩散模型的最新进展显著提升了文本到视频的生成能力,实现了对前景与背景元素的细粒度可控个性化内容生成。然而,跨主体的精确人脸属性对齐仍具挑战性,现有方法缺乏确保组内一致性的显式机制。解决这一难题需要显式建模策略与人脸属性感知数据资源的双重突破。为此,我们提出LumosX框架,在数据和模型设计层面同步推进。数据层面,通过定制化采集流程协调来自独立视频的字幕与视觉线索,同时利用多模态大语言模型推断并分配主体特定的依赖关系。这些提取的关系先验施加了更细粒度的结构约束,既增强了个性化视频生成的表达控制力,又支撑了综合性基准数据集的构建。模型层面,关系自注意力与关系交叉注意力机制将位置感知嵌入与优化的注意力动态相融合,刻画出显式的主体-属性依赖关系,从而强化组内凝聚力并放大不同主体集群间的区分度。在我们构建的基准测试上的综合评估表明,LumosX在细粒度、身份一致性和语义对齐的个性化多主体视频生成任务中达到了最先进性能。代码与模型已开源:https://jiazheng-xing.github.io/lumosx-home/。
English
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.