LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Abstract

Recent advances in multimodal foundation models that unify image understanding and generation have opened promising avenues for tackling a wide range of vision-language tasks within a single framework. Despite this progress, existing unified models typically require extensive pretraining and struggle to match the performance of models dedicated to each task. Many of them also generate images slowly, which limits their practical deployment in real-time or resource-constrained settings. In this work, we propose the Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds on powerful pretrained Vision-Language Models (VLMs) to inherit their strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency because only a small subset of layers is activated at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks while delivering competitive image generation quality with around 6x faster inference than recent unified multimodal models.
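The layerwise timestep-expert idea can be illustrated with a minimal routing sketch. The abstract does not say how timesteps or layers are partitioned, so everything below is an assumption made for illustration: the time axis [0, 1] is split into equal intervals, each interval is paired with an equal-sized contiguous block of layers, and the names LayerGroup, build_groups, active_group, and layer_evaluations are hypothetical. The sketch only counts which layers would run at each sampling step; it does not implement flow matching or the Timestep-Conditioned Residual Attention mechanism.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LayerGroup:
    """Hypothetical container: one block of layers serving one timestep interval."""
    first_layer: int  # inclusive layer index
    last_layer: int   # exclusive layer index
    t_start: float    # inclusive start of the timestep interval
    t_end: float      # exclusive end (the last group also accepts t == 1.0)


def build_groups(num_layers: int, num_groups: int) -> list[LayerGroup]:
    """Assumed partition: equal contiguous layer blocks, equal timestep intervals."""
    if num_groups <= 0 or num_layers % num_groups != 0:
        raise ValueError("num_layers must be a positive multiple of num_groups")
    per_group = num_layers // num_groups
    return [
        LayerGroup(
            first_layer=g * per_group,
            last_layer=(g + 1) * per_group,
            t_start=g / num_groups,
            t_end=(g + 1) / num_groups,
        )
        for g in range(num_groups)
    ]


def active_group(groups: list[LayerGroup], t: float) -> LayerGroup:
    """Return the single layer group assumed to be responsible for timestep t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Clamp so that t == 1.0 falls into the last group instead of overflowing.
    index = min(int(t * len(groups)), len(groups) - 1)
    return groups[index]


def layer_evaluations(groups: list[LayerGroup], timesteps: list[float]) -> int:
    """Count layer forward passes when only the active group runs at each step."""
    total = 0
    for t in timesteps:
        group = active_group(groups, t)
        total += group.last_layer - group.first_layer
    return total


if __name__ == "__main__":
    # Toy configuration (not taken from the paper): 24 layers, 4 groups, 8 steps.
    groups = build_groups(num_layers=24, num_groups=4)
    timesteps = [i / 8 for i in range(8)]
    routed = layer_evaluations(groups, timesteps)
    dense = len(timesteps) * 24  # every layer active at every step
    print(f"routed layer passes: {routed}, dense layer passes: {dense}")
```

In this toy configuration each sampling step touches one quarter of the layers, which is where the savings in layer evaluations come from. The ratio is a property of the assumed configuration only and is not meant to reproduce the roughly 6x speedup reported in the abstract.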
