LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

Abstract

Recent advances in text-to-video (T2V) generative models have shown impressive capabilities. However, these models still fall short in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions). This is particularly difficult to address because human preferences are inherently subjective and hard to formalize as objective functions. This paper therefore proposes LiFT, a novel fine-tuning method that leverages human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this dataset, we train a reward model, LiFT-Critic, to learn the reward function effectively; it serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback to improve the alignment and quality of synthesized videos.
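The idea of maximizing a reward-weighted likelihood can be illustrated with a toy sketch. This is not the paper's training code: it assumes a hypothetical "generator" that picks one of three candidate outputs from a softmax over logits, and the reward scores are made-up stand-ins for what a reward model such as LiFT-Critic would produce. All function names are illustrative. The objective is the average of `reward * log p(sample)`, optimized by plain gradient ascent.

```python
import math


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_log_likelihood(logits, samples):
    """Average of reward * log-probability over (index, reward) samples."""
    probs = softmax(logits)
    return sum(r * math.log(probs[x]) for x, r in samples) / len(samples)


def ascent_step(logits, samples, lr):
    """One gradient-ascent step on the reward-weighted log-likelihood."""
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for x, r in samples:
        for k in range(len(logits)):
            # d/d(logit_k) of log p(x) is 1[k == x] - p_k for a softmax.
            indicator = 1.0 if k == x else 0.0
            grads[k] += r * (indicator - probs[k]) / len(samples)
    return [v + lr * g for v, g in zip(logits, grads)]


def fine_tune(logits, samples, lr=1.0, steps=500):
    """Repeat the ascent step; lr and steps are arbitrary toy settings."""
    for _ in range(steps):
        logits = ascent_step(logits, samples, lr)
    return logits


# Hypothetical data: (output index, reward score in [0, 1]).
samples = [(0, 0.9), (1, 0.5), (2, 0.1)]
initial_logits = [0.0, 0.0, 0.0]
tuned_logits = fine_tune(initial_logits, samples)

print("before:", softmax(initial_logits))
print("after: ", softmax(tuned_logits))
```

In this toy setting, probability mass moves toward the outputs with higher scores, which is the intended effect of weighting the likelihood by reward. In an actual T2V model, the per-sample likelihood term would come from the video model's own training loss rather than from a three-way softmax.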
