MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Abstract

The development of multimodal models has marked a significant step forward in how machines understand videos. These models have shown promise in analyzing short video clips. However, when it comes to longer formats like movies, they often fall short. The main hurdles are the lack of high-quality, diverse video data and the intensive work required to collect or annotate such data. In the face of these challenges, we propose MovieLLM, a novel framework designed to create synthetic, high-quality data for long videos. This framework leverages the power of GPT-4 and text-to-image models to generate detailed scripts and corresponding visuals. Our approach stands out for its flexibility and scalability, making it a superior alternative to traditional data collection methods. Our extensive experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives, overcoming the limitations of existing datasets regarding scarcity and bias.
