A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot

Abstract

Multimedia content, such as advertisements and story videos, exhibits a rich blend of creativity and multiple modalities. It incorporates elements like text, visuals, audio, and storytelling techniques, employing devices like emotions, symbolism, and slogans to convey meaning. Previous research in multimedia understanding has focused mainly on videos with specific actions like cooking, and a dearth of large annotated training datasets has hindered the development of supervised learning models with satisfactory performance for real-world applications. Meanwhile, large language models (LLMs) have shown remarkable zero-shot performance in various natural language processing (NLP) tasks, such as emotion classification, question-answering, and topic classification. To bridge this performance gap in multimedia understanding, we propose verbalizing story videos to generate their descriptions in natural language and then performing video-understanding tasks on the generated story as opposed to the original video. Through extensive experiments on five video-understanding tasks, we demonstrate that our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding. Further, to alleviate the lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science, persuasion strategy identification.
