Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report

Abstract

Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy than traditional unimodal models. In this report, we present Multimodal Structured Generation, a general framework that constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second-highest score on the hidden test set for Phase 2 and the third-highest score overall, which shows the method's ability to generalize to unseen tasks. It also shows that simple engineering can beat expensive and complicated modelling steps, as we first discussed in our paper, "Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use". All of our scripts, deployment steps, and evaluation results are available at https://github.com/leloykun/MMFM-Challenge
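As a rough illustration of what constraining the output logits of a frozen model can look like, the sketch below masks every token that a small grammar disallows before picking the next token. It is a toy under stated assumptions and not the implementation used in the challenge: the vocabulary, the state machine `TRANSITIONS`, and the names `mask_logits`, `constrained_greedy_decode`, and `fake_model` are all hypothetical, and the "reason before responding" part of the framework is not modelled here.

```python
import math

# Toy vocabulary (assumption); a real MMFM has tens of thousands of tokens.
VOCAB = ['{', '}', '"total"', ':', '0', '1', 'hello', '<eos>']

# Hypothetical grammar for outputs of the form {"total":<digit>}, written as
# a state machine: state -> {allowed token: next state}. None means "stop".
TRANSITIONS = {
    0: {'{': 1},
    1: {'"total"': 2},
    2: {':': 3},
    3: {'0': 4, '1': 4},
    4: {'}': 5},
    5: {'<eos>': None},
}


def mask_logits(logits, state):
    """Set the logit of every token the grammar disallows to -inf."""
    allowed = TRANSITIONS[state]
    return [x if tok in allowed else -math.inf
            for tok, x in zip(VOCAB, logits)]


def constrained_greedy_decode(logits_fn):
    """Greedy decoding restricted to grammar-allowed tokens at each step.

    logits_fn(tokens_so_far) stands in for a frozen model's forward pass;
    the model's weights are never changed, only its logits are masked.
    """
    state, out = 0, []
    while state is not None:
        masked = mask_logits(logits_fn(out), state)
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        tok = VOCAB[best]
        if tok != '<eos>':
            out.append(tok)
        state = TRANSITIONS[state][tok]
    return ''.join(out)


def fake_model(tokens_so_far):
    # Stand-in model that always prefers the free-text token 'hello'.
    return [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 9.0, 0.0]


result = constrained_greedy_decode(fake_model)
print(result)
```

Even though the stand-in model always scores the free-text token highest, the masked decoder can only emit a string that matches the grammar, which is what lets a downstream API parse the output without retries.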
