HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context

Abstract

With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, one that demands detailed and thoughtful reasoning. Recent studies show that Reinforcement Learning (RL) has potential for enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges of adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and the shortcut problem. Insufficient global context understanding arises when a model misinterprets the multimodal context and therefore produces an incorrect answer. The shortcut problem occurs when the model overlooks crucial clues in the multimodal inputs and addresses the query directly, without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we use the LLM to assess a logical reward, which determines whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce IntentBench, an omni-modal reasoning benchmark for evaluating how well models understand complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.
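The four reward signals named in the abstract (format, accuracy, context, and logical) can be illustrated with a minimal Python sketch. The abstract does not say how the response is laid out, how the rewards are weighted, or how the judge is called, so everything concrete here is an assumption made for illustration: the `<context>`/`<think>`/`<answer>` tag layout, the equal default weights, the weighted-sum combination, the zero reward for malformed output, and the `context_judge` and `logic_judge` callables that stand in for the LLM judge.

```python
import re
from typing import Callable, Tuple

# Assumed (hypothetical) response layout; the abstract does not specify tags.
RESPONSE_PATTERN = re.compile(
    r"\s*<context>(?P<context>.+?)</context>\s*"
    r"<think>(?P<think>.+?)</think>\s*"
    r"<answer>(?P<answer>.+?)</answer>\s*",
    re.DOTALL,
)

# Stand-in for an LLM judge: (text to grade, reference text) -> score in [0, 1].
Judge = Callable[[str, str], float]


def total_reward(
    response: str,
    gold_answer: str,
    reference_context: str,
    context_judge: Judge,
    logic_judge: Judge,
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
) -> float:
    """Combine format, accuracy, context, and logical rewards (illustrative only)."""
    match = RESPONSE_PATTERN.fullmatch(response)
    if match is None:
        # Assumption: a malformed response earns no reward at all.
        return 0.0

    w_format, w_accuracy, w_context, w_logic = weights

    # Format reward: the response followed the expected layout.
    format_score = 1.0

    # Accuracy reward: case-insensitive exact match against the gold answer.
    predicted = match["answer"].strip().lower()
    accuracy_score = 1.0 if predicted == gold_answer.strip().lower() else 0.0

    # Context reward: the judge grades the model's description of the
    # multimodal context against a reference description.
    context_score = context_judge(match["context"], reference_context)

    # Logical reward: the judge grades whether the reasoning actually
    # draws on the stated context.
    logic_score = logic_judge(match["think"], match["context"])

    return (
        w_format * format_score
        + w_accuracy * accuracy_score
        + w_context * context_score
        + w_logic * logic_score
    )
```

In this sketch, a response that skips the context description fails the format check, which is one simple way to discourage the shortcut behavior described above; the actual training recipe may differ.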
