Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

Abstract

Improvements in language models' capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use cases are grouped together under the umbrella term "long-context", defined simply by the total length of the model's input; examples include Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context based on the properties that make these tasks more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, are severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can implement more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter context.
