Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Abstract

Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. These metrics depend on hardware and runtime choices (e.g., whether inference is parallelized, batch size) and often fail to account for model size, which makes them difficult to interpret and obscures the evaluation of the efficiency-effectiveness trade-off. To address this issue, we propose E2R-FLOPs for LLM-based rerankers, comprising two metrics: ranking metrics per PetaFLOP (RPP), which measures relevance per unit of compute, and queries per PetaFLOP (QPP), which measures hardware-agnostic throughput. Alongside the new metrics, we build an interpretable FLOPs estimator that estimates the FLOPs of an LLM-based reranker without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
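To make the two metrics concrete, the following minimal sketch shows one plausible reading of RPP and QPP: a ranking score divided by the compute spent per query (in PetaFLOPs), and the number of queries that fit in one PetaFLOP. The function names and the example numbers are hypothetical. The `approx_forward_flops` helper uses the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass, which is an assumption made here for illustration and is not the estimator described in the abstract.

```python
PETA = 1e15  # FLOPs in one PetaFLOP


def approx_forward_flops(num_params: float, num_tokens: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per parameter per processed token.

    Assumption for illustration only; it ignores attention's dependence on
    sequence length and any decoding-specific costs.
    """
    return 2.0 * num_params * num_tokens


def rpp(metric_value: float, flops_per_query: float) -> float:
    """Ranking metric per PetaFLOP: effectiveness per unit of compute."""
    return metric_value / (flops_per_query / PETA)


def qpp(flops_per_query: float) -> float:
    """Queries per PetaFLOP: hardware-agnostic throughput."""
    return PETA / flops_per_query


if __name__ == "__main__":
    # Hypothetical setup: a 7B-parameter pointwise reranker scoring
    # 100 candidates of 200 tokens each for a single query.
    flops = approx_forward_flops(num_params=7e9, num_tokens=100 * 200)
    print(f"FLOPs per query: {flops:.3e}")
    print(f"RPP (metric = 0.70): {rpp(0.70, flops):.3f}")
    print(f"QPP: {qpp(flops):.3f}")
```

Because both quantities are computed from FLOPs rather than wall-clock time, they do not change with batch size or hardware, which is the property the abstract attributes to the proposed metrics.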
