Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Abstract

Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. These metrics depend on hardware and run-time choices (e.g., whether execution is parallel, batch size, etc.) and often fail to account for model size, which makes them difficult to interpret and obscures the evaluation of the efficiency-effectiveness trade-off. To address this issue, we propose E2R-FLOPs for LLM-based rerankers: ranking metrics per PetaFLOP (RPP) for relevance per compute, and queries per PetaFLOP (QPP) for hardware-agnostic throughput. Alongside the new metrics, we build an interpretable FLOPs estimator that estimates the FLOPs of an LLM-based reranker without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
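To make the two metrics concrete, the following is a minimal sketch of how RPP and QPP could be computed from a per-query FLOPs figure. The abstract does not give the formulas, so the definitions here are assumptions: RPP is taken as a ranking metric (for example nDCG@10) divided by the average PetaFLOPs spent per query, and QPP as the number of queries processed per PetaFLOP. The function `rough_flops_per_query` is a hypothetical stand-in that uses the common "2 x parameters x tokens" rule of thumb for a dense decoder; it is not the paper's estimator.

```python
PETA = 1e15  # FLOPs in one PetaFLOP


def rough_flops_per_query(num_params: float, input_tokens: int, output_tokens: int) -> float:
    """Hypothetical estimate: ~2 FLOPs per parameter per processed token.

    This is a generic rule of thumb, not the estimator proposed in the paper.
    """
    return 2.0 * num_params * (input_tokens + output_tokens)


def rpp(metric_value: float, flops_per_query: float) -> float:
    """Assumed RPP: ranking-metric value per PetaFLOP of compute per query."""
    return metric_value * PETA / flops_per_query


def qpp(flops_per_query: float) -> float:
    """Assumed QPP: number of queries that fit in one PetaFLOP."""
    return PETA / flops_per_query


# Illustrative numbers only: a 1B-parameter reranker reading 5,000 tokens per query.
flops = rough_flops_per_query(num_params=1e9, input_tokens=5000, output_tokens=0)
example_qpp = qpp(flops)        # about 100 queries per PetaFLOP
example_rpp = rpp(0.7, flops)   # about 70 metric points per PetaFLOP
```

Because both quantities are normalized by FLOPs rather than by wall-clock time, the same numbers would be obtained regardless of hardware or batch size, which is the property the abstract attributes to the proposed metrics.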
