Trove: A Flexible Toolkit for Dense Retrieval

Abstract

We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, it offers efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly with just a few lines of code. This lets users experiment with different dataset configurations without computing and storing multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code, unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove's data management features reduce memory consumption by a factor of 2.6. Moreover, Trove's inference pipeline incurs no overhead, and inference time decreases linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
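
To make the idea of on-the-fly dataset processing concrete, the following is a minimal sketch in plain Python. It is not Trove's API: the record layout, the function names (lazy_filter, lazy_transform, lazy_combine, build_training_view), and the score threshold are all hypothetical and chosen only for illustration. The sketch shows how filtering, transforming, and combining can be expressed as lazy generators, so that a new dataset configuration is a view over the original sources instead of a stored copy.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

# Hypothetical record layout: one (query, document, score) triple per dict.
Record = Dict[str, Any]


def lazy_filter(records: Iterable[Record], keep: Callable[[Record], bool]) -> Iterator[Record]:
    """Yield only the records that pass `keep`, without building a new list."""
    for rec in records:
        if keep(rec):
            yield rec


def lazy_transform(records: Iterable[Record], fn: Callable[[Record], Record]) -> Iterator[Record]:
    """Apply `fn` to each record as it is requested."""
    for rec in records:
        yield fn(rec)


def lazy_combine(*sources: Iterable[Record]) -> Iterator[Record]:
    """Chain several sources into one stream, in the order given."""
    for source in sources:
        yield from source


def build_training_view(
    annotated: Iterable[Record],
    mined: Iterable[Record],
    min_mined_score: float,
) -> Iterator[Record]:
    """Combine annotated positives with mined candidates relabeled as negatives.

    Nothing is computed until the returned iterator is consumed, and no
    intermediate copy of either source is created.
    """
    # Keep only annotated pairs with a positive relevance label.
    positives = lazy_filter(annotated, lambda r: r["score"] > 0)
    # Keep mined candidates above the (assumed) threshold and relabel them as 0.
    negatives = lazy_transform(
        lazy_filter(mined, lambda r: r["score"] > min_mined_score),
        lambda r: {**r, "score": 0},
    )
    return lazy_combine(positives, negatives)


if __name__ == "__main__":
    # Toy in-memory sources standing in for large files on disk.
    annotated = [
        {"qid": "q1", "docid": "d1", "score": 3},
        {"qid": "q1", "docid": "d2", "score": 0},
    ]
    mined = [
        {"qid": "q1", "docid": "d7", "score": 0.42},
        {"qid": "q2", "docid": "d9", "score": 0.10},
    ]
    for record in build_training_view(annotated, mined, min_mined_score=0.2):
        print(record)
```

Changing the threshold or swapping a source here only changes the arguments to build_training_view; the underlying data is read again instead of being rewritten to disk, which is the general property the abstract attributes to Trove's data management features.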