DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Abstract

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation, together with a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
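
For readers unfamiliar with the NDCG@10 metric reported above, the following is a minimal sketch of how it is commonly computed for a single query. It is not taken from the paper: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, the relevance labels are invented for illustration, and the linear-gain form of DCG is an assumption (the paper may use a different variant, such as exponential gain).

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items of a ranked list.

    `relevances[i]` is the relevance label of the item at rank i + 1.
    Linear gain is assumed here; other DCG variants use 2**rel - 1.
    """
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical query whose single relevant R function is retrieved at rank 2.
score = ndcg_at_k([0, 1, 0, 0, 0])
print(round(score, 4))
```

A benchmark score such as the one quoted in the abstract would typically be the mean of this per-query value over all evaluation queries.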
