DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

March 5, 2026
Authors: Maojun Sun, Yue Wu, Yifei Xie, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang
cs.AI

Abstract

Large Language Model (LLM) agents can automate data science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore the data distribution, which leads to suboptimal retrieval results. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation, accompanied by a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47% on R package retrieval, outperforming state-of-the-art open-source embedding models by up to 17% while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
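For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG@10 is commonly computed for a single query. It is not the authors' evaluation code. The function names, the linear-gain variant of DCG, and the example relevance list are all assumptions made for illustration, and the ideal ranking is taken over the same list of judged items.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical query: the single relevant R function is ranked second of ten.
ranked_relevance = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
score = ndcg_at_k(ranked_relevance, k=10)
print(round(score, 4))
```

A benchmark-level figure such as the one quoted in the abstract would typically be the mean of this per-query score over all evaluation queries; the exact averaging used in the paper is not stated here.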