TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification
May 23, 2025
Authors: Jianghao Wu, Feilong Tang, Yulong Li, Ming Hu, Haochen Xue, Shoaib Jameel, Yutong Xie, Imran Razzak
cs.AI
Abstract
Recent advances such as Chain-of-Thought prompting have significantly
improved large language models (LLMs) in zero-shot medical reasoning. However,
prompting-based methods often remain shallow and unstable, while fine-tuned
medical LLMs suffer from poor generalization under distribution shifts and
limited adaptability to unseen clinical scenarios. To address these
limitations, we present TAGS, a test-time framework that combines a broadly
capable generalist with a domain-specific specialist to offer complementary
perspectives without any model fine-tuning or parameter updates. To support
this generalist-specialist reasoning process, we introduce two auxiliary
modules: a hierarchical retrieval mechanism that provides multi-scale exemplars
by selecting examples based on both semantic and rationale-level similarity,
and a reliability scorer that evaluates reasoning consistency to guide final
answer aggregation. TAGS achieves strong performance across nine MedQA
benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and
improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several
fine-tuned medical LLMs, without any parameter updates. The code will be
available at https://github.com/JianghaoWu/TAGS.
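The test-time loop described in the abstract, retrieving exemplars for the question, collecting generalist and specialist answers, and aggregating them by reasoning reliability, can be sketched as below. This is a minimal illustration only: the function names, the single-stage cosine retrieval (the paper's retrieval is hierarchical, over both semantic and rationale-level similarity), and the reliability-weighted vote are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a TAGS-style test-time pipeline.
# All names and the scoring scheme are illustrative assumptions.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve_exemplars(query_emb, bank, k=2):
    """Return the k exemplar texts whose stored embeddings are most
    similar to the query embedding (hierarchical retrieval collapsed
    to one similarity stage for brevity)."""
    ranked = sorted(bank, key=lambda e: cosine(query_emb, e["emb"]), reverse=True)
    return [e["text"] for e in ranked[:k]]

def aggregate(candidates):
    """Reliability-weighted vote over (answer, consistency_score) pairs
    produced by the generalist and specialist; the answer with the
    highest total score wins."""
    totals = {}
    for answer, score in candidates:
        totals[answer] = totals.get(answer, 0.0) + score
    return max(totals, key=totals.get)
```

For example, if the generalist answers "A" with consistency 0.9 and the specialist answers "B" with 0.4 plus "A" with 0.3 across reasoning passes, `aggregate` returns "A". In the actual framework both the exemplar selection and the reliability scorer are learned-model components rather than the fixed heuristics shown here.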