TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification

Abstract

Recent advances such as Chain-of-Thought prompting have significantly improved the zero-shot medical reasoning of large language models (LLMs). However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs generalize poorly under distribution shifts and adapt only to a limited degree to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8% and DeepSeek-R1 accuracy by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates. The code will be available at https://github.com/JianghaoWu/TAGS.
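
To make the aggregation idea concrete, here is a minimal sketch of how reliability scores could guide the choice of a final answer. It assumes simple reliability-weighted voting over candidate answers; the abstract does not specify the actual aggregation rule, so the `Candidate` type, the `aggregate` function, the source labels, and the score range are all hypothetical and are not taken from the TAGS code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str         # hypothetical label, e.g. "generalist" or "specialist"
    answer: str         # chosen option, e.g. "B"
    reliability: float  # hypothetical score in [0, 1] from a reliability scorer


def aggregate(candidates):
    """Return the answer whose candidates have the highest summed reliability.

    Assumption: weighted voting stands in for the paper's aggregation step.
    """
    if not candidates:
        raise ValueError("no candidates to aggregate")
    totals = defaultdict(float)
    for c in candidates:
        totals[c.answer] += c.reliability
    # Iterate answers in sorted order so ties resolve to the earliest label.
    return max(sorted(totals), key=lambda a: totals[a])
```

In this sketch a single high-reliability specialist answer can outweigh several low-reliability generalist answers, which is one plausible reading of "guiding" aggregation by reasoning consistency.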
