RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation

Abstract

Large language models (LLMs) are increasingly used across the scientific workflow, including to draft peer-review reports. However, many AI-generated reviews are superficial and insufficiently actionable, leaving authors without concrete, implementable guidance. This work addresses that gap. We propose RbtAct, which targets actionable review feedback generation and places existing peer-review rebuttals at the center of learning. Rebuttals show which reviewer comments led to concrete revisions or specific plans, and which were only defended. Building on this insight, we use rebuttals as implicit supervision to directly optimize a feedback generator for actionability. To support this objective, we propose a new task, perspective-conditioned segment-level review feedback generation, in which the model must produce a single focused comment based on the complete paper and a specified perspective, such as experiments or writing. We also build RMR-75K, a large dataset that maps review segments to the rebuttal segments that address them, with perspective labels and impact categories that order author uptake. We then train Llama-3.1-8B-Instruct with supervised fine-tuning on review segments, followed by preference optimization on rebuttal-derived pairs. Experiments with human experts and LLM-as-a-judge show consistent gains in actionability and specificity over strong baselines while maintaining grounding and relevance.
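
As a rough illustration of how rebuttal-derived preference pairs might be formed, the sketch below pairs review segments from the same paper and perspective, preferring the segment whose rebuttal shows stronger author uptake. The abstract does not specify the RMR-75K schema or the pairing rule, so the category names (`defended_only`, `plan_promised`, `revision_made`), the `ReviewSegment` fields, and the function `build_preference_pairs` are all hypothetical assumptions, not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations

# Hypothetical ordering of impact categories (higher = stronger author uptake).
# The real RMR-75K category names are not given in the abstract.
IMPACT_RANK = {"defended_only": 0, "plan_promised": 1, "revision_made": 2}


@dataclass(frozen=True)
class ReviewSegment:
    """One review segment with assumed (illustrative) metadata fields."""
    paper_id: str
    perspective: str  # e.g. "experiments" or "writing"
    text: str
    impact: str       # key into IMPACT_RANK


def build_preference_pairs(segments):
    """Pair segments sharing a paper and perspective; prefer higher impact."""
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg.paper_id, seg.perspective)].append(seg)

    pairs = []
    for (paper_id, perspective), group in groups.items():
        for a, b in combinations(group, 2):
            rank_a, rank_b = IMPACT_RANK[a.impact], IMPACT_RANK[b.impact]
            if rank_a == rank_b:
                continue  # equal uptake gives no preference signal
            chosen, rejected = (a, b) if rank_a > rank_b else (b, a)
            pairs.append({
                "paper_id": paper_id,
                "perspective": perspective,
                "chosen": chosen.text,
                "rejected": rejected.text,
            })
    return pairs
```

Records with `chosen` and `rejected` fields follow the shape commonly expected by preference-optimization trainers; a prompt built from the paper and the perspective would still need to be attached to each pair before training.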
