Signals: Trajectory Sampling and Triage for Agentic Interactions

Abstract

Agentic applications based on large language models increasingly rely on multi-step interaction loops involving planning, action execution, and environment feedback. While such systems are now deployed at scale, improving them post-deployment remains challenging. Agent trajectories are voluminous and non-deterministic, and reviewing each one, whether by humans or by auxiliary LLMs, is slow and cost-prohibitive. We propose a lightweight, signal-based framework for triaging agentic interaction trajectories. Our approach computes cheap, broadly applicable signals from live interactions and attaches them to each trajectory as structured attributes. These attributes drive triage, identifying interactions likely to be informative without affecting online agent behavior. We organize signals into a coarse-grained taxonomy spanning interaction (misalignment, stagnation, disengagement, satisfaction), execution (failure, loop), and environment (exhaustion), all designed to be computed without model calls. In a controlled annotation study on τ-bench, a widely used benchmark for tool-augmented agent evaluation, we show that signal-based sampling achieves an 82% informativeness rate, compared to 74% for heuristic filtering and 54% for random sampling, with a 1.52x efficiency gain per informative trajectory. The advantage is robust across reward strata and task domains, confirming that signals provide genuine per-trajectory informativeness gains rather than merely oversampling obvious failures. These results show that lightweight signals can serve as practical sampling infrastructure for agentic systems, and suggest a path toward preference data construction and post-deployment optimization.
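To make the idea of model-free signals concrete, the following is a minimal sketch rather than the framework itself. The signal names follow the taxonomy in the abstract; everything else is an assumption made for illustration: the `Step` record, the `detect_signals` rules, the `loop_threshold` parameter, and the reading of the 1.52x figure as the ratio of the signal-based to the random informativeness rate (82% / 54%), which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    # Names taken from the abstract's taxonomy; the dotted prefixes are illustrative.
    MISALIGNMENT = "interaction.misalignment"
    STAGNATION = "interaction.stagnation"
    DISENGAGEMENT = "interaction.disengagement"
    SATISFACTION = "interaction.satisfaction"
    FAILURE = "execution.failure"
    LOOP = "execution.loop"
    EXHAUSTION = "environment.exhaustion"


@dataclass
class Step:
    """Hypothetical record of one agent action; not a schema from the paper."""
    tool: str            # name of the tool the agent called
    args: str            # serialized arguments of the call
    error: bool = False  # whether the call returned an error


def detect_signals(steps, loop_threshold=3):
    """Fire execution signals with plain rules, without any model call.

    Both rules are assumptions chosen for this sketch:
    - FAILURE fires if any step reported an error.
    - LOOP fires if the same (tool, args) pair repeats consecutively
      at least `loop_threshold` times.
    """
    fired = set()
    if any(step.error for step in steps):
        fired.add(Signal.FAILURE)

    run = 1
    for prev, cur in zip(steps, steps[1:]):
        same_call = (prev.tool, prev.args) == (cur.tool, cur.args)
        run = run + 1 if same_call else 1
        if run >= loop_threshold:
            fired.add(Signal.LOOP)
            break
    return fired


def efficiency_gain(rate, baseline_rate):
    """Ratio of informativeness rates between two sampling strategies."""
    return rate / baseline_rate


trajectory = [Step("search", "q=refund")] * 3 + [Step("cancel", "id=7", error=True)]
print(sorted(s.value for s in detect_signals(trajectory)))
print(round(efficiency_gain(0.82, 0.54), 2))  # 1.52 under the assumed reading
```

Under this reading, a reviewer sampling from signal-flagged trajectories would need roughly two thirds as many reviews as random sampling to collect the same number of informative trajectories.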
