Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

Abstract

Industrial asset operations workflows are latency-sensitive because a single user query may require coordination across sensor data, work orders, failure modes, forecasting tools, and domain-specific agents. We evaluate this problem on AssetOpsBench (AOB), an industrial agent benchmark whose plan-execute pipeline incurs repeated overhead from tool discovery, LLM planning, MCP tool execution, and final summarization. Existing LLM caching techniques, such as KV-cache reuse and embedding-based semantic caching, were designed for chatbot serving and break down when output validity depends on time, asset, or sensor parameters. We propose two complementary optimization layers for AOB plan-execute pipelines: a temporal semantic cache, and a set of MCP workflow optimizations that combine disk-backed tool-discovery caching with dependency-aware parallel step execution. The MCP workflow optimizations yielded a 1.67x speedup, reducing median end-to-end latency by about 40.0%, while the temporal-cache benchmark achieved a median 30.6x speedup on cache hits. Beyond these speedups, our results expose a concrete failure mode of pure semantic caching for parameter-rich industrial queries, and we provide a critical analysis of how caching choices interact with evaluation correctness in MCP-backed agent benchmarks.
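
To make the failure mode of pure semantic caching concrete, the following is a minimal sketch, not the implementation evaluated here. It assumes a cache entry may be reused only when three conditions hold: the entry has not expired, its structured parameters match the new query exactly, and the query text is similar enough. The class `TemporalSemanticCache`, the parameter tuple layout (asset, sensor, window start, window end), the TTL policy, and the token-overlap `jaccard` function standing in for an embedding similarity are all hypothetical choices made for illustration.

```python
import time
from dataclasses import dataclass, field


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: overlap of token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0


@dataclass
class Entry:
    query: str
    params: tuple      # hypothetical layout: (asset, sensor, window_start, window_end)
    answer: str
    expires_at: float  # absolute time after which the entry is stale


@dataclass
class TemporalSemanticCache:
    threshold: float = 0.8      # assumed similarity cutoff
    ttl_seconds: float = 300.0  # assumed time-to-live
    entries: list = field(default_factory=list)

    def put(self, query, params, answer, now=None):
        now = time.time() if now is None else now
        self.entries.append(Entry(query, params, answer, now + self.ttl_seconds))

    def get(self, query, params, now=None):
        now = time.time() if now is None else now
        for e in self.entries:
            if e.expires_at <= now:
                continue  # temporal check: stale entries are never reused
            if e.params != params:
                continue  # parameter check: asset, sensor, and window must match
            if jaccard(e.query, query) >= self.threshold:
                return e.answer  # similar wording, same parameters, still fresh
        return None
```

Under these assumptions, two queries that differ only in the asset identifier can score above the similarity threshold, so a cache keyed on text similarity alone would return the wrong asset's answer. The exact-match parameter check turns that case into a miss, and the expiry check does the same for answers that have outlived their validity window.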