SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcoding for Efficient Reconstruction

Abstract

In robotic systems, vast amounts of high-resolution visual data are easily captured using low-cost, low-power hardware. Yet limited bandwidth and on-device compute prevent this data from being fully utilized when it is transmitted via conventional codecs such as JPEG/MPEG. Newer codecs such as AV1/AVIF improve the rate-distortion trade-off but demand far more resources for encoding, making them impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but they add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that improves accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1, compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT-SysML/seaotter .
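
The abstract does not spell out the form of the learnable JPEG color and quantization transform, so the following is only a minimal sketch of where such parameters could sit in a JPEG-like pipeline. It assumes the two learnable pieces are a 3x3 color matrix (initialized here to the standard JFIF RGB-to-YCbCr matrix) and an 8x8 table of quantization step sizes shared across channels. The names `COLOR_INIT`, `dct_matrix`, and `jpeg_like_roundtrip` are hypothetical and not taken from the SEAOTTER code; the sketch shows the forward pass only and omits level shifting, chroma subsampling, entropy coding, and any training loop.

```python
import numpy as np

# Assumption: start from the standard JFIF (BT.601) RGB -> YCbCr matrix.
# In a learned setting, this matrix would be a trainable parameter.
COLOR_INIT = np.array([
    [0.299, 0.587, 0.114],
    [-0.168736, -0.331264, 0.5],
    [0.5, -0.418688, -0.081312],
])


def dct_matrix(n=8):
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m


def jpeg_like_roundtrip(rgb, color, qtable):
    """Quantize and reconstruct an image with a JPEG-like transform.

    rgb:    (H, W, 3) float array, H and W multiples of 8.
    color:  (3, 3) invertible color matrix (hypothetically learnable).
    qtable: (8, 8) positive quantization steps (hypothetically learnable).
    Returns the reconstructed image and the integer-valued coefficients.
    """
    h, w, _ = rgb.shape
    d = dct_matrix(8)

    # Per-pixel color transform.
    ycc = rgb @ color.T

    # Split into 8x8 blocks: shape (H/8, W/8, 8, 8, 3).
    blocks = ycc.reshape(h // 8, 8, w // 8, 8, 3).transpose(0, 2, 1, 3, 4)

    # 2-D DCT of every block and channel: D @ X @ D.T.
    coeffs = np.einsum("ij,abjkc,lk->abilc", d, blocks, d)

    # Quantize with the step table, then dequantize.
    q = np.round(coeffs / qtable[:, :, None])
    deq = q * qtable[:, :, None]

    # Inverse 2-D DCT: D.T @ C @ D, then reassemble the image.
    rec_blocks = np.einsum("ji,abjkc,kl->abilc", d, deq, d)
    rec_ycc = rec_blocks.transpose(0, 2, 1, 3, 4).reshape(h, w, 3)

    # Invert the color transform.
    rec_rgb = rec_ycc @ np.linalg.inv(color).T
    return rec_rgb, q
```

In a trainable version, the rounding step would need a differentiable surrogate (for example a straight-through estimator), and `color` and `qtable` would be optimized against a reconstruction or task loss; both of these are assumptions about how such a transform might be trained, not details given in the abstract.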