SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction

Abstract

In robotics systems, low-cost, low-power hardware can easily capture vast amounts of high-resolution visual data. Yet limited bandwidth and on-device compute prevent this data from being fully utilized when it is transmitted with conventional codecs such as JPEG/MPEG. Newer codecs such as AV1/AVIF improve the rate-distortion trade-off, but they demand far more resources for encoding, which makes them impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but they add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that increases accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1, compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT-SysML/seaotter.
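
To make the idea of a JPEG color and quantization transform with adjustable parameters more concrete, the following is a minimal sketch and not the authors' implementation. It treats the 3x3 color matrix and the 8x8 quantization table of a JPEG-style pipeline as plain parameters that a training procedure could adjust. All names (`color_matrix`, `qtable`, `transcode_block`) are hypothetical, the starting matrix is the standard JFIF RGB-to-YCbCr conversion without its chroma offsets, the flat table value of 10 is arbitrary, and no training loop is shown.

```python
import math

N = 8  # JPEG operates on 8x8 blocks

# Orthonormal DCT-II basis C[k][n]; its transpose is its inverse.
C = [[(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
      * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
      for n in range(N)] for k in range(N)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def color_transform(rgb, matrix):
    """Apply a 3x3 color matrix to one RGB pixel (chroma offsets omitted)."""
    return [sum(m * v for m, v in zip(row, rgb)) for row in matrix]


def quantize(coeffs, qtable):
    """Divide each DCT coefficient by its table entry and round to an integer."""
    return [[round(c / q) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, qtable)]


def dequantize(levels, qtable):
    """Scale integer levels back by the same table entries."""
    return [[l * q for l, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, qtable)]


def transcode_block(block, qtable):
    """DCT -> quantize -> dequantize -> inverse DCT for one 8x8 block."""
    coeffs = matmul(matmul(C, block), transpose(C))
    levels = quantize(coeffs, qtable)
    restored = dequantize(levels, qtable)
    recon = matmul(matmul(transpose(C), restored), C)
    return levels, recon


# Starting values for the two adjustable parameters (assumed, for illustration).
color_matrix = [
    [0.299, 0.587, 0.114],            # Y
    [-0.168736, -0.331264, 0.5],      # Cb
    [0.5, -0.418688, -0.081312],      # Cr
]
qtable = [[10.0] * N for _ in range(N)]

# A constant block keeps only its DC coefficient after quantization.
flat_block = [[50.0] * N for _ in range(N)]
levels, recon = transcode_block(flat_block, qtable)
```

Changing entries of `color_matrix` or `qtable` changes which information survives rounding, which is the degree of freedom a learned transform would exploit. Because rounding has zero gradient almost everywhere, gradient-based training of these parameters would also need a differentiable stand-in for `round`; the abstract does not say how SEAOTTER handles this.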