FRAPPE: Full-Input, Residual-Output Autoencoding with a Projection Pursuit Encoder

Abstract

Media compression standards have reached a plateau in the rate-distortion-complexity trade-off, limiting the ability to offload expensive AI perception to the cloud in applications such as robotics, wearables, and remote sensing. DNN-based codecs improve compression efficiency, but at a cost: they cannot easily adapt to large changes in available bitrate, and real-time encoding requires expensive, power-hungry GPUs, which rules out low-cost or resource-constrained platforms. To address these limitations, we propose a novel autoencoding framework (FRAPPE) that uses the Full input to predict the Residual output via a Projection Pursuit Encoder. FRAPPE's encoding objective naturally sorts latent channels by importance, allowing zero-overhead variable-rate coding. Unlike RNN-based learned codecs, whose encoder consumes the residual of the previous reconstruction, or RVQ-style codecs, whose codebooks must be applied sequentially, FRAPPE's analysis path is an embarrassingly parallel DAG (directed acyclic graph) of independent input projections. Using FRAPPE, we build a variable-rate RGB image codec (FRAPPE-Image) and evaluate its rate-distortion-complexity trade-off against standard image codecs. At high compression ratios (approximately 0.1 bpp), FRAPPE-Image provides higher perceptual quality than AVIF while encoding 47 times faster, making real-time 1080p, 30 fps encoding possible on a CPU alone. Our code and pre-trained models are available at https://github.com/UT-SysML/FRAPPE
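The two structural properties claimed above (latent channels ordered by importance, and an analysis path made of independent projections of the full input) can be illustrated with a small linear analogue. The sketch below is not FRAPPE's learned encoder: it substitutes PCA for the projection pursuit objective, and the names fit_projections, encode, and decode are hypothetical. It only shows why importance-ordered channels allow a decoder to lower the rate by dropping trailing channels, with no re-encoding.

```python
import numpy as np

def fit_projections(x, n_channels):
    """Fit orthonormal directions ordered by explained variance (PCA stand-in, an assumption)."""
    mean = x.mean(axis=0)
    # Rows of vt are directions sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_channels]

def encode(x, mean, w):
    # Every channel is a projection of the full input, so all channels
    # come from one matrix product with no sequential dependency.
    return (x - mean) @ w.T

def decode(z, mean, w, k):
    # Reconstruct from the first k channels only; each extra channel
    # adds a refinement on top of the coarser reconstruction.
    return mean + z[:, :k] @ w[:k]

rng = np.random.default_rng(0)
# Synthetic data whose per-dimension scale decays, so channels differ in importance.
x = rng.normal(size=(512, 16)) * np.linspace(4.0, 0.25, 16)

mean, w = fit_projections(x, n_channels=8)
z = encode(x, mean, w)

# Mean squared error when keeping the first k channels, k = 0..8.
errors = [float(np.mean((x - decode(z, mean, w, k)) ** 2)) for k in range(9)]
for k, err in enumerate(errors):
    print(f"channels kept: {k}  mse: {err:.4f}")
```

In this analogue the error never increases as more channels are kept, so truncating the latent vector to a prefix acts as the rate control. How FRAPPE's training objective achieves a comparable ordering with a nonlinear codec is described in the paper and repository, not in this sketch.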