FRAPPE: Full-Input, Residual-Output Autoencoding with a Projection Pursuit Encoder

Abstract

Media compression standards have reached a plateau in the rate-distortion-complexity trade-off, limiting the ability to offload expensive AI perception to the cloud in applications such as robotics, wearables, and remote sensing. DNN-based codecs improve compression efficiency, but at a cost: they cannot easily adapt to large changes in available bitrate, and real-time encoding requires expensive, power-hungry GPUs, which rules out low-cost or resource-constrained platforms. To address these limitations, we propose a novel autoencoding framework (FRAPPE) that uses the Full input to predict the Residual output via a Projection Pursuit Encoder. FRAPPE's encoding objective naturally sorts latent channels by importance, allowing zero-overhead variable-rate coding. Unlike RNN-based learned codecs, whose encoder consumes the previous reconstruction's residual, or RVQ-style codecs, whose codebooks must be applied sequentially, FRAPPE's analysis path is an embarrassingly parallel DAG of independent input projections. Using FRAPPE, we build a variable-rate RGB image codec (FRAPPE-Image) and evaluate its rate-distortion-complexity trade-off against standard image codecs. At high compression ratios (approximately 0.1 bpp), FRAPPE-Image provides higher perceptual quality than AVIF while encoding 47 times faster, making it capable of real-time, CPU-only encoding of 1080p video at 30 fps. Our code and pre-trained models are available at https://github.com/UT-SysML/FRAPPE.
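
To make the two structural claims above concrete (latent channels ordered by importance, and an analysis path built from independent projections of the full input), the following is a minimal linear toy sketch. It is not the FRAPPE architecture: it assumes plain linear projections obtained from an SVD (i.e., PCA), synthetic Gaussian data, and hypothetical helper names `encode` and `decode`. In this linear special case, projecting the full input onto direction k gives the same code as projecting the residual left by the first k channels, so all channels can be computed in parallel, and dropping trailing channels lowers the rate without any re-encoding.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 512 samples, 16 dimensions,
# with variance decaying across axes so that some directions matter more.
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 16)) * np.linspace(4.0, 0.25, 16)
X -= X.mean(axis=0)

# Linear stand-in for a projection pursuit encoder: orthonormal directions
# sorted by explained variance (SVD returns singular values in descending order).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt.T  # column k is the k-th projection direction


def encode(x, W):
    """Every latent channel is an independent projection of the full input."""
    return x @ W


def decode(z, W, k):
    """Reconstruct from the first k channels; each channel adds a correction."""
    return z[:, :k] @ W[:, :k].T


Z = encode(X, W)

# Variable-rate behaviour: mean squared error when keeping k = 0..16 channels.
errors = [float(np.mean((X - decode(Z, W, k)) ** 2)) for k in range(W.shape[1] + 1)]

# Full-input vs. residual-input check: channel k computed from the residual
# of the first k channels matches channel k computed from the full input.
k = 5
residual = X - decode(Z, W, k)
code_from_residual = residual @ W[:, k]
code_from_full_input = Z[:, k]
```

In this toy setting the error sequence never increases as channels are added, which is the property that allows a decoder to stop at any prefix of the latent channels. A learned, nonlinear encoder would need a training objective to obtain the same ordering; the abstract states that FRAPPE's objective provides it, but the details are not given here.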