Bytes Are All You Need: Transformers Operating Directly On File Bytes

May 31, 2023
Authors: Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari
cs.AI

Abstract

Modern deep learning approaches usually transform inputs into a modality-specific form. For example, the most common deep learning approach to image classification involves decoding image file bytes into an RGB tensor, which is then passed into a neural network. Instead, we investigate performing classification directly on file bytes, without the need to decode files at inference time. Using file bytes as model inputs enables the development of models that can operate on multiple input modalities. Our model, ByteFormer, achieves an ImageNet Top-1 classification accuracy of 77.33% when training and testing directly on TIFF file bytes, using a transformer backbone with a configuration similar to DeiT-Ti (72.2% accuracy when operating on RGB images). Without modifications or hyperparameter tuning, ByteFormer achieves 95.42% classification accuracy when operating on WAV files from the Speech Commands v2 dataset (compared to a state-of-the-art accuracy of 98.7%). Additionally, we demonstrate that ByteFormer has applications in privacy-preserving inference: it can perform inference on particular obfuscated input representations with no loss of accuracy. We also demonstrate ByteFormer's ability to perform inference with a hypothetical privacy-preserving camera that avoids forming full images by consistently masking 90% of pixel channels, while still achieving 71.35% accuracy on ImageNet. Our code will be made available at https://github.com/apple/ml-cvnets/tree/main/examples/byteformer.
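
To make the setup concrete, here is a minimal sketch (ours, not the paper's implementation) of a byte-level transformer classifier in PyTorch. The DeiT-Ti-like sizes (192-dim embeddings, 12 layers, 3 heads) follow the abstract; the strided-convolution downsampling, mean pooling, sequence truncation, and all other hyperparameters are illustrative assumptions. The actual ByteFormer code is at the GitHub URL above.

```python
# A sketch of classifying files directly from their raw bytes, assuming a
# simple byte-embedding + downsampling + transformer-encoder pipeline.
import torch
import torch.nn as nn


class ByteClassifier(nn.Module):
    """Classify a file by feeding its raw bytes to a transformer encoder."""

    def __init__(self, num_classes: int = 1000, dim: int = 192,
                 depth: int = 12, heads: int = 3, max_tokens: int = 4096):
        super().__init__()
        # 256 possible byte values, plus one id reserved for padding/masking.
        self.byte_embed = nn.Embedding(257, dim, padding_idx=256)
        # Strided 1-D conv to shorten the very long byte sequence; a stand-in
        # (our assumption) for the paper's token-downsampling step.
        self.downsample = nn.Conv1d(dim, dim, kernel_size=8, stride=4)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_tokens, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len) integers in [0, 256].
        x = self.byte_embed(byte_ids)                           # (B, L, D)
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)  # (B, L', D)
        x = x + self.pos_embed[:, : x.shape[1]]
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # mean-pool tokens, then classify


def bytes_to_tensor(path: str, max_len: int = 16384) -> torch.Tensor:
    """Read any file (e.g. a .tiff or .wav) as a sequence of byte ids."""
    with open(path, "rb") as f:
        data = f.read()[:max_len]  # truncation is a simplifying assumption
    return torch.tensor(list(data), dtype=torch.long).unsqueeze(0)


# Usage: the same model and input pipeline apply to any file type.
# model = ByteClassifier()
# logits = model(bytes_to_tensor("example.tiff"))
```

Because the input is just a byte sequence, nothing in this pipeline is specific to images: swapping the TIFF file for a WAV file changes only the data, not the model, which is the property the abstract highlights.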