

Bytes Are All You Need: Transformers Operating Directly On File Bytes

May 31, 2023
Authors: Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari
cs.AI

Abstract

Modern deep learning approaches usually transform inputs into a modality-specific form. For example, the most common deep learning approach to image classification involves decoding image file bytes into an RGB tensor which is passed into a neural network. Instead, we investigate performing classification directly on file bytes, without the need for decoding files at inference time. Using file bytes as model inputs enables the development of models which can operate on multiple input modalities. Our model, ByteFormer, achieves an ImageNet Top-1 classification accuracy of 77.33% when training and testing directly on TIFF file bytes using a transformer backbone with configuration similar to DeiT-Ti (72.2% accuracy when operating on RGB images). Without modifications or hyperparameter tuning, ByteFormer achieves 95.42% classification accuracy when operating on WAV files from the Speech Commands v2 dataset (compared to state-of-the-art accuracy of 98.7%). Additionally, we demonstrate that ByteFormer has applications in privacy-preserving inference. ByteFormer is capable of performing inference on particular obfuscated input representations with no loss of accuracy. We also demonstrate ByteFormer's ability to perform inference with a hypothetical privacy-preserving camera which avoids forming full images by consistently masking 90% of pixel channels, while still achieving 71.35% accuracy on ImageNet. Our code will be made available at https://github.com/apple/ml-cvnets/tree/main/examples/byteformer.
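The core idea — treating each raw file byte as a token, the way a text transformer treats word IDs — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the transformer backbone is stubbed out with mean pooling, and all names, dimensions, and the toy input are assumptions chosen for clarity (DIM loosely follows DeiT-Ti's 192-wide embeddings).

```python
import numpy as np

# Illustrative sketch of the ByteFormer input pipeline (not the authors' code):
# every byte value (0-255) indexes a learned embedding table, so the model
# consumes file bytes directly with no modality-specific decoding.

VOCAB = 256        # one token id per possible byte value
DIM = 192          # DeiT-Ti-like embedding width (assumption)
NUM_CLASSES = 1000 # ImageNet classes

rng = np.random.default_rng(0)
byte_embedding = rng.normal(0.0, 0.02, size=(VOCAB, DIM))
classifier_w = rng.normal(0.0, 0.02, size=(DIM, NUM_CLASSES))

def classify_file_bytes(raw: bytes) -> int:
    """Map raw file bytes straight to a class id, without decoding the file."""
    token_ids = np.frombuffer(raw, dtype=np.uint8)   # (seq_len,)
    tokens = byte_embedding[token_ids]               # (seq_len, DIM)
    # Stand-in for the transformer backbone: mean-pool the token sequence.
    pooled = tokens.mean(axis=0)                     # (DIM,)
    logits = pooled @ classifier_w                   # (NUM_CLASSES,)
    return int(np.argmax(logits))

# The same function accepts any modality's bytes: a TIFF image, a WAV clip, ...
pred = classify_file_bytes(b"II*\x00" + bytes(range(64)))
```

Because the input is just a byte sequence, nothing in this pipeline changes between modalities — which is what lets the paper reuse one architecture for TIFF images and WAV audio without hyperparameter tuning.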