ByAbhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao
116
6
Modern artificial intelligence (AI) systems are powered by foundation models.
This paper presents a new set of foundation models, called Llama 3. It is a
herd of language models that natively support multilinguality, coding,
reasoning, and tool usage. Our largest model is a dense Transformer with 405B
parameters and a context window of up to 128K tokens. This paper presents an
extensive empirical evaluation of Llama 3. We find that Llama 3 delivers
comparable quality to leading language models such as GPT-4 on a plethora of
tasks. We publicly release Llama 3, including pre-trained and post-trained
versions of the 405B parameter language model and our Llama Guard 3 model for
input and output safety. The paper also presents the results of experiments in
which we integrate image, video, and speech capabilities into Llama 3 via a
compositional approach. We observe this approach performs competitively with
the state-of-the-art on image, video, and speech recognition tasks. The
resulting models are not yet being broadly released as they are still under
development.
ByZhenghao Zhang, Junchao Liao, Menghao Li, Long Qin, Weizhi Wang
27
2
Recent advancements in Diffusion Transformer (DiT) have demonstrated
remarkable proficiency in producing high-quality video content. Nonetheless,
the potential of transformer-based diffusion models for effectively generating
videos with controllable motion remains an area of limited exploration. This
paper introduces Tora, the first trajectory-oriented DiT framework that
integrates textual, visual, and trajectory conditions concurrently for video
generation. Specifically, Tora consists of a Trajectory Extractor~(TE), a
Spatial-Temporal DiT, and a Motion-guidance Fuser~(MGF). The TE encodes
arbitrary trajectories into hierarchical spacetime motion patches with a 3D
video compression network. The MGF integrates the motion patches into the DiT
blocks to generate consistent videos following trajectories. Our design aligns
seamlessly with DiT's scalability, allowing precise control of video content's
dynamics with diverse durations, aspect ratios, and resolutions. Extensive
experiments demonstrate Tora's excellence in achieving high motion fidelity,
while also meticulously simulating the movement of the physical world. Page can
be found at https://ali-videoai.github.io/tora_video.
ByXi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Gosh, Luke Zettlemoyer, Armen Aghajanyan
22
5
We introduce MoMa, a novel modality-aware mixture-of-experts (MoE)
architecture designed for pre-training mixed-modal, early-fusion language
models. MoMa processes images and text in arbitrary sequences by dividing
expert modules into modality-specific groups. These groups exclusively process
designated tokens while employing learned routing within each group to maintain
semantically informed adaptivity. Our empirical results reveal substantial
pre-training efficiency gains through this modality-specific parameter
allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model,
featuring 4 text experts and 4 image experts, achieves impressive FLOPs
savings: 3.7x overall, with 2.6x for text and 5.2x for image processing
compared to a compute-equivalent dense baseline, measured by pre-training loss.
This outperforms the standard expert-choice MoE with 8 mixed-modal experts,
which achieves 3x overall FLOPs savings (3x for text, 2.8x for image).
Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs
savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination
hurts performance in causal inference due to increased sensitivity to router
accuracy. These results demonstrate MoMa's potential to significantly advance
the efficiency of mixed-modal, early-fusion language model pre-training, paving
the way for more resource-efficient and capable multimodal AI systems.
ByShanbo Cheng, Zhichao Huang, Tom Ko, Hang Li, Ningxin Peng, Lu Xu, Qini Zhang
18
8
In this paper, we present Cross Language Agent -- Simultaneous
Interpretation, CLASI, a high-quality and human-like Simultaneous Speech
Translation (SiST) System. Inspired by professional human interpreters, we
utilize a novel data-driven read-write strategy to balance the translation
quality and latency. To address the challenge of translating in-domain
terminologies, CLASI employs a multi-modal retrieving module to obtain relevant
information to augment the translation. Supported by LLMs, our approach can
generate error-tolerated translation by considering the input audio, historical
context, and retrieved information. Experimental results show that our system
outperforms other systems by significant margins. Aligned with professional
human interpreters, we evaluate CLASI with a better human evaluation metric,
valid information proportion (VIP), which measures the amount of information
that can be successfully conveyed to the listeners. In the real-world
scenarios, where the speeches are often disfluent, informal, and unclear, CLASI
achieves VIP of 81.3% and 78.0% for Chinese-to-English and English-to-Chinese
translation directions, respectively. In contrast, state-of-the-art commercial
or open-source systems only achieve 35.4% and 41.6%. On the extremely hard
dataset, where other systems achieve under 13% VIP, CLASI can still achieve 70%
VIP.
ByWenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, Oscar Wahltinez
14
3
We present ShieldGemma, a comprehensive suite of LLM-based safety content
moderation models built upon Gemma2. These models provide robust,
state-of-the-art predictions of safety risks across key harm types (sexually
explicit, dangerous content, harassment, hate speech) in both user input and
LLM-generated output. By evaluating on both public and internal benchmarks, we
demonstrate superior performance compared to existing models, such as Llama
Guard (+10.8\% AU-PRC on public benchmarks) and WildCard (+4.3\%).
Additionally, we present a novel LLM-based data curation pipeline, adaptable to
a variety of safety-related tasks and beyond. We have shown strong
generalization performance for model trained mainly on synthetic data. By
releasing ShieldGemma, we provide a valuable resource to the research
community, advancing LLM safety and enabling the creation of more effective
content moderation solutions for developers.
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant
aspects of data contamination in natural language processing, where data
contamination is understood as situations where evaluation data is included in
pre-training corpora used to train large scale models, compromising evaluation
results. The workshop fostered a shared task to collect evidence on data
contamination in current available datasets and models. The goal of the shared
task and associated database is to assist the community in understanding the
extent of the problem and to assist researchers in avoiding reporting
evaluation results on known contaminated resources. The shared task provides a
structured, centralized public database for the collection of contamination
evidence, open to contributions from the community via GitHub pool requests.
This first compilation paper is based on 566 reported entries over 91
contaminated sources from a total of 23 contributors. The details of the
individual contamination events are available in the platform. The platform
continues to be online, open to contributions from the community.
Audio-visual semantic segmentation (AVSS) aims to segment and classify
sounding objects in videos with acoustic cues. However, most approaches operate
on the close-set assumption and only identify pre-defined categories from
training data, lacking the generalization ability to detect novel categories in
practical applications. In this paper, we introduce a new task: open-vocabulary
audio-visual semantic segmentation, extending AVSS task to open-world scenarios
beyond the annotated label space. This is a more challenging task that requires
recognizing all categories, even those that have never been seen nor heard
during training. Moreover, we propose the first open-vocabulary AVSS framework,
OV-AVSS, which mainly consists of two parts: 1) a universal sound source
localization module to perform audio-visual fusion and locate all potential
sounding objects and 2) an open-vocabulary classification module to predict
categories with the help of the prior knowledge from large-scale pre-trained
vision-language models. To properly evaluate the open-vocabulary AVSS, we split
zero-shot training and testing subsets based on the AVSBench-semantic
benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong
segmentation and zero-shot generalization ability of our model on all
categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base
categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art
zero-shot method by 41.88%/20.61% and open-vocabulary method by 10.2%/11.6%.
The code is available at https://github.com/ruohaoguo/ovavss.
ByGabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi
8
2
Authorship obfuscation aims to disguise the identity of an author within a
text by altering the writing style, vocabulary, syntax, and other linguistic
features associated with the text author. This alteration needs to balance
privacy and utility. While strong obfuscation techniques can effectively hide
the author's identity, they often degrade the quality and usefulness of the
text for its intended purpose. Conversely, maintaining high utility tends to
provide insufficient privacy, making it easier for an adversary to de-anonymize
the author. Thus, achieving an optimal trade-off between these two conflicting
objectives is crucial. In this paper, we propose TAROT: Task-Oriented
Authorship Obfuscation Using Policy Optimization, a new unsupervised authorship
obfuscation method whose goal is to optimize the privacy-utility trade-off by
regenerating the entire text considering its downstream utility. Our approach
leverages policy optimization as a fine-tuning paradigm over small language
models in order to rewrite texts by preserving author identity and downstream
task utility. We show that our approach largely reduce the accuracy of
attackers while preserving utility. We make our code and models publicly
available.
We introduce Berkeley Humanoid, a reliable and low-cost mid-scale humanoid
research platform for learning-based control. Our lightweight, in-house-built
robot is designed specifically for learning algorithms with low simulation
complexity, anthropomorphic motion, and high reliability against falls. The
robot's narrow sim-to-real gap enables agile and robust locomotion across
various terrains in outdoor environments, achieved with a simple reinforcement
learning controller using light domain randomization. Furthermore, we
demonstrate the robot traversing for hundreds of meters, walking on a steep
unpaved trail, and hopping with single and double legs as a testimony to its
high performance in dynamical walking. Capable of omnidirectional locomotion
and withstanding large perturbations with a compact setup, our system aims for
scalable, sim-to-real deployment of learning-based humanoid systems. Please
check http://berkeley-humanoid.com for more details.
Facial expression and hand motions are necessary to express our emotions and
interact with the world. Nevertheless, most of the 3D human avatars modeled
from a casually captured video only support body motions without facial
expressions and hand motions.In this work, we present ExAvatar, an expressive
whole-body 3D human avatar learned from a short monocular video. We design
ExAvatar as a combination of the whole-body parametric mesh model (SMPL-X) and
3D Gaussian Splatting (3DGS). The main challenges are 1) a limited diversity of
facial expressions and poses in the video and 2) the absence of 3D
observations, such as 3D scans and RGBD images. The limited diversity in the
video makes animations with novel facial expressions and poses non-trivial. In
addition, the absence of 3D observations could cause significant ambiguity in
human parts that are not observed in the video, which can result in noticeable
artifacts under novel motions. To address them, we introduce our hybrid
representation of the mesh and 3D Gaussians. Our hybrid representation treats
each 3D Gaussian as a vertex on the surface with pre-defined connectivity
information (i.e., triangle faces) between them following the mesh topology of
SMPL-X. It makes our ExAvatar animatable with novel facial expressions by
driven by the facial expression space of SMPL-X. In addition, by using
connectivity-based regularizers, we significantly reduce artifacts in novel
facial expressions and poses.
ByYuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, Jan Eric Lenssen
7
3
Current visual foundation models are trained purely on unstructured 2D data,
limiting their understanding of 3D structure of objects and scenes. In this
work, we show that fine-tuning on 3D-aware data improves the quality of
emerging semantic features. We design a method to lift semantic 2D features
into an efficient 3D Gaussian representation, which allows us to re-render them
for arbitrary views. Using the rendered 3D-aware features, we design a
fine-tuning strategy to transfer such 3D awareness into a 2D foundation model.
We demonstrate that models fine-tuned in that way produce features that readily
improve downstream task performance in semantic segmentation and depth
estimation through simple linear probing. Notably, though fined-tuned on a
single indoor dataset, the improvement is transferable to a variety of indoor
datasets and out-of-domain datasets. We hope our study encourages the community
to consider injecting 3D awareness when training 2D foundation models. Project
page: https://ywyue.github.io/FiT3D.
Incorporating a temporal dimension into pretrained image diffusion models for
video generation is a prevalent approach. However, this method is
computationally demanding and necessitates large-scale video datasets. More
critically, the heterogeneity between image and video datasets often results in
catastrophic forgetting of the image expertise. Recent attempts to directly
extract video snippets from image diffusion models have somewhat mitigated
these problems. Nevertheless, these methods can only generate brief video clips
with simple movements and fail to capture fine-grained motion or non-grid
deformation. In this paper, we propose a novel Zero-Shot video Sampling
algorithm, denoted as ZS^2, capable of directly sampling
high-quality video clips from existing image synthesis methods, such as Stable
Diffusion, without any training or optimization. Specifically, ZS^2
utilizes the dependency noise model and temporal momentum attention to ensure
content consistency and animation coherence, respectively. This ability enables
it to excel in related tasks, such as conditional and context-specialized video
generation and instruction-guided video editing. Experimental results
demonstrate that ZS^2 achieves state-of-the-art performance in
zero-shot video generation, occasionally outperforming recent supervised
methods.
Homepage: https://densechen.github.io/zss/.
Neural fields excel in computer vision and robotics due to their ability to
understand the 3D visual world such as inferring semantics, geometry, and
dynamics. Given the capabilities of neural fields in densely representing a 3D
scene from 2D images, we ask the question: Can we scale their self-supervised
pretraining, specifically using masked autoencoders, to generate effective 3D
representations from posed RGB images. Owing to the astounding success of
extending transformers to novel data modalities, we employ standard 3D Vision
Transformers to suit the unique formulation of NeRFs. We leverage NeRF's
volumetric grid as a dense input to the transformer, contrasting it with other
3D representations such as pointclouds where the information density can be
uneven, and the representation is irregular. Due to the difficulty of applying
masked autoencoders to an implicit representation, such as NeRF, we opt for
extracting an explicit representation that canonicalizes scenes across domains
by employing the camera trajectory for sampling. Our goal is made possible by
masking random patches from NeRF's radiance and density grid and employing a
standard 3D Swin Transformer to reconstruct the masked patches. In doing so,
the model can learn the semantic and spatial structure of complete scenes. We
pretrain this representation at scale on our proposed curated posed-RGB data,
totaling over 1.8 million images. Once pretrained, the encoder is used for
effective 3D transfer learning. Our novel self-supervised pretraining for
NeRFs, NeRF-MAE, scales remarkably well and improves performance on various
challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining,
NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF
scene understanding baselines on Front3D and ScanNet datasets with an absolute
performance improvement of over 20% AP50 and 8% AP25 for 3D object detection.