By Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T. Barron, Aleksander Holynski
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular
video. CAT4D leverages a multi-view video diffusion model trained on a diverse
combination of datasets to enable novel view synthesis at any specified camera
poses and timestamps. Combined with a novel sampling approach, this model can
transform a single monocular video into a multi-view video, enabling robust 4D
reconstruction via optimization of a deformable 3D Gaussian representation. We
demonstrate competitive performance on novel view synthesis and dynamic scene
reconstruction benchmarks, and highlight the creative capabilities for 4D scene
generation from real or generated videos. See our project page for results and
interactive demos: cat-4d.github.io.
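To make the "deformable 3D Gaussian representation" concrete, here is a minimal, hypothetical sketch (not the CAT4D code): a set of canonical Gaussians whose means are offset by a time-conditioned deformation field before being rendered at each requested camera pose and timestamp. The tiny random MLP stands in for the learned deformation network.

```python
import numpy as np

# Hypothetical illustration: canonical 3D Gaussians plus a time-conditioned
# deformation field, the core of a deformable 3D Gaussian scene.
rng = np.random.default_rng(0)
N = 1024                                   # number of Gaussians
canonical_means = rng.normal(size=(N, 3))  # canonical positions
scales = np.exp(rng.normal(size=(N, 3)) * 0.1)
colors = rng.uniform(size=(N, 3))

# Stand-in deformation field: a tiny random MLP mapping (xyz, t) -> offset.
W1 = rng.normal(size=(4, 64)) * 0.1
W2 = rng.normal(size=(64, 3)) * 0.1

def deform(means, t):
    """Offset canonical means to their positions at normalized time t."""
    x = np.concatenate([means, np.full((len(means), 1), t)], axis=1)
    h = np.tanh(x @ W1)
    return means + h @ W2

# For each (camera, timestamp) pair requested during reconstruction, the
# deformed Gaussians would be splatted by a differentiable rasterizer.
for t in np.linspace(0.0, 1.0, 4):
    means_t = deform(canonical_means, t)
    drift = np.linalg.norm(means_t - canonical_means, axis=1).mean()
    print(f"t={t:.2f}  mean offset norm: {drift:.4f}")
```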
GUIs have long been central to human-computer interaction, providing an
intuitive and visually driven way to access and interact with digital systems.
The advent of LLMs, particularly multimodal models, has ushered in a new era of
GUI automation: these models have demonstrated exceptional capabilities in
natural language understanding, code generation, and visual processing. This has paved
the way for a new generation of LLM-brained GUI agents capable of interpreting
complex GUI elements and autonomously executing actions based on natural
language instructions. These agents represent a paradigm shift, enabling users
to perform intricate, multi-step tasks through simple conversational commands.
Their applications span across web navigation, mobile app interactions, and
desktop automation, offering a transformative user experience that
revolutionizes how individuals interact with software. This emerging field is
rapidly advancing, with significant progress in both research and industry.
To provide a structured understanding of this trend, this paper presents a
comprehensive survey of LLM-brained GUI agents, exploring their historical
evolution, core components, and advanced techniques. We address research
questions such as the design of existing GUI agent frameworks, the collection and utilization
of data for training specialized GUI agents, the development of large action
models tailored for GUI tasks, and the evaluation metrics and benchmarks
necessary to assess their effectiveness. Additionally, we examine emerging
applications powered by these agents. Through a detailed analysis, this survey
identifies key research gaps and outlines a roadmap for future advancements in
the field. By consolidating foundational knowledge and state-of-the-art
developments, this work aims to guide both researchers and practitioners in
overcoming challenges and unlocking the full potential of LLM-brained GUI
agents.
By Sankalp Sinha, Mohammad Sadil Khan, Muhammad Usama, Shino Sam, Didier Stricker, Sk Aziz Ali, Muhammad Zeshan Afzal
Generating high-fidelity 3D content from text prompts remains a significant
challenge in computer vision due to the limited size, diversity, and annotation
depth of the existing datasets. To address this, we introduce MARVEL-40M+, an
extensive dataset with 40 million text annotations for over 8.9 million 3D
assets aggregated from seven major 3D datasets. Our contribution is a novel
multi-stage annotation pipeline that integrates open-source pretrained
multi-view VLMs and LLMs to automatically produce multi-level descriptions,
ranging from detailed descriptions (150-200 words) to concise semantic tags (10-20 words).
This structure supports both fine-grained 3D reconstruction and rapid
prototyping. Furthermore, we incorporate human metadata from source datasets
into our annotation pipeline to add domain-specific information to our
annotations and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D,
a two-stage text-to-3D pipeline. We fine-tune Stable Diffusion with our
annotations and use a pretrained image-to-3D network to generate 3D textured
meshes within 15s. Extensive evaluations show that MARVEL-40M+ significantly
outperforms existing datasets in annotation quality and linguistic diversity,
achieving win rates of 72.41% as judged by GPT-4 and 73.40% as judged by human evaluators.
By Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, Gordon Wetzstein
Text-to-image diffusion models produce impressive results but are frustrating
tools for artists who desire fine-grained control. For example, a common use
case is to create images of a specific instance in novel contexts, i.e.,
"identity-preserving generation". This setting, along with many other tasks
(e.g., relighting), is a natural fit for image+text-conditional generative
models. However, there is insufficient high-quality paired data to train such a
model directly. We propose Diffusion Self-Distillation, a method for using a
pre-trained text-to-image model to generate its own dataset for
text-conditioned image-to-image tasks. We first leverage a text-to-image
diffusion model's in-context generation ability to create grids of images and
curate a large paired dataset with the help of a Visual-Language Model. We then
fine-tune the text-to-image model into a text+image-to-image model using the
curated paired dataset. We demonstrate that Diffusion Self-Distillation
outperforms existing zero-shot methods and is competitive with per-instance
tuning techniques on a wide range of identity-preserving generation tasks,
without requiring test-time optimization.
By Jan Held, Renaud Vandeghen, Abdullah Hamdi, Adrien Deliege, Anthony Cioppa, Silvio Giancola, Andrea Vedaldi, Bernard Ghanem, Marc Van Droogenbroeck
Recent advances in radiance field reconstruction, such as 3D Gaussian
Splatting (3DGS), have achieved high-quality novel view synthesis and fast
rendering by representing scenes with compositions of Gaussian primitives.
However, 3D Gaussians present several limitations for scene reconstruction.
Accurately capturing hard edges is challenging without significantly increasing
the number of Gaussians, creating a large memory footprint. Moreover, they
struggle to represent flat surfaces, as they are diffused in space. Without
hand-crafted regularizers, they tend to disperse irregularly around the actual
surface. To circumvent these issues, we introduce a novel method, named 3D
Convex Splatting (3DCS), which leverages 3D smooth convexes as primitives for
modeling geometrically-meaningful radiance fields from multi-view images.
Smooth convex shapes offer greater flexibility than Gaussians, allowing for a
better representation of 3D scenes with hard edges and dense volumes using
fewer primitives. Powered by our efficient CUDA-based rasterizer, 3DCS achieves
superior performance over 3DGS on benchmarks such as Mip-NeRF360, Tanks and
Temples, and Deep Blending. Specifically, our method attains an improvement of
up to 0.81 in PSNR and 0.026 in LPIPS compared to 3DGS while maintaining high
rendering speeds and reducing the number of required primitives. Our results
highlight the potential of 3D Convex Splatting to become the new standard for
high-quality scene reconstruction and novel view synthesis. Project page:
convexsplatting.github.io.
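As a rough intuition for the smooth convex primitive (this is an illustrative sketch, not the paper's CUDA rasterizer), a convex region can be written as an intersection of half-spaces and softened with a log-sum-exp (smooth max), giving a differentiable density with sharp, controllable edges. The plane parameters below are made up for the example.

```python
import numpy as np

def smooth_convex_density(points, normals, offsets, sharpness=20.0):
    """Soft indicator of the convex region {x : n_i . x <= d_i for all i}.

    points:  (P, 3) query points
    normals: (K, 3) outward unit normals of the bounding half-spaces
    offsets: (K,)   plane offsets d_i
    Returns values in (0, 1): near 1 deep inside the convex, near 0 outside.
    """
    # Signed distance to each plane (positive = outside that half-space).
    sd = points @ normals.T - offsets                      # (P, K)
    # Smooth maximum over planes approximates the convex's signed distance.
    smooth_max = np.log(np.exp(sharpness * sd).sum(axis=1)) / sharpness
    return 1.0 / (1.0 + np.exp(sharpness * smooth_max))    # sigmoid falloff

# Example: a softened unit cube built from 6 axis-aligned half-spaces.
normals = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                    [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
offsets = np.full(6, 0.5)
queries = np.array([[0.0, 0.0, 0.0], [0.45, 0.0, 0.0], [1.0, 1.0, 1.0]])
print(smooth_convex_density(queries, normals, offsets))
```

Increasing the sharpness parameter hardens the boundary, which is one way such primitives can capture hard edges with fewer elements than diffuse Gaussians.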
Recently, diffusion models have emerged as a powerful generative technique
for robotic policy learning, capable of modeling multi-mode action
distributions. Leveraging this capability for end-to-end autonomous driving is a
promising direction. However, the numerous denoising steps in the robotic
promising direction. However, the numerous denoising steps in the robotic
diffusion policy and the more dynamic, open-world nature of traffic scenes pose
substantial challenges for generating diverse driving actions at a real-time
speed. To address these challenges, we propose a novel truncated diffusion
policy that incorporates prior multi-mode anchors and truncates the diffusion
schedule, enabling the model to learn denoising from anchored Gaussian
distribution to the multi-mode driving action distribution. Additionally, we
design an efficient cascade diffusion decoder for enhanced interaction with
conditional scene context. The proposed model, DiffusionDrive, demonstrates a
10x reduction in denoising steps compared to the vanilla diffusion policy,
delivering superior diversity and quality in just 2 steps. On the
planning-oriented NAVSIM dataset, with the aligned ResNet-34 backbone,
DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new
record, while running at a real-time speed of 45 FPS on an NVIDIA 4090.
Qualitative results on challenging scenarios further confirm that
DiffusionDrive can robustly generate diverse plausible driving actions. Code
and model will be available at https://github.com/hustvl/DiffusionDrive.
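The following toy sketch illustrates the truncated-diffusion idea described above (the anchors, noise schedule, and denoiser are all hypothetical stand-ins, not the DiffusionDrive implementation): sampling starts from Gaussians centered on multi-mode trajectory anchors and runs only a couple of denoising steps instead of a full schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
HORIZON = 8                       # waypoints per trajectory
anchors = np.stack([              # toy multi-mode anchors: straight, left, right
    np.linspace([0, 0], [0, 8], HORIZON),
    np.linspace([0, 0], [-3, 7], HORIZON),
    np.linspace([0, 0], [3, 7], HORIZON),
])

def denoiser(noisy_traj, sigma):
    """Stand-in for the learned conditional denoiser: pulls the sample
    toward its nearest anchor, proportionally to the noise level."""
    dists = ((noisy_traj[None] - anchors) ** 2).sum(axis=(1, 2))
    target = anchors[dists.argmin()]
    return noisy_traj + (target - noisy_traj) * min(1.0, 1.0 / (1.0 + sigma))

def truncated_sample(num_samples=6, sigmas=(0.5, 0.1)):
    """Few-step denoising starting from anchor-centered Gaussians."""
    samples = []
    for _ in range(num_samples):
        mode = rng.integers(len(anchors))
        traj = anchors[mode] + rng.normal(scale=sigmas[0], size=(HORIZON, 2))
        for sigma in sigmas:                 # truncated schedule: 2 steps
            traj = denoiser(traj, sigma)
        samples.append(traj)
    return np.stack(samples)

trajs = truncated_sample()
print("sampled trajectories:", trajs.shape)   # (6, 8, 2)
```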
By Zhiyang Guo, Jinxu Xiang, Kai Ma, Wengang Zhou, Houqiang Li, Ran Zhang
3D characters are essential to modern creative industries, but making them
animatable often demands extensive manual work in tasks like rigging and
skinning. Existing automatic rigging tools face several limitations, including
the necessity for manual annotations, rigid skeleton topologies, and limited
generalization across diverse shapes and poses. An alternative approach is to
generate animatable avatars pre-bound to a rigged template mesh. However, this
method often lacks flexibility and is typically limited to realistic human
shapes. To address these issues, we present Make-It-Animatable, a novel
data-driven method to make any 3D humanoid model ready for character animation
in less than one second, regardless of its shape and pose. Our unified
framework generates high-quality blend weights, bones, and pose
transformations. By incorporating a particle-based shape autoencoder, our
approach supports various 3D representations, including meshes and 3D Gaussian
splats. Additionally, we employ a coarse-to-fine representation and a
structure-aware modeling strategy to ensure both accuracy and robustness, even
for characters with non-standard skeleton structures. We conducted extensive
experiments to validate our framework's effectiveness. Compared to existing
methods, our approach demonstrates significant improvements in both quality and
speed.
By Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
Human pose plays a crucial role in the digital age. While recent works have
achieved impressive progress in understanding and generating human poses, they
often support only a single modality of control signals and operate in
isolation, limiting their application in real-world scenarios. This paper
presents UniPose, a framework employing Large Language Models (LLMs) to
comprehend, generate, and edit human poses across various modalities, including
images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to
convert 3D poses into discrete pose tokens, enabling seamless integration into
the LLM within a unified vocabulary. To further enhance fine-grained pose
perception, we equip UniPose with a mixture of visual encoders, including a
pose-specific visual encoder. Benefiting from a unified
learning strategy, UniPose effectively transfers knowledge across different
pose-relevant tasks, adapts to unseen tasks, and exhibits extended
capabilities. This work serves as the first attempt at building a
general-purpose framework for pose comprehension, generation, and editing.
Extensive experiments highlight UniPose's competitive and even superior
performance across various pose-relevant tasks.
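A minimal sketch of the pose-tokenizer idea (the codebook below is random, not UniPose's trained VQ model): continuous SMPL pose parameters are quantized against a codebook, and each code index becomes a discrete token such as "<pose_17>" that the LLM can read and emit like any other vocabulary item.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE, CODE_DIM = 256, 72          # 72 = 24 joints * 3 axis-angle dims
codebook = rng.normal(size=(CODEBOOK_SIZE, CODE_DIM))  # stand-in for learned codes

def pose_to_token(pose_vec):
    """Quantize a 72-D pose vector to its nearest codebook entry."""
    idx = int(((codebook - pose_vec) ** 2).sum(axis=1).argmin())
    return f"<pose_{idx}>"

def token_to_pose(token):
    """Decode a pose token back to its (quantized) pose vector."""
    idx = int(token.strip("<>").split("_")[1])
    return codebook[idx]

pose = rng.normal(size=CODE_DIM)           # a toy SMPL pose
tok = pose_to_token(pose)
recon = token_to_pose(tok)
print(tok, "quantization error:", np.linalg.norm(pose - recon))
```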
By Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang
In the rapidly advancing field of image generation, Visual Auto-Regressive
(VAR) modeling has garnered considerable attention for its innovative
next-scale prediction approach. This paradigm offers substantial improvements
in efficiency, scalability, and zero-shot generalization. Yet, the inherently
coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to
prohibitive memory consumption and computational redundancies. To address these
bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient
decoding strategy tailored for the VAR framework. CoDe capitalizes on two
critical observations: the substantially reduced parameter demands at larger
scales and the exclusive generation patterns across different scales. Based on
these insights, we partition the multi-scale inference process into a seamless
collaboration between a large model and a small model. The large model serves
as the 'drafter', specializing in generating low-frequency content at smaller
scales, while the smaller model serves as the 'refiner', solely focusing on
predicting high-frequency details at larger scales. This collaboration yields
remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x
speedup, slashes memory usage by around 50%, and preserves image quality with
only a negligible FID increase from 1.95 to 1.98. When drafting steps are
further decreased, CoDe can achieve an impressive 2.9x acceleration ratio,
reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while
preserving a commendable FID of 2.27. The code is available at
https://github.com/czg1225/CoDe
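The drafter/refiner split can be pictured with the schematic below (stub models and scale lists, not the CoDe implementation): the large model predicts tokens only at the coarse scales, while the small model produces the fine scales, so most of the long next-scale token sequence comes from the cheaper network.

```python
# Schematic collaborative decoding over a coarse-to-fine scale schedule.
SCALES = [1, 2, 4, 8, 16]          # token-map side lengths, coarse to fine
DRAFT_SCALES = {1, 2, 4}           # scales handled by the large "drafter"

class StubVAR:
    """Stand-in for a VAR model: returns dummy tokens for a given scale."""
    def __init__(self, name):
        self.name = name
    def predict_scale(self, prefix_tokens, side):
        return [f"{self.name}:{side}x{side}:{i}" for i in range(side * side)]

def collaborative_decode(large_model, small_model):
    tokens = []
    for side in SCALES:
        # Drafter handles low-frequency coarse scales, refiner the fine ones.
        model = large_model if side in DRAFT_SCALES else small_model
        tokens += model.predict_scale(tokens, side)
    return tokens

out = collaborative_decode(StubVAR("drafter-L"), StubVAR("refiner-S"))
print(len(out), "tokens; last produced by", out[-1].split(":")[0])
```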
By Emanuele Aiello, Umberto Michieli, Diego Valsesia, Mete Ozay, Enrico Magli
Personalized image generation requires text-to-image generative models that
capture the core features of a reference subject to allow for controlled
generation across different contexts. Existing methods face challenges due to
complex training requirements, high inference costs, limited flexibility, or a
combination of these issues. In this paper, we introduce DreamCache, a scalable
approach for efficient and high-quality personalized image generation. By
caching a small number of reference image features from a subset of layers and
a single timestep of the pretrained diffusion denoiser, DreamCache enables
dynamic modulation of the generated image features through lightweight, trained
conditioning adapters. DreamCache achieves state-of-the-art image and text
alignment, utilizing an order of magnitude fewer extra parameters, and is both
more computationally efficient and versatile than existing models.
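Conceptually, the caching step can be sketched as follows (the layer indices, shapes, and denoiser stub are hypothetical, not the DreamCache code): the reference image is passed through the frozen denoiser once at a single timestep, activations from a chosen subset of layers are stored, and lightweight adapters read that cache to modulate the features of the image being generated, so the reference never needs to be re-encoded.

```python
import numpy as np

rng = np.random.default_rng(0)
CACHED_LAYERS = [2, 5, 8]          # subset of denoiser layers to cache

def frozen_denoiser_features(image, timestep):
    """Stand-in for one pass of the pretrained denoiser; returns per-layer features."""
    return {layer: rng.normal(size=(64, 32)) for layer in range(12)}

def build_cache(reference_image, timestep=250):
    feats = frozen_denoiser_features(reference_image, timestep)
    return {layer: feats[layer] for layer in CACHED_LAYERS}   # small, fixed cache

def adapter(generated_feat, cached_feat, scale=0.1):
    """Lightweight conditioning adapter: inject cached reference features."""
    return generated_feat + scale * cached_feat

cache = build_cache(reference_image=None)            # reference encoded once
gen_feat = rng.normal(size=(64, 32))                  # feature of the new image
modulated = adapter(gen_feat, cache[5])
print("cached layers:", sorted(cache), "| feature shape:", modulated.shape)
```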
By Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang
Perception and understanding are two pillars of computer vision. While
multimodal large language models (MLLM) have demonstrated remarkable visual
understanding capabilities, they arguably lack accurate perception abilities,
e.g., the state-of-the-art model Qwen2-VL achieves only a 43.9 recall rate on
the COCO dataset, which limits many tasks requiring the combination of perception
and understanding. In this work, we aim to bridge this perception gap from both
the model design and data development perspectives. We first introduce ChatRex,
an MLLM with a decoupled perception design. Instead of having the LLM directly
predict box coordinates, we feed the output boxes from a universal proposal
network into the LLM, allowing it to output the corresponding box indices to
represent its detection results, turning the regression task into a
retrieval-based task that the LLM handles more proficiently. From the data
perspective, we build a fully automated data engine and construct the
Rexverse-2M dataset which possesses multiple granularities to support the joint
training of perception and understanding. After standard two-stage training,
ChatRex demonstrates strong perception capabilities while preserving multimodal
understanding performance. The combination of these two capabilities
simultaneously unlocks many attractive applications, demonstrating the
complementary roles of both perception and understanding in MLLM. Code is
available at https://github.com/IDEA-Research/ChatRex.
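A toy illustration of the decoupled-perception idea (the prompt format and parsing are hypothetical, not ChatRex's exact interface): proposal boxes from a detector are enumerated as indexed placeholders in the prompt, the LLM answers with indices, and those indices are mapped back to coordinates. Detection thereby becomes retrieval over a fixed index set rather than coordinate regression.

```python
import re

proposals = [                       # (x1, y1, x2, y2) from a proposal network
    (12, 30, 120, 200),
    (140, 35, 260, 210),
    (40, 220, 90, 300),
]

def format_prompt(question, boxes):
    """Enumerate candidate boxes as indexed object tokens."""
    box_tokens = " ".join(f"<obj{i}>" for i in range(len(boxes)))
    return f"{question}\nCandidate objects: {box_tokens}"

def parse_answer(answer, boxes):
    """Map indices mentioned in the model's answer back to box coordinates."""
    idxs = [int(m) for m in re.findall(r"<obj(\d+)>", answer)]
    return [boxes[i] for i in idxs if i < len(boxes)]

prompt = format_prompt("Which objects are people?", proposals)
llm_answer = "<obj0> <obj2>"        # pretend output from the MLLM
print(prompt)
print("detected boxes:", parse_answer(llm_answer, proposals))
```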
By Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon
Generating sound effects for videos often requires creating artistic sound
effects that diverge significantly from real-life sources, as well as flexible control
over the sound design. To address this problem, we introduce MultiFoley, a model
designed for video-guided sound generation that supports multimodal
conditioning through text, audio, and video. Given a silent video and a text
prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels
spinning without wind noise) or more whimsical sounds (e.g., making a lion's
roar sound like a cat's meow). MultiFoley also allows users to choose reference
audio from sound effects (SFX) libraries or partial videos for conditioning. A
key novelty of our model lies in its joint training on both internet video
datasets with low-quality audio and professional SFX recordings, enabling
high-quality, full-bandwidth (48kHz) audio generation. Through automated
evaluations and human studies, we demonstrate that MultiFoley successfully
generates synchronized high-quality sounds across varied conditional inputs and
outperforms existing methods. Please see our project page for video results:
https://ificl.github.io/MultiFoley/
In this work, we introduce a single parameter, omega, to effectively
control granularity in diffusion-based synthesis. This parameter is
incorporated during the denoising steps of the diffusion model's reverse
process. Our approach does not require model retraining, architectural
modifications, or additional computational overhead during inference, yet
enables precise control over the level of details in the generated outputs.
Moreover, spatial masks or denoising schedules with varying omega values can
be applied to achieve region-specific or timestep-specific granularity control.
Prior knowledge of image composition from control signals or reference images
further facilitates the creation of precise omega masks for granularity
control on specific objects. To highlight the parameter's role in controlling
subtle detail variations, the technique is named Omegance, combining "omega"
and "nuance". Our method demonstrates impressive performance across various
image and video synthesis tasks and is adaptable to advanced diffusion models.
The code is available at https://github.com/itsmag11/Omegance.
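The sketch below conveys how a single scaling parameter can steer granularity during denoising (this is a simplified DDPM-style update with made-up schedule values, not the exact Omegance formulation): the predicted noise is rescaled by omega at each reverse step, and a spatial omega mask applies different granularity to different regions.

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, omega):
    """One simplified denoising step with an omega-scaled noise prediction.
    omega may be a scalar or a per-pixel mask broadcastable to x_t."""
    eps_scaled = omega * eps_pred
    mean = (x_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps_scaled) / np.sqrt(alpha_t)
    return mean

x_t = rng.normal(size=(64, 64, 3))            # current noisy latent/image
eps_pred = rng.normal(size=(64, 64, 3))       # stand-in for the model's prediction

# Region-specific control: a different omega on the left half of the image.
omega_mask = np.ones((64, 64, 1))
omega_mask[:, :32] = 1.2

x_prev_global = reverse_step(x_t, eps_pred, alpha_t=0.98, alpha_bar_t=0.5, omega=0.9)
x_prev_masked = reverse_step(x_t, eps_pred, alpha_t=0.98, alpha_bar_t=0.5, omega=omega_mask)
print(x_prev_global.shape, x_prev_masked.shape)
```

Because omega only rescales the already-predicted noise, the adjustment costs nothing extra at inference and requires no retraining, matching the training-free claim above.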
Speculative Decoding (SD) has become an important technique in accelerating
the inference speed of large language models. Conventional SD methods employ a
fixed draft length, which ignores the token generation difficulty across tasks.
In this paper, we address this issue and introduce SVIP, a
difficulty-aware dynamic draft length policy for speculative decoding systems.
Based on a theoretical lower bound on the draft token acceptance rate and its
inference-time approximation, SVIP adaptively determines the length of each draft
sequence from the entropy of the draft token distributions. Experimental
results on mainstream SD benchmarks and frameworks demonstrate the superior
performance of SVIP, achieving up to 20% walltime speedup on SpecBench over
baseline SD methods and 60% speedup on MT-Bench for long-form generation of up
to 8K tokens. Moreover, SVIP is entirely training-free and compatible with any
existing SD methods that generate draft tokens autoregressively. Experimental
results also show that SVIP yields consistent walltime improvement on top of
GliDe & CaPE and EAGLE-2.
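A toy version of an entropy-based dynamic draft length is shown below (the stopping rule is a simplification of SVIP's bound, and the draft model is a stub): the draft model keeps proposing tokens while its predictive entropy stays low, since low-entropy tokens are more likely to be accepted by the verifier, and it stops early once entropy rises.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 1000

def draft_model_step(prefix):
    """Stand-in draft model: returns a next-token probability distribution."""
    logits = rng.normal(size=VOCAB) * rng.uniform(0.5, 3.0)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def draft_tokens(prefix, max_draft=8, entropy_threshold=4.0):
    """Draft autoregressively until the per-token entropy exceeds the threshold."""
    drafted = []
    for _ in range(max_draft):
        probs = draft_model_step(prefix + drafted)
        entropy = -(probs * np.log(probs + 1e-12)).sum()
        if entropy > entropy_threshold:      # hard token: hand back to the verifier
            break
        drafted.append(int(probs.argmax()))
    return drafted

print("draft length this round:", len(draft_tokens(prefix=[1, 2, 3])))
```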
By Sarim Hashmi, Juan Lugo, Abdelrahman Elsayed, Dinesh Saggurthi, Mohammed Elseiagy, Alikhan Nurkamal, Jaskaran Walia, Fadillah Adamsyah Maani, Mohammad Yaqub
Identifying key pathological features in brain MRIs is crucial for the
long-term survival of glioma patients. However, manual segmentation is
time-consuming, requiring expert intervention and is susceptible to human
error. Therefore, significant research has been devoted to developing machine
learning methods that can accurately segment tumors in 3D multimodal brain MRI
scans. Despite their progress, state-of-the-art models are often limited by the
data they are trained on, raising concerns about their reliability when applied
to diverse populations that may introduce distribution shifts. Such shifts can
stem from lower quality MRI technology (e.g., in sub-Saharan Africa) or
variations in patient demographics (e.g., children). The BraTS-2024 challenge
provides a platform to address these issues. This study presents our
methodology for segmenting tumors in the BraTS-2024 SSA and Pediatric Tumors
tasks using MedNeXt, comprehensive model ensembling, and thorough
postprocessing. Our approach demonstrated strong performance on the unseen
validation set, achieving an average Dice Similarity Coefficient (DSC) of 0.896
on the BraTS-2024 SSA dataset and an average DSC of 0.830 on the BraTS
Pediatric Tumor dataset. Additionally, our method achieved an average Hausdorff
Distance (HD95) of 14.682 on the BraTS-2024 SSA dataset and an average HD95 of
37.508 on the BraTS Pediatric dataset. Our GitHub repository can be accessed
here:
https://github.com/python-arch/BioMbz-Optimizing-Brain-Tumor-Segmentation-with-MedNeXt-BraTS-2024-SSA-and-Pediatrics
Recent research on video large language models (VideoLLMs) predominantly
focuses on model architectures and training datasets, leaving the interaction
format between the user and the model under-explored. In existing works, users
often interact with VideoLLMs by using the entire video and a query as input,
after which the model generates a response. This interaction format constrains
the application of VideoLLMs in scenarios such as live-streaming comprehension
where videos do not end and responses are required in real time, and it
also results in unsatisfactory performance on time-sensitive tasks that
require localizing video segments. In this paper, we focus on a video-text
duet interaction format. This interaction format is characterized by the
continuous playback of the video, and both the user and the model can insert
their text messages at any position during the video playback. When a text
message ends, the video continues to play, akin to the alternation of two
performers in a duet. We construct MMDuetIT, a video-text training dataset
designed to adapt VideoLLMs to the video-text duet interaction format. We also
introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to
benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT,
MMDuet demonstrates that adopting the video-text duet interaction format
enables the model to achieve significant improvements in various time-sensitive
tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights
highlight detection, and 25% R@0.5 on Charades-STA temporal video grounding)
with minimal training effort, and also enables VideoLLMs to reply in
real time as the video plays. Code, data, and a demo are available at:
https://github.com/yellow-binary-tree/MMDuet.
By David Serrano-Lozano, Luis Herranz, Shaolin Su, Javier Vazquez-Corral
Blind all-in-one image restoration models aim to recover a high-quality image
from an input degraded with unknown distortions. However, these models require
all possible degradation types to be defined during training and show
limited generalization to unseen degradations, which limits their
practical application in complex cases. In this paper, we propose a simple but
effective adaptive blind all-in-one restoration (ABAIR) model, which can
address multiple degradations, generalize well to unseen degradations, and
efficiently incorporate new degradations by training only a small fraction of
parameters. First, we train our baseline model on a large dataset of natural
images with multiple synthetic degradations, augmented with a segmentation head
to estimate per-pixel degradation types, resulting in a powerful backbone able
to generalize to a wide range of degradations. Second, we adapt our baseline
model to varying image restoration tasks using independent low-rank adapters.
Third, we learn to adaptively combine adapters for diverse images via a
flexible and lightweight degradation estimator. Our model is both powerful in
handling specific distortions and flexible in adapting to complex tasks: it not
only outperforms the state-of-the-art by a large margin on five- and three-task
IR setups, but also shows improved generalization to unseen degradations and
composite distortions.
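The adaptive combination of adapters can be sketched as below (the weights, shapes, and estimator are illustrative stubs, not the ABAIR implementation): a lightweight degradation estimator predicts a soft mixing weight per degradation-specific low-rank adapter, and the weighted low-rank updates are added on top of the frozen baseline weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, RANK = 128, 128, 8
DEGRADATIONS = ["noise", "blur", "rain"]

W_base = rng.normal(size=(D_OUT, D_IN)) * 0.02             # frozen baseline layer
adapters = {name: (rng.normal(size=(D_OUT, RANK)) * 0.02,  # low-rank factor A
                   rng.normal(size=(RANK, D_IN)) * 0.02)   # low-rank factor B
            for name in DEGRADATIONS}

def degradation_estimator(image):
    """Stand-in estimator: returns soft weights over the known degradations."""
    logits = rng.normal(size=len(DEGRADATIONS))
    w = np.exp(logits - logits.max())
    return dict(zip(DEGRADATIONS, w / w.sum()))

def combined_layer(x, image):
    """Apply the baseline weights plus a weighted blend of low-rank adapters."""
    weights = degradation_estimator(image)
    W = W_base.copy()
    for name, (A, B) in adapters.items():
        W += weights[name] * (A @ B)                        # weighted low-rank update
    return W @ x

x = rng.normal(size=D_IN)
print("output feature shape:", combined_layer(x, image=None).shape)
```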
The rapid advancement of large language models (LLMs) such as GPT-3, PaLM,
and Llama has significantly transformed natural language processing, showcasing
remarkable capabilities in understanding and generating language. However,
these models often struggle with tasks requiring complex reasoning,
particularly in mathematical problem-solving, due in part to the scarcity of
large-scale, high-quality, domain-specific datasets necessary for training
sophisticated reasoning abilities. To address this limitation, we introduce
Template-based Data Generation (TDG), a novel approach that leverages LLMs
(GPT-4) to automatically generate parameterized meta-templates, which are then
used to synthesize a vast array of high-quality problems and solutions.
Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset
comprising over 7 million synthetically generated grade school math
problems--each accompanied by code-based and natural language solutions--with
the potential to generate an effectively unlimited number more. This dataset
alleviates the scarcity of large-scale mathematical datasets and serves as a
valuable resource for pre-training, fine-tuning, and evaluating LLMs in
mathematical reasoning. Our method not only enables the generation of virtually
infinite data but also elevates data augmentation to a new level by using GPT-4
for meta-template generation, ensuring diverse and high-quality problem
structures. The TemplateMath Part I: TemplateGSM dataset is publicly available
at https://huggingface.co/datasets/math-ai/TemplateGSM. The code is available
at https://github.com/iiis-ai/TemplateMath.
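A toy example of template-based data generation (the template itself is invented for illustration and is not drawn from TemplateGSM): a parameterized meta-template is instantiated with random values to yield a problem, a natural-language solution, and a code-based solution that can verify the answer.

```python
import random

def shopping_template(rng):
    """One parameterized meta-template; each call yields a new problem instance."""
    name = rng.choice(["Ava", "Ben", "Maria"])
    n_items, price = rng.randint(3, 9), rng.randint(2, 15)
    total = n_items * price
    problem = (f"{name} buys {n_items} notebooks that cost ${price} each. "
               f"How much does {name} spend in total?")
    nl_solution = (f"Each notebook costs ${price} and there are {n_items} of them, "
                   f"so the total is {n_items} * {price} = ${total}.")
    code_solution = f"def solution():\n    return {n_items} * {price}"
    return {"problem": problem, "answer": total,
            "nl_solution": nl_solution, "code_solution": code_solution}

rng = random.Random(0)
for sample in (shopping_template(rng) for _ in range(2)):
    print(sample["problem"])
    print(" ->", sample["answer"])
```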
Recent advancements in diffusion models have made generative image editing
more accessible, enabling creative edits but raising ethical concerns,
particularly regarding malicious edits to human portraits that threaten privacy
and identity security. Existing protection methods primarily rely on
adversarial perturbations to nullify edits but often fail against diverse
editing requests. We propose FaceLock, a novel approach to portrait protection
that optimizes adversarial perturbations to destroy or significantly alter
biometric information, rendering edited outputs biometrically unrecognizable.
FaceLock integrates facial recognition and visual perception into perturbation
optimization to provide robust protection against various editing attempts. We
also highlight flaws in commonly used evaluation metrics and reveal how they
can be manipulated, emphasizing the need for reliable assessments of
protection. Experiments show FaceLock outperforms baselines in defending
against malicious edits and is robust against purification techniques. Ablation
studies confirm its stability and broad applicability across diffusion-based
editing algorithms. Our work advances biometric defense and sets the foundation
for privacy-preserving practices in image editing. The code is available at:
https://github.com/taco-group/FaceLock.
By Zhuo Li, Mingshuang Luo, Ruibing Hou, Xin Zhao, Hao Liu, Hong Chang, Zimo Liu, Chen Li
Human motion generation plays a vital role in applications such as digital
humans and humanoid robot control. However, most existing approaches disregard
physics constraints, leading to the frequent production of physically
implausible motions with pronounced artifacts such as floating and foot
sliding. In this paper, we propose Morph, a Motion-free physics optimization
framework comprising a Motion Generator and a Motion Physics Refinement module,
for enhancing physical plausibility without relying on costly real-world motion
data. Specifically, the Motion Generator is responsible for providing
large-scale synthetic motion data, while the Motion Physics Refinement Module
utilizes these synthetic data to train a motion imitator within a physics
simulator, enforcing physical constraints to project the noisy motions into a
physically-plausible space. These physically refined motions, in turn, are used
to fine-tune the Motion Generator, further enhancing its capability.
Experiments on both text-to-motion and music-to-dance generation tasks
demonstrate that our framework achieves state-of-the-art motion generation
quality while improving physical plausibility drastically.