By Trung Dao, Thuan Hoang Nguyen, Thanh Le, Duc Vu, Khoi Nguyen, Cuong Pham, Anh Tran
In this paper, we aim to enhance the performance of SwiftBrush, a prominent
one-step text-to-image diffusion model, to be competitive with its multi-step
Stable Diffusion counterpart. Initially, we explore the quality-diversity
trade-off between SwiftBrush and SD Turbo: the former excels in image
diversity, while the latter excels in image quality. This observation motivates
our proposed modifications in the training methodology, including better weight
initialization and efficient LoRA training. Moreover, our introduction of a
novel clamped CLIP loss enhances image-text alignment and results in improved
image quality. Remarkably, by combining the weights of models trained with
efficient LoRA and full training, we obtain a new state-of-the-art one-step
diffusion model, achieving an FID of 8.14 and surpassing all GAN-based and
multi-step Stable Diffusion models. The evaluation code is available at:
https://github.com/vinairesearch/swiftbrushv2.
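The abstract does not give the exact form of the clamped CLIP loss; one plausible reading is a hinge-style clamp that penalizes an image-text pair only while its CLIP similarity is below a threshold, so well-aligned samples stop contributing gradient. A minimal numpy sketch under that assumption (the threshold value and cosine-similarity setup are illustrative, not the paper's):

```python
import numpy as np

def clamped_clip_loss(img_emb, txt_emb, clamp=0.35):
    """Hinge-style alignment loss: penalize a pair only while its
    image-text cosine similarity is below `clamp` (an assumed form,
    not necessarily the paper's exact definition)."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    sim = np.sum(img * txt, axis=-1)            # cosine similarity per pair
    return np.maximum(0.0, clamp - sim).mean()  # zero once aligned "enough"

# A perfectly aligned pair incurs zero loss; an orthogonal pair incurs `clamp`.
a = np.array([[1.0, 0.0]])
print(clamped_clip_loss(a, a))                       # 0.0
print(clamped_clip_loss(a, np.array([[0.0, 1.0]])))  # 0.35
```

Clamping caps the alignment pressure, which is one way such a loss could improve text faithfulness without collapsing image diversity.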
By Yinghao Ma, Anders Øland, Anton Ragni, Bleiz MacSen Del Sette, Charalampos Saitis, Chris Donahue, Chenghua Lin, Christos Plachouras, Emmanouil Benetos, Elio Quinton, Elona Shatri, Fabio Morreale, Ge Zhang, György Fazekas, Gus Xia, Huan Zhang, Ilaria Manco, Jiawen Huang, Julien Guinot, Liwei Lin, Luca Marinelli, Max W. Y. Lam, Megha Sharma, Qiuqiang Kong, Roger B. Dannenberg, Ruibin Yuan, Shangda Wu, Shih-Lun Wu, Shuqi Dai, Shun Lei, Shiyin Kang, Simon Dixon, Wenhu Chen, Wehhao Huang, Xingjian Du, Xingwei Qu, Xu Tan, Yizhi Li, Zeyue Tian, Zhiyong Wu, Zhizheng Wu, Ziyang Ma, Ziyu Wang
In recent years, foundation models (FMs) such as large language models (LLMs)
and latent diffusion models (LDMs) have profoundly impacted diverse sectors,
including music. This comprehensive review examines state-of-the-art (SOTA)
pre-trained and foundation models in music, spanning representation learning,
generative learning, and multimodal learning. We first contextualise
the significance of music in various industries and trace the evolution of AI
in music. By delineating the modalities targeted by foundation models, we find
that many music representations remain underexplored in FM development. We then
highlight the limited versatility of previous methods across diverse music
applications, together with the potential of FMs in music understanding,
generation, and medical applications. By comprehensively exploring
the details of the model pre-training paradigm, architectural choices,
tokenisation, fine-tuning methodologies and controllability, we highlight
important topics that warrant further exploration, such as instruction tuning
and in-context learning, scaling laws and emergent abilities, and long-sequence
modelling. A dedicated section presents insights into music
agents, accompanied by a thorough analysis of datasets and evaluations
essential for pre-training and downstream tasks. Finally, underscoring the
importance of ethical considerations, we advocate that future research on FMs
for music should focus more on issues such as interpretability, transparency,
human responsibility, and copyright. The paper offers
insights into future challenges and trends on FMs for music, aiming to shape
the trajectory of human-AI collaboration in the music realm.
By Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang
GitHub issue resolving is a critical task in software engineering, recently
gaining significant attention in both industry and academia. Within this task,
SWE-bench has been released to evaluate issue resolving capabilities of large
language models (LLMs), but has so far focused only on the Python version.
However, supporting more programming languages is also important, given the
strong demand in industry. As a first step toward multilingual support, we have
developed a Java version of SWE-bench, called SWE-bench-java. We have publicly
released the dataset, along with the corresponding Docker-based evaluation
environment and leaderboard, which will be continuously maintained and updated
in the coming months. To verify the reliability of SWE-bench-java, we implement
the classic SWE-agent method and test several powerful LLMs on it. As is well
known, developing a high-quality multilingual benchmark is time-consuming and
labor-intensive, so we welcome contributions through pull requests or
collaboration to accelerate its iteration and refinement, paving the way for
fully automated programming.
The rapid advancement of visual generative models necessitates efficient and
reliable evaluation methods. The Arena platform, which gathers user votes on
model comparisons, can rank models by human preference. However, traditional
Arena
methods, while established, require an excessive number of comparisons for
ranking to converge and are vulnerable to preference noise in voting,
suggesting the need for better approaches tailored to contemporary evaluation
challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable
platform based on a key insight: images and videos possess higher perceptual
intuitiveness than texts, enabling rapid evaluation of multiple samples
simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing
K models to engage in free-for-all competitions, which yield much richer
information than pairwise comparisons. To enhance the robustness of the system,
we leverage probabilistic modeling and Bayesian updating techniques. We propose
an exploration-exploitation-based matchmaking strategy to facilitate more
informative comparisons. In our experiments, K-Sort Arena exhibits 16.3x faster
convergence compared to the widely used ELO algorithm. To further validate the
superiority and obtain a comprehensive leaderboard, we collect human feedback
via crowdsourced evaluations of numerous cutting-edge text-to-image and
text-to-video models. Thanks to its high efficiency, K-Sort Arena can
continuously incorporate emerging models and update the leaderboard with
minimal votes. Our project has undergone several months of internal testing and
is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena
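The paper's point about K-wise information can be made concrete: one free-for-all among K models implies K(K-1)/2 pairwise outcomes. Below is a deliberately simplified sketch that decomposes an observed K-wise ranking into pairwise results and applies plain Elo updates; the actual K-Sort Arena uses probabilistic modeling and Bayesian updating rather than Elo, and the model names are placeholders:

```python
import itertools

def elo_update_kwise(ratings, ranking, k_factor=32):
    """Decompose one K-wise free-for-all (ranking is best-to-worst)
    into all pairwise outcomes and apply standard Elo updates. A
    simplification of the paper's Bayesian formulation."""
    for winner, loser in itertools.combinations(ranking, 2):
        # Expected score of `winner` against `loser` under the Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        delta = k_factor * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
# A single K=3 battle yields 3 pairwise results instead of 1.
elo_update_kwise(ratings, ["model_b", "model_a", "model_c"])
print(ratings["model_b"] > ratings["model_a"] > ratings["model_c"])  # True
```

This quadratic gain in outcomes per vote is the intuition behind the reported faster convergence versus one pairwise comparison per vote.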
By Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jing Tang, Sunghun Kim
The widespread adoption of cloud-based proprietary large language models
(LLMs) has introduced significant challenges, including operational
dependencies, privacy concerns, and the necessity of continuous internet
connectivity. In this work, we introduce an LLMOps pipeline, "LlamaDuo", for
the seamless migration of knowledge and abilities from service-oriented LLMs to
smaller, locally manageable models. This pipeline is crucial for ensuring
service continuity in the presence of operational failures, strict privacy
policies, or offline requirements. Our LlamaDuo involves fine-tuning a small
language model against the service LLM using a synthetic dataset generated by
the latter. If the performance of the fine-tuned model falls short of
expectations, it is enhanced by further fine-tuning with additional similar
data created by the service LLM. This iterative process guarantees that the
smaller model can eventually match or even surpass the service LLM's
capabilities in specific downstream tasks, offering a practical and scalable
solution for managing AI deployments in constrained environments. Extensive
experiments with leading-edge LLMs are conducted to demonstrate the
effectiveness, adaptability, and affordability of LlamaDuo across various
downstream tasks. Our pipeline implementation is available at
https://github.com/deep-diver/llamaduo.
By Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda
Finding the optimal learning rate for language model pretraining is a
challenging task. This is not only because there is a complicated correlation
between learning rate, batch size, number of training tokens, model size, and
other hyperparameters but also because it is prohibitively expensive to perform
a hyperparameter search for large language models with billions or trillions of
parameters. Recent studies propose using small proxy models and small corpora
to perform hyperparameter searches and transferring the optimal parameters to
large models and large corpora. While zero-shot transferability has been
theoretically and empirically proven for model-size-related hyperparameters,
such as depth and width, zero-shot transfer from small to large corpora is
underexplored. In this paper, we study the correlation between optimal learning
rate, batch size, and number of training tokens for the recently proposed WSD
scheduler. After thousands of small experiments, we found a power-law
relationship among these variables and demonstrated its transferability across
model sizes. Based on this observation, we propose a new learning rate
scheduler, the Power scheduler, which is agnostic to the number of training
tokens and batch size. Our experiments show that combining the Power scheduler
with Maximum Update Parameterization (muP) can consistently achieve impressive
performance with one set of hyperparameters regardless of the number of
training tokens, batch size, model size, and even model architecture. Our 3B
dense and MoE models trained with the Power scheduler achieve comparable
performance as state-of-the-art small language models. We open-source these
pretrained models at https://ibm.biz/BdKhLa.
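A power-law schedule's token-count agnosticism can be illustrated directly: the learning rate at step n depends only on n, never on the planned total. The sketch below uses placeholder coefficients, not the paper's fitted power-law values, and omits the scheduler's final decay phase:

```python
def power_lr(step, a=0.1, b=0.51, warmup_steps=100):
    """Power-law learning-rate schedule sketch: lr ~ a * step^(-b)
    after a linear warmup into the curve. Because the rate depends only
    on the current step, the schedule needs no total-token budget.
    a and b are illustrative placeholders, and the actual Power
    scheduler's final decay phase is omitted."""
    target = a * max(step, warmup_steps) ** (-b)  # power-law curve value
    if step < warmup_steps:
        return target * step / warmup_steps       # linear warmup into the curve
    return target

# Monotone power-law decay after warmup; the value at any step is fixed
# regardless of whether training runs 20k or 200k steps in total.
print(power_lr(200) > power_lr(2000) > power_lr(20000))  # True
```

This is what lets one hyperparameter set serve any number of training tokens: extending training simply continues the same curve.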
By Wenhao Li, Yichao Cao, Xie Su, Xi Lin, Shan You, Mingkai Zheng, Yi Chen, Chang Xu
Video generation models hold substantial potential in areas such as
filmmaking. However, current video diffusion models incur high computational
costs and produce suboptimal results due to the high complexity of the video
generation task. In this paper, we propose ConFiner, an efficient high-quality
video generation framework that decouples video generation into easier
subtasks: structure control and spatial-temporal refinement.
It generates high-quality videos with a chain of off-the-shelf diffusion model
experts, each responsible for a decoupled subtask. During the
refinement, we introduce coordinated denoising, which can merge multiple
diffusion experts' capabilities into a single sampling process. Furthermore,
we design the ConFiner-Long framework, which can generate long, coherent videos
with three
constraint strategies on ConFiner. Experimental results indicate that with only
10% of the inference cost, our ConFiner surpasses representative models like
Lavie and Modelscope across all objective and subjective metrics. Moreover,
ConFiner-Long can generate high-quality, coherent videos of up to 600
frames.
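The abstract does not specify how coordinated denoising merges the experts; one toy reading is routing each denoising step to a different off-the-shelf expert within a single sampling trajectory. The sketch below uses that assumption, with stub experts standing in for real diffusion models:

```python
def coordinated_denoise(x, experts, n_steps=12):
    """Toy 'coordinated denoising': route each denoising step to one
    expert in round-robin fashion, so several experts contribute within
    a single sampling run. The real ConFiner coordination rule is not
    given in the abstract; this is an illustrative assumption."""
    for step in range(n_steps):
        expert = experts[step % len(experts)]  # pick the expert for this step
        x = expert(x, step)
    return x

# Stub experts: each applies its own simple update to the sample.
structure_expert = lambda x, t: x * 0.5  # coarse structure control
spatial_refiner  = lambda x, t: x + 0.1  # spatial detail
temporal_refiner = lambda x, t: x        # temporal smoothing (no-op stub)

out = coordinated_denoise(1.0, [structure_expert, spatial_refiner,
                                temporal_refiner])
print(0.2 < out < 0.3)  # True: all three experts shaped one trajectory
```

The point of the sketch is only the control flow: one sampling loop, several specialists, no retraining.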
In multiplayer, first-person shooter games like Counter-Strike: Global
Offensive (CS:GO), coordinated movement is a critical component of high-level
strategic play. However, the complexity of team coordination and the variety of
conditions present in popular game maps make it impractical to author
hand-crafted movement policies for every scenario. We show that it is possible
to take a data-driven approach to creating human-like movement controllers for
CS:GO. We curate a team movement dataset comprising 123 hours of professional
game play traces, and use this dataset to train a transformer-based movement
model that generates human-like team movement for all players in a "Retakes"
round of the game. Importantly, the movement prediction model is efficient.
Performing inference for all players takes less than 0.5 ms per game step
(amortized cost) on a single CPU core, making it plausible for use in
commercial games today. Human evaluators assess that our model behaves more
like humans than both commercially-available bots and procedural movement
controllers scripted by experts (16% to 59% higher by TrueSkill rating of
"human-like"). Using experiments involving in-game bot vs. bot self-play, we
demonstrate that our model performs simple forms of teamwork, makes fewer
common movement mistakes, and yields movement distributions, player lifetimes,
and kill locations similar to those observed in professional CS:GO match play.
The increasing usage of Large Language Models (LLMs) has resulted in a
surging demand for planet-scale serving systems, where tens of thousands of
GPUs continuously serve hundreds of millions of users. Consequently, throughput
(under reasonable latency constraints) has emerged as a key metric that
determines serving systems' performance. To boost throughput, various methods
of inter-device parallelism (e.g., data, tensor, pipeline) have been explored.
However, existing methods do not consider overlapping the utilization of
different resources within a single device, leading to underutilization and
sub-optimal performance.
We propose NanoFlow, a novel serving framework that exploits intra-device
parallelism, which overlaps the usage of resources including compute, memory,
and network within a single device through operation co-scheduling. To exploit
intra-device parallelism, NanoFlow introduces two key innovations: First,
NanoFlow splits requests into nano-batches at the granularity of operations,
which breaks the dependency of sequential operations in LLM inference and
enables overlapping; then, to get benefit from overlapping, NanoFlow uses an
operation-level pipeline with execution unit scheduling, which partitions the
device's functional units and simultaneously executes different operations in
each unit. NanoFlow automates the pipeline setup using a parameter search
algorithm, which enables easily porting NanoFlow to different models. We
implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on
several popular models, including LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B.
With practical workloads, NanoFlow provides a 1.91x throughput boost over
state-of-the-art serving systems, achieving 59% to 72% of optimal throughput
across the ported models.
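The nano-batch pipeline described above follows the classic staggered-stage pattern: split one batch into nano-batches and shift each one stage later per step, so compute, memory, and network are busy simultaneously in the steady state. A toy scheduling sketch (the stage names and granularity are simplifications of NanoFlow's operation-level pipeline):

```python
def nanoflow_schedule(batch, n_nano=4, stages=("compute", "memory", "network")):
    """Toy nano-batch pipelining: split one batch into nano-batches and
    stagger them through resource stages. A simplification of NanoFlow's
    operation-level pipeline with execution unit scheduling."""
    size = -(-len(batch) // n_nano)  # ceil division for nano-batch size
    nanos = [batch[i:i + size] for i in range(0, len(batch), size)]
    timeline = []
    # Classic pipeline schedule: nano-batch i occupies stage (step - i).
    for step in range(len(nanos) + len(stages) - 1):
        active = [(stages[step - i], i) for i in range(len(nanos))
                  if 0 <= step - i < len(stages)]
        timeline.append(active)
    return timeline

tl = nanoflow_schedule(list(range(8)))
# In the steady state all three resources serve different nano-batches at once.
print(max(len(active) for active in tl))  # 3
```

Without the split, each stage would sit idle while the others ran; the stagger is what converts sequential resource use into overlap.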
Multimodal Large Language Models (MM-LLMs) have seen significant advancements
in the last year, demonstrating impressive performance across tasks. However,
to truly democratize AI, models must exhibit strong capabilities and be able to
run efficiently on small compute footprints accessible to most. As part of
this quest, we introduce LLaVaOLMoBitnet1B, the first ternary multimodal LLM
capable of accepting Image(s)+Text inputs to produce coherent textual
responses. The model is fully open-sourced along with training scripts to
encourage further research in this space. This accompanying technical report
highlights the training process, evaluation details, challenges associated with
ternary models and future opportunities. Link to the model:
https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B
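The report does not restate the quantization recipe here; ternary LLMs in the BitNet b1.58 lineage commonly use absmean quantization, which scales weights by their mean absolute value and rounds each to {-1, 0, +1}. A minimal sketch under that assumption:

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean ternary quantization (BitNet b1.58 style, assumed here):
    scale by the mean absolute weight, then round and clip each weight
    to the three values {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps           # per-tensor scale
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes
    return q, scale                          # reconstruct with q * scale

w = np.array([0.9, -0.05, 0.4, -1.2])
q, s = ternary_quantize(w)
print(q)  # every weight collapses to -1, 0, or +1 (times the scale s)
```

Storing ~1.58 bits per weight instead of 16 is what makes the small-footprint deployment the abstract targets plausible.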
Large language models (LLMs) have revolutionized language processing,
delivering outstanding results across multiple applications. However, deploying
LLMs on edge devices poses several challenges with respect to memory, energy,
and compute costs, limiting their widespread use in devices such as mobile
phones. A promising solution is to reduce the number of bits used to represent
weights and activations. While existing works have found partial success at
quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations
below 16 bits often leads to large computational overheads due to poor
on-device quantization support, or a considerable accuracy drop. Yet, 8-bit
activations are very attractive for on-device deployment as they would enable
LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units
(NPUs). In this work, we make a first attempt to facilitate the on-device
deployment of LLMs using integer-only quantization. We first investigate the
limitations of existing quantization methods for on-device deployment, with a
special focus on activation quantization. We then address these limitations by
introducing a simple post-training quantization method, named MobileQuant, that
extends previous weight equivalent transformation works by jointly optimizing
the weight transformation and activation range parameters in an end-to-end
manner. MobileQuant demonstrates superior capabilities over existing methods by
1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2)
reducing latency and energy consumption by 20%-50% compared to current
on-device quantization strategies, 3) requiring limited compute budget, 4)
being compatible with mobile-friendly compute units, e.g. NPUs.
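The activation-range optimization above can be illustrated with a much simpler stand-in: pick a per-tensor 8-bit clipping range by grid search over quantization MSE. MobileQuant instead learns the range jointly with a weight transformation, end to end; this sketch only shows why the range is a tunable parameter at all:

```python
import numpy as np

def quantize_act(x, rng):
    """Symmetric 8-bit fake-quantization with clipping range rng."""
    scale = rng / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale  # dequantized ("fake quant") values

def search_act_range(x, n_grid=100):
    """Pick the clipping range minimizing quantization MSE via grid
    search, a simplification of MobileQuant's end-to-end optimization."""
    best_rng, best_err = None, np.inf
    max_abs = np.abs(x).max()
    for i in range(1, n_grid + 1):
        rng = max_abs * i / n_grid
        err = np.mean((x - quantize_act(x, rng)) ** 2)
        if err < best_err:
            best_rng, best_err = rng, err
    return best_rng

gen = np.random.default_rng(0)
acts = gen.normal(size=10_000)           # toy activation tensor
r = search_act_range(acts)
err_best = np.mean((acts - quantize_act(acts, r)) ** 2)
err_full = np.mean((acts - quantize_act(acts, np.abs(acts).max())) ** 2)
print(err_best <= err_full)  # True: full range is one of the grid points
```

Choosing the range by reconstruction error rather than by the raw max is the basic idea that end-to-end range learning then generalizes across layers.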
Transition videos play a crucial role in media production, enhancing the flow
and coherence of visual narratives. Traditional methods like morphing often
lack artistic appeal and require specialized skills, limiting their
effectiveness. Recent advances in diffusion model-based video generation offer
new possibilities for creating transitions but face challenges such as poor
inter-frame relationship modeling and abrupt content changes. We propose a
novel training-free Transition Video Generation (TVG) approach using
video-level diffusion models that addresses these limitations without
additional training. Our method leverages Gaussian Process Regression
(GPR) to model latent representations, ensuring smooth and dynamic
transitions between frames. Additionally, we introduce interpolation-based
conditional controls and a Frequency-aware Bidirectional Fusion (FBiF)
architecture to enhance temporal control and transition reliability.
Evaluations on benchmark datasets and custom image pairs demonstrate the
effectiveness of our approach in generating high-quality, smooth transition
videos. The code is available at https://sobeymil.github.io/tvg.com.
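The GPR idea above can be shown in miniature: condition a Gaussian process on the two endpoint latents and read off its posterior mean at intermediate times, which interpolates the endpoints smoothly. The kernel choice and conditioning setup below are assumptions for illustration, not the paper's:

```python
import numpy as np

def rbf(a, b, length=0.5):
    """Squared-exponential kernel between 1-D time points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * length ** 2))

def gpr_transition(z0, z1, n_frames=8, noise=1e-6):
    """GP-posterior-mean interpolation between two frame latents.
    A toy rendering of 'GPR over latent representations'; the paper's
    kernel and conditioning choices are not given in the abstract."""
    t_obs = np.array([0.0, 1.0])             # endpoint frames are observed
    t_new = np.linspace(0.0, 1.0, n_frames)  # intermediate frame times
    K = rbf(t_obs, t_obs) + noise * np.eye(2)
    Ks = rbf(t_new, t_obs)
    # Posterior mean: Ks K^{-1} y, with the two latents as observations.
    mean = Ks @ np.linalg.solve(K, np.stack([z0, z1]))
    return mean  # (n_frames, latent_dim): a smooth path from z0 to z1

z0, z1 = np.zeros(4), np.ones(4)
frames = gpr_transition(z0, z1)
print(np.allclose(frames[0], z0, atol=1e-4),
      np.allclose(frames[-1], z1, atol=1e-4))  # True True
```

Because the GP mean passes through its observations, the transition is anchored at both endpoint frames while varying smoothly in between.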
By Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu
Large language models (LLMs) like ChatGPT and Gemini have significantly
advanced natural language processing, enabling various applications such as
chatbots and automated content generation. However, these models can be
exploited by malicious individuals who craft toxic prompts to elicit harmful or
unethical responses. These individuals often employ jailbreaking techniques to
bypass safety mechanisms, highlighting the need for robust toxic prompt
detection methods. Existing detection techniques, both blackbox and whitebox,
face challenges related to the diversity of toxic prompts, scalability, and
computational efficiency. In response, we propose ToxicDetector, a lightweight
greybox method designed to efficiently detect toxic prompts in LLMs.
ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding
vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP)
classifier for prompt classification. Our evaluation on various versions of
the Llama models, Gemma-2, and multiple datasets demonstrates that
ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate
of 2.00%,
outperforming state-of-the-art methods. Additionally, ToxicDetector's
processing time of 0.0780 seconds per prompt makes it highly suitable for
real-time applications. ToxicDetector achieves high accuracy, efficiency, and
scalability, making it a practical method for toxic prompt detection in LLMs.
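The pipeline shape described above (concept embeddings -> feature vector -> MLP) can be sketched end to end. The feature construction via cosine similarities and the tiny random-weight MLP head are illustrative assumptions; the real classifier is trained on labeled prompts and the exact feature design is the paper's:

```python
import numpy as np

def concept_features(prompt_emb, concept_embs):
    """Feature vector in the ToxicDetector spirit: similarity of the
    prompt embedding to each toxic-concept prompt embedding (cosine
    similarity is an assumed choice here)."""
    p = prompt_emb / np.linalg.norm(prompt_emb)
    c = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    return c @ p  # one similarity score per toxic concept

def mlp_score(features, w1, b1, w2, b2):
    """One-hidden-layer MLP head; weights here are random placeholders
    standing in for a trained classifier."""
    h = np.maximum(0.0, features @ w1 + b1)  # ReLU hidden layer
    logit = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))      # toxicity probability

gen = np.random.default_rng(0)
concepts = gen.normal(size=(16, 32))        # 16 toxic-concept embeddings
prompt = gen.normal(size=32)                # embedding of the incoming prompt
feats = concept_features(prompt, concepts)  # 16-dim feature vector
score = mlp_score(feats, gen.normal(size=(16, 8)), np.zeros(8),
                  gen.normal(size=8), 0.0)
print(0.0 < score < 1.0)  # True: a probability to threshold for detection
```

The cheapness of this path (one embedding pass plus a tiny MLP) is consistent with the sub-0.1-second per-prompt latency the abstract reports.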
By Xu He, Xiaoyu Li, Di Kang, Jiangnan Ye, Chaopeng Zhang, Liyang Chen, Xiangjun Gao, Han Zhang, Zhiyong Wu, Haolin Zhuang
Existing works in single-image human reconstruction suffer from weak
generalizability due to insufficient training data or 3D inconsistencies for a
lack of comprehensive multi-view knowledge. In this paper, we introduce
MagicMan, a human-specific multi-view diffusion model designed to generate
high-quality novel view images from a single reference image. At its core, we
leverage a pre-trained 2D diffusion model as the generative prior for
generalizability, with the parametric SMPL-X model as the 3D body prior to
promote 3D awareness. To tackle the critical challenge of maintaining
consistency while achieving dense multi-view generation for improved 3D human
reconstruction, we first introduce hybrid multi-view attention to facilitate
both efficient and thorough information interchange across different views.
Additionally, we present a geometry-aware dual branch to perform concurrent
generation in both RGB and normal domains, further enhancing consistency via
geometry cues. Finally, to address ill-shaped results arising from inaccurate
SMPL-X estimations that conflict with the reference image, we
propose a novel iterative refinement strategy, which progressively optimizes
SMPL-X accuracy while enhancing the quality and consistency of the generated
multi-views. Extensive experimental results demonstrate that our method
significantly outperforms existing approaches in both novel view synthesis and
subsequent 3D human reconstruction tasks.