By Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
Large language models with vision capabilities (VLMs), e.g., GPT-4o and
Gemini 1.5 Pro are powering countless image-text applications and scoring high
on many vision-understanding benchmarks. Yet, we find that VLMs fail on seven
visual tasks that are absurdly easy for humans, such as identifying (a) whether
two circles overlap; (b) whether two lines intersect; (c) which letter is being
circled in a word; and (d) counting the number of circles in an Olympic-like
logo. The shockingly poor performance of four state-of-the-art VLMs suggests
their vision is, at best, like that of a person with myopia, seeing fine details
as blurry, and at worst, like that of an intelligent person who is blind, making
educated guesses. Code is available at: https://vlmsareblind.github.io/
By Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah
Synthetic data is becoming increasingly important for accelerating the
development of language models, both large and small. Despite several
successful use cases, researchers also raised concerns around model collapse
and drawbacks of imitating other models. This discrepancy can be attributed to
the fact that synthetic data varies in quality and diversity. Effective use of
synthetic data usually requires significant human effort in curating the data.
We focus on using synthetic data for post-training, specifically using powerful
models to create data that teaches a new skill or behavior to another model; we
refer to this setting as Generative Teaching. We introduce AgentInstruct, an
extensible agentic framework for automatically creating large amounts of
diverse and high-quality synthetic data. AgentInstruct can create both the
prompts and responses, using only raw data sources like text documents and code
files as seeds. We demonstrate the utility of AgentInstruct by creating a
post-training dataset of 25M pairs to teach language models different skills, such
as text editing, creative writing, tool usage, coding, reading comprehension,
etc. The dataset can be used for instruction tuning of any base model. We
post-train Mistral-7b with the data. When comparing the resulting model Orca-3
to Mistral-7b-Instruct (which uses the same base model), we observe significant
improvements across many benchmarks: for example, a 40% improvement on AGIEval,
19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval. Additionally, it
consistently outperforms other
models such as LLAMA-8B-instruct and GPT-3.5-turbo.
By Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun
The rapid advancement of large language models (LLMs) has paved the way for
the development of highly capable autonomous agents. However, existing
multi-agent frameworks often struggle with integrating diverse capable
third-party agents due to reliance on agents defined within their own
ecosystems. They also face challenges in simulating distributed environments,
as most frameworks are limited to single-device setups. Furthermore, these
frameworks often rely on hard-coded communication pipelines, limiting their
adaptability to dynamic task requirements. Inspired by the concept of the
Internet, we propose the Internet of Agents (IoA), a novel framework that
addresses these limitations by providing a flexible and scalable platform for
LLM-based multi-agent collaboration. IoA introduces an agent integration
protocol, an instant-messaging-like architecture design, and dynamic mechanisms
for agent teaming and conversation flow control. Through extensive experiments
on general assistant tasks, embodied AI tasks, and retrieval-augmented
generation benchmarks, we demonstrate that IoA consistently outperforms
state-of-the-art baselines, showcasing its ability to facilitate effective
collaboration among heterogeneous agents. IoA represents a step towards linking
diverse agents in an Internet-like environment, where agents can seamlessly
collaborate to achieve greater intelligence and capabilities. Our codebase has
been released at https://github.com/OpenBMB/IoA.
The performance of Large Vision Language Models (LVLMs) is dependent on the
size and quality of their training datasets. Existing video instruction tuning
datasets lack diversity as they are derived by prompting large language models
with video captions to generate question-answer pairs, and are therefore mostly
descriptive. Meanwhile, many labeled video datasets with diverse labels and
supervision exist; however, we find that integrating them into LVLMs is
non-trivial. Herein, we present Video Self-Training with augmented Reasoning
(Video-STaR), the first video self-training approach. Video-STaR allows the
utilization of any labeled video dataset for video instruction tuning. In
Video-STaR, an LVLM cycles between instruction generation and finetuning, which
we show (I) improves general video understanding and (II) adapts LVLMs to novel
downstream tasks with existing supervision. During generation, an LVLM is
prompted to propose an answer. The answers are then filtered to retain only
those that contain the original video labels, and the LVLM is re-trained on the
generated dataset. By only training on generated answers that contain the
correct video labels, Video-STaR utilizes these existing video labels as weak
supervision for video instruction tuning. Our results demonstrate that
Video-STaR-enhanced LVLMs exhibit improved performance in (I) general video QA,
where TempCompass performance improved by 10%, and (II) on downstream tasks,
where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality
assessment on FineDiving by 15%.
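The label-filtering step described above can be sketched in a few lines; the substring check and tuple layout below are illustrative stand-ins, not Video-STaR's actual verification code:

```python
def filter_by_label(samples):
    """Video-STaR's filtering step, sketched: keep only generated answers
    that contain the video's ground-truth label, so existing labels act as
    weak supervision. `samples` holds (question, generated_answer,
    video_label) tuples; the case-insensitive substring check is an
    illustrative stand-in for the paper's answer verification."""
    return [(q, a) for q, a, label in samples
            if label.lower() in a.lower()]
```

The surviving (question, answer) pairs form the dataset the LVLM is re-trained on in the next cycle.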
We present RodinHD, which can generate high-fidelity 3D avatars from a
portrait image. Existing methods fail to capture intricate details such as
hairstyles which we tackle in this paper. We first identify an overlooked
problem of catastrophic forgetting that arises when fitting triplanes
sequentially on many avatars, caused by the MLP decoder sharing scheme. To
overcome this issue, we propose a novel data scheduling strategy and a
weight-consolidation regularization term, which improve the decoder's capability of
rendering sharper details. Additionally, we optimize the guiding effect of the
portrait image by computing a finer-grained hierarchical representation that
captures rich 2D texture cues and injecting them into the 3D diffusion model at
multiple layers via cross-attention. When trained on 46K avatars with a noise
schedule optimized for triplanes, the resulting model can generate 3D avatars
with notably better details than previous methods and can generalize to
in-the-wild portrait input.
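The weight-consolidation idea can be illustrated with an EWC-style quadratic penalty on the shared MLP decoder; the abstract does not give RodinHD's exact regularizer, so the form and names below are assumptions:

```python
import numpy as np

def consolidation_loss(weights, anchor, importance, lam=1.0):
    """EWC-style weight-consolidation penalty (a sketch, not RodinHD's
    exact term): penalize the shared decoder for drifting from anchor
    weights learned on earlier avatars, scaled per-parameter by an
    importance estimate, to counter catastrophic forgetting."""
    return lam * float(np.sum(importance * (weights - anchor) ** 2))
```

Adding such a term to the per-avatar fitting loss discourages updates that overwrite parameters important to previously fitted avatars.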
By Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel
Training large language models (LLMs) in low-resource languages such as
Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and
DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a
substantial corpus of approximately 200 billion tokens in both Hebrew and
English. Adapting a pre-trained model to a new language involves specialized
techniques that differ significantly from training a model from scratch or
further training existing models on well-resourced languages such as English.
We outline these novel training methodologies, which facilitate effective
learning and adaptation to the linguistic properties of Hebrew. Additionally,
we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to
enhance its performance on task-specific instructions. To rigorously evaluate
our models, we introduce a new benchmark suite for Hebrew LLM evaluation,
covering a diverse set of tasks including Question Answering, Sentiment
Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work
not only addresses the intricacies of training LLMs in low-resource languages
but also proposes a framework that can be leveraged for adapting other LLMs to
various non-English languages, contributing to the broader field of
multilingual NLP.
Sora's high-motion intensity and long consistent videos have significantly
impacted the field of video generation, attracting unprecedented attention.
However, existing publicly available datasets are inadequate for generating
Sora-like videos, as they mainly contain short videos with low motion intensity
and brief captions. To address these issues, we propose MiraData, a
high-quality video dataset that surpasses previous ones in video duration,
caption detail, motion strength, and visual quality. We curate MiraData from
diverse, manually selected sources and meticulously process the data to obtain
semantically consistent clips. GPT-4V is employed to annotate structured
captions, providing detailed descriptions from four different perspectives
along with a summarized dense caption. To better assess temporal consistency
and motion intensity in video generation, we introduce MiraBench, which
enhances existing benchmarks by adding 3D consistency and tracking-based motion
strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics
covering temporal consistency, motion strength, 3D consistency, visual quality,
text-video alignment, and distribution similarity. To demonstrate the utility
and effectiveness of MiraData, we conduct experiments using our DiT-based video
generation model, MiraDiT. The experimental results on MiraBench demonstrate
the superiority of MiraData, especially in motion strength.
We introduce BM25S, an efficient Python-based implementation of BM25 that
only depends on Numpy and Scipy. BM25S achieves up to a 500x speedup compared
to the most popular Python-based framework by eagerly computing BM25 scores
during indexing and storing them into sparse matrices. It also achieves
considerable speedups compared to highly optimized Java-based implementations,
which are used by popular commercial products. Finally, BM25S reproduces the
exact implementation of five BM25 variants based on Kamphuis et al. (2020) by
extending eager scoring to non-sparse variants using a novel score shifting
method. The code can be found at https://github.com/xhluca/bm25s
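The eager-scoring idea, computing every BM25 term weight at indexing time and storing it in a sparse token-by-document matrix so that querying reduces to summing a few rows, can be sketched as follows (a minimal illustration, not the BM25S implementation; function names are ours):

```python
import numpy as np
from scipy import sparse

def build_bm25_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly compute the BM25 weight of every (token, document) pair and
    store the result in a sparse CSR matrix (tokens x documents)."""
    vocab, rows, cols, tf_vals = {}, [], [], []
    doc_lens = np.array([len(doc) for doc in corpus_tokens], dtype=float)
    avgdl, n_docs = doc_lens.mean(), len(corpus_tokens)
    for j, doc in enumerate(corpus_tokens):
        counts = {}
        for tok in doc:
            counts[tok] = counts.get(tok, 0) + 1
        for tok, tf in counts.items():
            rows.append(vocab.setdefault(tok, len(vocab)))
            cols.append(j)
            tf_vals.append(tf)
    tf = sparse.csr_matrix((tf_vals, (rows, cols)), shape=(len(vocab), n_docs))
    df = np.diff(tf.indptr)                       # document frequency per token
    idf = np.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    coo = tf.tocoo()
    denom = coo.data + k1 * (1 - b + b * doc_lens[coo.col] / avgdl)
    data = idf[coo.row] * coo.data * (k1 + 1) / denom
    index = sparse.csr_matrix((data, (coo.row, coo.col)), shape=tf.shape)
    return index, vocab

def query(index, vocab, query_tokens):
    """Score all documents by summing the precomputed rows of query tokens."""
    sel = [vocab[t] for t in query_tokens if t in vocab]
    if not sel:
        return np.zeros(index.shape[1])
    return np.asarray(index[sel].sum(axis=0)).ravel()
```

Because the per-token weights are fixed at indexing time, query time involves no per-document recomputation, which is where the speedup comes from.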
Proving mathematical theorems using computer-verifiable formal languages like
Lean significantly impacts mathematical reasoning. One approach to formal
theorem proving involves generating complete proofs using Large Language Models
(LLMs) based on Natural Language (NL) proofs. Similar methods have shown
promising results in code generation. However, most modern LLMs exhibit
suboptimal performance due to the scarcity of aligned NL and Formal Language
(FL) theorem-proving data. This scarcity results in a paucity of methodologies
for training LLMs and techniques to fully utilize their capabilities in
composing formal proofs. To address these challenges, this paper proposes
**TheoremLlama**, an end-to-end framework to train a general-purpose LLM to
become a Lean4 expert. This framework encompasses NL-FL aligned dataset
generation methods, training approaches for the LLM formal theorem prover, and
techniques for LLM Lean4 proof writing. Using the dataset generation method, we
provide *Open Bootstrapped Theorems* (OBT), an NL-FL aligned and bootstrapped
dataset. A key innovation in this framework is the NL-FL bootstrapping method,
where NL proofs are integrated into Lean4 code for training datasets,
leveraging the NL reasoning ability of LLMs for formal reasoning. The
**TheoremLlama** framework achieves cumulative accuracies of 36.48% and 33.61%
on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline
of 22.95% and 25.41%. We have also open-sourced our model checkpoints and
generated dataset, and will soon make all the code publicly available.
By Frederic Z. Zhang, Paul Albert, Cristian Rodriguez-Opazo, Anton van den Hengel, Ehsan Abbasnejad
Pre-trained models produce strong generic representations that can be adapted
via fine-tuning. The learned weight difference relative to the pre-trained
model, known as a task vector, characterises the direction and stride of
fine-tuning. The significance of task vectors is such that simple arithmetic
operations on them can be used to combine diverse representations from
different domains. This paper builds on these properties of task vectors and
aims to answer (1) whether components of task vectors, particularly parameter
blocks, exhibit similar characteristics, and (2) how such blocks can be used to
enhance knowledge composition and transfer. To this end, we introduce aTLAS, an
algorithm that linearly combines parameter blocks with different learned
coefficients, resulting in anisotropic scaling at the task vector level. We
show that such linear combinations explicitly exploit the low intrinsic
dimensionality of pre-trained models, with only a few coefficients being the
learnable parameters. Furthermore, composition of parameter blocks leverages
the already learned representations, thereby reducing the dependency on large
amounts of data. We demonstrate the effectiveness of our method in task
arithmetic, few-shot recognition and test-time adaptation, with supervised or
unsupervised objectives. In particular, we show that (1) learned anisotropic
scaling allows task vectors to be more disentangled, causing less interference
in composition; (2) task vector composition excels with scarce or no labeled
data and is less prone to domain shift, thus leading to better
generalisability; (3) mixing the most informative parameter blocks across
different task vectors prior to training can reduce the memory footprint and
improve the flexibility of knowledge transfer. Moreover, we show the potential
of aTLAS as a PEFT method, particularly with less data, and demonstrate its
scalability.
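The core composition step, linearly combining parameter blocks from several task vectors with a separate learned coefficient per block, can be sketched as follows (the dict layout and names are illustrative, and the learning of the coefficients is omitted):

```python
import numpy as np

def compose_task_vectors(pretrained, task_vectors, coeffs):
    """aTLAS-style composition, sketched: each task vector is a dict of
    parameter blocks (block name -> weight delta), and coeffs[t][name]
    is the learned scalar for block `name` of task vector t. Per-block
    scalars give anisotropic scaling at the task-vector level; the
    coefficients are the only learnable parameters."""
    merged = {name: w.copy() for name, w in pretrained.items()}
    for tv, cs in zip(task_vectors, coeffs):
        for name, delta in tv.items():
            merged[name] = merged[name] + cs[name] * delta
    return merged
```

With only one scalar per (task vector, block) pair, the number of learnable parameters stays tiny, which is how the method exploits the low intrinsic dimensionality of the pre-trained model.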
By Yung-Sung Chuang, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, James Glass
When asked to summarize articles or answer questions given a passage, large
language models (LLMs) can hallucinate details and respond with unsubstantiated
answers that are inaccurate with respect to the input context. This paper
describes a simple approach for detecting such contextual hallucinations. We
hypothesize that contextual hallucinations are related to the extent to which
an LLM attends to information in the provided context versus its own
generations. Based on this intuition, we propose a simple hallucination
detection model whose input features are given by the ratio of attention
weights on the context versus newly generated tokens (for each attention head).
We find that a linear classifier based on these lookback ratio features is as
effective as a richer detector that utilizes the entire hidden states of an LLM
or a text-based entailment model. The lookback ratio-based detector -- Lookback
Lens -- is found to transfer across tasks and even models, allowing a detector
that is trained on a 7B model to be applied (without retraining) to a larger
13B model. We further apply this detector to mitigate contextual
hallucinations, and find that a simple classifier-guided decoding approach is
able to reduce the amount of hallucination, for example by 9.6% in the XSum
summarization task.
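The lookback ratio feature can be sketched per generated token as follows (the shapes and the epsilon guard are our assumptions; the paper computes one ratio per attention head per token and feeds the resulting features to a linear classifier):

```python
import numpy as np

def lookback_ratio(attn, context_len):
    """Per-head lookback ratio for one newly generated token.
    attn: (num_heads, num_prior_positions) attention weights over all
    prior positions; the first `context_len` positions are the provided
    context, the rest are previously generated tokens. Returns, per head,
    the fraction of attention mass placed on the context."""
    ctx = attn[:, :context_len].sum(axis=1)
    new = attn[:, context_len:].sum(axis=1)
    return ctx / (ctx + new + 1e-12)  # epsilon avoids division by zero
```

A low ratio means the head is attending mostly to the model's own generations rather than the source context, the signal the Lookback Lens classifier exploits.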
By Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pour Ansari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi
Humans describe complex scenes with compositionality, using simple text
descriptions enriched with links and relationships. While vision-language
research has aimed to develop models with compositional understanding
capabilities, this is not reflected yet in existing datasets which, for the
most part, still use plain text to describe images. In this work, we propose a
new annotation strategy, graph-based captioning (GBC) that describes an image
using a labelled graph structure, with nodes of various types. In a first
stage, GBC nodes are created with object detection and dense captioning tools,
nested recursively to uncover and describe entity nodes; in a second stage,
new types of nodes link these entities together, highlighting their
compositions and relations. Since all GBC nodes hold plain text
descriptions, GBC retains the flexibility found in natural language, but can
also encode hierarchical information in its edges. We demonstrate that GBC can
be produced automatically, using off-the-shelf multimodal LLMs and
open-vocabulary detection models, by building a new dataset, GBC10M, gathering
GBC annotations for about 10M images of the CC12M dataset. We use GBC10M to
showcase the wealth of node captions uncovered by GBC, as measured with CLIP
training. We show that using GBC nodes' annotations -- notably those stored in
composition and relation nodes -- results in significant performance boost on
downstream models when compared to other dataset formats. To further explore
the opportunities provided by GBC, we also propose a new attention mechanism
that can leverage the entire GBC graph, with encouraging experimental results
that show the extra benefits of incorporating the graph structure. Our datasets
are released at https://huggingface.co/graph-based-captions.
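The graph structure described above might be represented as typed nodes holding plain-text captions and labelled edges; the field names below are illustrative, not the released dataset's schema:

```python
from dataclasses import dataclass, field

@dataclass
class GBCNode:
    """One node of a graph-based caption: a plain-text description plus
    labelled edges to child nodes (a minimal sketch of the structure)."""
    node_type: str   # e.g. "image", "entity", "composition", "relation"
    caption: str
    children: list = field(default_factory=list)  # (edge_label, GBCNode) pairs

def all_captions(node):
    """Collect every caption in the graph, depth-first."""
    out = [node.caption]
    for _, child in node.children:
        out.extend(all_captions(child))
    return out
```

Since each node's payload is plain text, flattening the graph recovers ordinary captions, while the typed edges retain the hierarchical and relational information.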
By Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov
Existing text-to-video diffusion models rely solely on text-only encoders for
their pretraining. This limitation stems from the absence of large-scale
multimodal prompt video datasets, resulting in a lack of visual grounding and
restricting their versatility and application in multimodal integration. To
address this, we construct a large-scale multimodal prompt dataset by employing
retrieval methods to pair in-context examples with the given text prompts and
then utilize a two-stage training strategy to enable diverse video generation
tasks within the same model. In the first stage, we propose a multimodal
conditional video generation framework for pretraining on these augmented
datasets, establishing a foundational model for grounded video generation.
Secondly, we finetune the model from the first stage on three video generation
tasks, incorporating multi-modal instructions. This process further refines the
model's ability to handle diverse inputs and tasks, ensuring seamless
integration of multi-modal information. After this two-stage training process,
VIMI demonstrates multimodal understanding capabilities, producing contextually
rich and personalized videos grounded in the provided inputs, as shown in
Figure 1. Compared to previous visually grounded video generation methods, VIMI
can synthesize consistent and temporally coherent videos with large motion
while retaining the semantic control. Lastly, VIMI also achieves
state-of-the-art text-to-video generation results on the UCF101 benchmark.
Large language models (LLMs) often exhibit undesirable behaviors, such as
hallucinations and sequence repetitions. We propose to view these behaviors as
fallbacks that models exhibit under uncertainty, and investigate the connection
between them. We categorize fallback behaviors -- sequence repetitions,
degenerate text, and hallucinations -- and extensively analyze them in models
from the same family that differ by the amount of pretraining tokens, parameter
count, or the inclusion of instruction-following training. Our experiments
reveal a clear and consistent ordering of fallback behaviors, across all these
axes: the more advanced an LLM is (i.e., trained on more tokens, with more
parameters, or instruction-tuned), the more its fallback behavior shifts from
sequence repetitions to degenerate text, and then to hallucinations. Moreover, the same
ordering is observed throughout a single generation, even for the
best-performing models; as uncertainty increases, models shift from generating
hallucinations to producing degenerate text and then sequence repetitions.
Lastly, we demonstrate that while common decoding techniques, such as random
sampling, might alleviate some unwanted behaviors like sequence repetitions,
they increase harder-to-detect hallucinations.
By Bojana Bašaragin, Adela Ljajić, Darija Medvecki, Lorenzo Cassano, Miloš Košprdić, Nikola Milošević
Large language models (LLMs) have recently become the leading source of
answers for users' questions online. Despite their ability to offer eloquent
answers, their accuracy and reliability can pose a significant challenge. This
is especially true for sensitive domains such as biomedicine, where there is a
higher need for factually correct answers. This paper introduces a biomedical
retrieval-augmented generation (RAG) system designed to enhance the reliability
of generated responses. The system is based on a fine-tuned LLM for the
referenced question answering, where relevant abstracts retrieved from PubMed
are passed to the LLM's context through a prompt. Its output is an answer
based on PubMed abstracts, where each statement is referenced accordingly,
allowing the users to verify the answer. Our retrieval system achieves an
absolute improvement of 23% compared to the PubMed search engine. Based on the
manual evaluation on a small sample, our fine-tuned LLM component achieves
comparable results to GPT-4 Turbo in referencing relevant abstracts. We make
the dataset used to fine-tune the models and the fine-tuned models based on
Mistral-7B-instruct-v0.1 and v0.2 publicly available.
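The prompting scheme, numbered PubMed abstracts placed in the context so each statement in the answer can cite its source, might be assembled like this (the exact prompt format is not specified in the abstract; this template is an assumption):

```python
def build_referenced_prompt(question, abstracts):
    """Assemble a referenced-QA prompt (illustrative template): retrieved
    abstracts are numbered so the model can cite each statement as [n],
    letting users verify the answer against the sources."""
    refs = "\n\n".join(f"[{i}] {a}" for i, a in enumerate(abstracts, start=1))
    return ("Answer the question using only the abstracts below. "
            "Cite the supporting abstract for each statement as [n].\n\n"
            f"{refs}\n\nQuestion: {question}\nAnswer:")
```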
Recent advancements in language modeling have shown promising results when
applied to time series data. In particular, fine-tuning pre-trained large
language models (LLMs) for time series classification tasks has achieved
state-of-the-art (SOTA) performance on standard benchmarks. However, these
LLM-based models have a significant drawback due to the large model size, with
the number of trainable parameters in the millions. In this paper, we propose
an alternative approach to leveraging the success of language modeling in the
time series domain. Instead of fine-tuning LLMs, we utilize a language
embedding model to embed time series and then pair the embeddings with a simple
classification head composed of convolutional neural networks (CNN) and
multilayer perceptron (MLP). We conducted extensive experiments on
well-established time series classification benchmark datasets. We demonstrate
that our method, LETS-C, not only outperforms the current SOTA in classification accuracy but
also offers a lightweight solution, using only 14.5% of the trainable
parameters on average compared to the SOTA model. Our findings suggest that
leveraging language encoders to embed time series data, combined with a simple
yet effective classification head, offers a promising direction for achieving
high-performance time series classification while maintaining a lightweight
model architecture.
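The pipeline, embedding the time series with a language embedding model and classifying with a small head, can be sketched as follows; the random projection stands in for the real text encoder applied to serialized values, and the CNN part of the paper's CNN + MLP head is omitted for brevity:

```python
import numpy as np

def embed_series(series, dim=16, seed=0):
    """Stand-in for the language embedding model: a fixed random projection
    of fixed-length series (illustrative only; LETS-C serializes the values
    to text and applies a real language encoder)."""
    proj = np.random.default_rng(seed).normal(size=(series.shape[-1], dim))
    return series @ proj

def mlp_head(x, w1, b1, w2, b2):
    """Lightweight classification head: one hidden ReLU layer, then logits.
    Only these head parameters would be trained; the encoder stays frozen."""
    h = np.maximum(0.0, x @ w1 + b1)
    return h @ w2 + b2
```

Keeping the encoder frozen and training only the small head is what makes the parameter count a fraction of a fine-tuned LLM's.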