Daily curated AI research papers, with translations
We introduce Ling 2.0, a series of reasoning-oriented language foundation models built on the principle that every activation strengthens reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, the series emphasizes high sparsity, cross-scale consistency, and efficiency, guided by empirical scaling laws. It includes three non-thinking (instruct) models, Ling-mini-2.0, Ling-flash-2.0, and Ling-1T, ranging from 16B to 1T total parameters and achieving up to 7x active-compute efficiency compared with dense models. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data with mid-training CoT activation, reinforcement-learning-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for developing future reasoning and thinking models, including the Ring series built on the same base.
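As a rough illustration of the high-sparsity MoE idea behind this series, the sketch below implements top-k expert routing in PyTorch, where only a few of many experts run per token; the class name, layer sizes, and expert count are illustrative assumptions, not the actual Ling 2.0 configuration.

```python
# Minimal sketch of a high-sparsity MoE layer with top-k routing, showing why only a
# small fraction of parameters is active per token. Sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=64, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # only top_k of n_experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

x = torch.randn(8, 512)
print(SparseMoE()(x).shape)  # torch.Size([8, 512]); ~4 of 64 experts active per token
```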
Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shift. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving the core challenges of physically embodied models remains unexplored. We propose EBT-Policy, a new energy-based architecture that addresses core challenges in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies while requiring less training and inference compute. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared with Diffusion Policy's 100 steps. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning, without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shift.
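To make the energy-based inference idea concrete, here is a minimal sketch of refining an action with a couple of gradient steps on a learned scalar energy E(obs, action); the tiny MLP, dimensions, learning rate, and step count are assumptions for illustration, not the EBT-Policy architecture.

```python
# Minimal sketch of energy-based action inference: actions are refined by a few
# gradient steps on a learned scalar energy, rather than a long denoising chain.
import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    def __init__(self, obs_dim=16, act_dim=7, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):               # returns one scalar energy per sample
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def infer_action(model, obs, act_dim=7, steps=2, lr=0.1):
    act = torch.zeros(obs.shape[0], act_dim, requires_grad=True)
    for _ in range(steps):                     # very few refinement steps
        energy = model(obs, act).sum()
        grad, = torch.autograd.grad(energy, act)
        act = (act - lr * grad).detach().requires_grad_(True)
    # the final energy doubles as an uncertainty signal for dynamic compute allocation
    return act.detach(), model(obs, act).detach()

obs = torch.randn(4, 16)
action, energy = infer_action(EnergyModel(), obs)
print(action.shape, energy.shape)  # torch.Size([4, 7]) torch.Size([4])
```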
Test-time scaling (TTS) improves large language models (LLMs) by allocating additional computation at inference time, typically through parallel, sequential, or hybrid scaling. However, prior work often assumes a fixed collaboration architecture (e.g., topology) and a single model, overlooking that the optimal architecture and model combination vary across tasks. We therefore study a new problem: searching for the compute-optimal model combination and architecture for TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, in which nodes encode roles and LLM model assignments and edges capture information flow. This problem is difficult for two reasons: (i) the search space is combinatorially explosive, and (ii) task-specific requirements demand tailored designs. To address these challenges, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights about TTS collaboration graphs. Building on these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that maps the sampling-gradient-update pipeline to sampling-feedback-update, using textual feedback as gradients to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experimental results show that Agent-REINFORCE outperforms traditional and LLM-based baselines in both sample efficiency and search performance, and effectively identifies optimal graphs under the joint objective of accuracy and inference latency.
We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control over world simulation. Trained on 200 million curated video clips and refined with reinforcement-learning-based post-training, [Cosmos-Predict2.5] substantially outperforms [Cosmos-Predict1] in video quality and instruction alignment, with models released at the 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a ControlNet-style framework for Sim2Real and Real2Real world translation. Despite being 3.5x smaller than [Cosmos-Transfer1], it delivers higher-fidelity and more robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pre-trained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.
Recent advances in multimodal generative models have brought substantial improvements to image editing. However, current generative models still struggle with diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark that systematically evaluates performance across varied reasoning scenarios. Existing benchmarks focus mainly on single-object attribute transformations in realistic scenarios, which, while useful, face two critical limitations: (1) they largely ignore game-world scenarios involving multi-object interactions and human-defined rules, which are common in real-world applications; and (2) they rely solely on textual references to evaluate generated images, which can lead to systematic misjudgments, especially in complex reasoning scenarios. To address these issues, we propose UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 carefully curated samples covering both real-world and game-world scenarios across 8 primary and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both a textual reference and a ground-truth image reference for each sample. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset to obtain UniREdit-Bagel, which shows substantial improvements in both in-domain and out-of-domain settings. Through extensive benchmarking of open-source and closed-source image editing models, we reveal their strengths and weaknesses across various dimensions.
Graph neural networks operate via bottom-up message passing, which differs fundamentally from human visual perception, where global structure is grasped first. In this work, we investigate the underappreciated potential of vision models for graph understanding and find that they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. Motivated by these divergent behaviors and by the limitations of existing benchmarks that conflate domain features with topological understanding, we propose GraphAbstract. This benchmark evaluates a model's ability to perceive global graph properties the way humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Results show that vision models substantially outperform GNNs on tasks requiring holistic structural understanding and remain generalizable across graph scales, whereas GNNs struggle with global pattern abstraction and degrade as graph size grows. This work demonstrates that vision models possess notable yet underexploited capabilities for graph structure understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open a new avenue for leveraging this underestimated potential to develop more effective graph foundation models for tasks dominated by holistic pattern recognition.
Relighting is an important task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, because these models are typically optimized in a semantic latent space, where proximity does not guarantee physical correctness in the visual space, they often produce unrealistic results such as overexposed highlights, misplaced shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos. Our method adds RGB-space geometry feedback to a flow-matching backbone, supervising the model with depth and normal maps extracted from its outputs so that lighting effects are explicitly aligned with scene structure, improving physical plausibility. However, this feedback requires high-quality outputs for supervision in the visual space, and standard multi-step denoising is computationally expensive. To mitigate this, we adopt path-consistency learning so that supervision remains effective under a few-step training regime. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol that captures the core lighting attributes. Building on it, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability with large vision-language models, enabling automatic and interpretable assessment of relighting accuracy along each individual dimension. Extensive experiments show that UniLumos achieves state-of-the-art relighting quality with substantially improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.
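The RGB-space geometry feedback can be sketched as follows: depth and normal maps estimated from the relit output are compared against those of the reference so that lighting edits do not distort geometry. The depth/normal estimators below are stubs standing in for pretrained predictors, and the loss form is an illustrative assumption rather than the paper's exact objective.

```python
# Illustrative geometry-feedback loss: depth L1 plus normal cosine distance between
# the relit output and a reference, computed from stub depth/normal estimators.
import torch
import torch.nn.functional as F

def fake_depth(rgb):      # stub: (B, 3, H, W) -> (B, 1, H, W)
    return rgb.mean(dim=1, keepdim=True)

def fake_normals(depth):  # stub surface normals from depth gradients
    dy = F.pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))
    dx = F.pad(depth[..., :, 1:] - depth[..., :, :-1], (0, 1, 0, 0))
    n = torch.cat([-dx, -dy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def geometry_feedback_loss(relit_rgb, reference_rgb):
    d_out, d_ref = fake_depth(relit_rgb), fake_depth(reference_rgb)
    n_out, n_ref = fake_normals(d_out), fake_normals(d_ref)
    depth_term = F.l1_loss(d_out, d_ref)
    normal_term = (1 - (n_out * n_ref).sum(dim=1)).mean()  # cosine distance
    return depth_term + normal_term

out = torch.rand(2, 3, 64, 64)
ref = torch.rand(2, 3, 64, 64)
print(geometry_feedback_loss(out, ref).item())
```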
Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model's thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the level of atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.
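A toy sketch of the transition-aware idea (not the paper's exact formulation): given atomic reasoning segments labeled as factually helpful or defective, empirical transition probabilities along the chain re-weight a simple per-segment reward, up-weighting moves toward the helpful state and down-weighting moves toward the defective one.

```python
# Illustrative transition-aware re-weighting over a labeled chain of reasoning segments.
from collections import Counter

def transition_weights(segment_labels):
    pairs = list(zip(segment_labels, segment_labels[1:]))
    counts = Counter(pairs)
    total_from = Counter(a for a, _ in pairs)
    # empirical transition probabilities P(next_state | current_state)
    p = {pair: counts[pair] / total_from[pair[0]] for pair in counts}
    weights = []
    for prev, cur in pairs:
        base = 1.0 if cur == 1 else -1.0        # reinforce helpful, suppress defective
        weights.append(base * p[(prev, cur)])
    return p, weights

labels = [1, 1, 0, 1, 1, 0, 0, 1]               # 1 = helpful segment, 0 = defective
probs, w = transition_weights(labels)
print(probs)   # e.g. P(1->1), P(1->0), P(0->1), P(0->0)
print(w)       # per-transition implicit reward weights
```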
Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address the pressing need to test reciprocal cross-modal reasoning: the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, containing 1312 tasks grounded in 1876 images and spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.
Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often introduce physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. In response, we introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that leverages human video at scale, while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, producing motions that are both large-scale and physically reliable. We evaluated PHUMA in two sets of conditions: (i) imitation of unseen motion from self-recorded test videos and (ii) path following with pelvis-only guidance. In both cases, PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant gains in imitating diverse motions. The code is available at https://davian-robotics.github.io/PHUMA.
Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons: (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference, without incurring growth in computational cost due to increasing context windows. A key to our approach is introducing carefully designed sliding-window causal attention, combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolations with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real-time, delivering a truly interactive experience.
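The constant-cost streaming mechanism can be sketched as a rolling KV cache with attention sinks: the first few "sink" tokens are always retained, while the rest of the cache is a fixed-size sliding window, so per-step cost stays constant for arbitrarily long streams. The class name and sizes below are illustrative, not MotionStream's actual implementation.

```python
# Minimal sketch of a rolling KV cache with attention sinks for streaming generation.
import torch

class RollingKVCache:
    def __init__(self, n_sink=4, window=256):
        self.n_sink, self.window = n_sink, window
        self.k = None  # (heads, seq, dim)
        self.v = None

    def append(self, k_new, v_new):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=1)
            self.v = torch.cat([self.v, v_new], dim=1)
        if self.k.shape[1] > self.n_sink + self.window:
            # keep the sink tokens, drop the oldest non-sink entries
            keep_from = self.k.shape[1] - self.window
            self.k = torch.cat([self.k[:, :self.n_sink], self.k[:, keep_from:]], dim=1)
            self.v = torch.cat([self.v[:, :self.n_sink], self.v[:, keep_from:]], dim=1)
        return self.k, self.v

cache = RollingKVCache(n_sink=4, window=8)
for step in range(20):                      # a stream of single-token steps
    k, v = cache.append(torch.randn(2, 1, 16), torch.randn(2, 1, 16))
print(k.shape)  # torch.Size([2, 12, 16]): 4 sinks + 8-token window, constant size
```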
Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigate visual context degradation in long-horizon VQA tasks. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively to augment the MLLM with local perception through the integration of external tools: Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains, including VQA 2.0, ScienceQA, MAT-Search and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets.
We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters, excelling at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capability. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its immense size of 560B parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme specifically designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This innovative approach demonstrates exceptional efficiency by sustaining over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. Furthermore, it delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.
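As a hedged illustration of the teacher-side objective described above, the sketch below combines a difficulty reward that peaks near a target student solve rate with a diversity reward that penalizes overlap with previously generated problems; the token-overlap similarity, weights, and target are stand-in assumptions, not OpenSIR's actual reward.

```python
# Illustrative teacher rewards for self-play problem generation.
def difficulty_reward(student_solve_rate, target=0.5):
    # highest when the problem is appropriately challenging (neither trivial nor impossible)
    return 1.0 - abs(student_solve_rate - target) / max(target, 1.0 - target)

def diversity_reward(new_problem, previous_problems):
    new_tokens = set(new_problem.lower().split())
    if not previous_problems:
        return 1.0
    overlaps = [
        len(new_tokens & set(p.lower().split())) / max(len(new_tokens), 1)
        for p in previous_problems
    ]
    return 1.0 - max(overlaps)   # reward distance from the closest existing problem

history = ["What is 2 + 2?", "Solve x + 3 = 7 for x."]
problem = "A train travels 60 km in 1.5 hours. What is its average speed?"
reward = 0.5 * difficulty_reward(student_solve_rate=0.4) + 0.5 * diversity_reward(problem, history)
print(round(reward, 3))
```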
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.
The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-with-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-sourced and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.
Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
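A minimal sketch of a semantic-aware trace score in this spirit: Dynamic Time Warping distance between predicted and expert traces, an endpoint error, and a per-pixel semantic penalty are combined into a single score. The weights, penalty map, and score mapping are assumptions for illustration, not the exact NaviTrace metric.

```python
# Illustrative trace scoring: DTW + endpoint error + semantic penalty along the path.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

def trace_score(pred, expert, penalty_map, w=(1.0, 1.0, 1.0)):
    dtw = dtw_distance(pred, expert)
    endpoint = np.linalg.norm(pred[-1] - expert[-1])
    semantic = np.mean([penalty_map[int(y), int(x)] for x, y in pred])
    # map a weighted sum of errors into (0, 1]; higher is better
    return 1.0 / (1.0 + w[0] * dtw + w[1] * endpoint + w[2] * semantic)

expert = np.array([[10.0, 10.0], [20.0, 15.0], [30.0, 20.0]])
pred = np.array([[10.0, 12.0], [22.0, 14.0], [31.0, 22.0]])
penalty = np.zeros((64, 64))          # 0 = traversable, >0 = penalized pixels
print(round(trace_score(pred, expert, penalty), 3))
```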
Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, image-based wordplay, etc., making this a challenging task for even current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\text{BUS}\,\right|$, a large and diverse benchmark of 1,333 English Rebus Puzzles spanning different artistic styles and difficulty levels, spread across 18 categories such as food, idioms, sports, finance, entertainment, etc. We also propose RebusDescProgICE, a model-agnostic framework which uses a combination of an unstructured description and code-based, structured reasoning, along with better, reasoning-based in-context example selection, improving the performance of Vision-Language Models on $\left|\,\circlearrowright\,\text{BUS}\,\right|$ by 2.1-4.1% and 20-30% using closed-source and open-source models respectively, compared to Chain-of-Thought reasoning.
Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs), as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measuring instruments, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to large numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on the in-domain synthetic subset but less promising ones for real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove's data management features reduce memory consumption by a factor of 2.6. Moreover, Trove's easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
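The sketch below illustrates gradient-based influence scoring with random projection: per-example gradients are projected to a low dimension and scored by their dot product with a projected target gradient. A dense Gaussian projection and a plain first-order score stand in for the sparse projection and off-policy influence estimator described above; the toy regression model is purely illustrative.

```python
# Illustrative influence scoring: projected per-example gradients dotted with a
# projected validation gradient; top-scoring examples would be selected for training.
import torch

def flat_grad(loss, params):
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
params = list(model.parameters())
dim = sum(p.numel() for p in params)
proj = torch.randn(dim, 64) / 64 ** 0.5            # random projection matrix (dense stand-in)

x_train, y_train = torch.randn(100, 10), torch.randn(100, 1)
x_val, y_val = torch.randn(20, 10), torch.randn(20, 1)

val_loss = torch.nn.functional.mse_loss(model(x_val), y_val)
g_val = flat_grad(val_loss, params) @ proj          # projected target gradient

scores = []
for i in range(len(x_train)):                       # influence score per training example
    loss_i = torch.nn.functional.mse_loss(model(x_train[i:i + 1]), y_train[i:i + 1])
    scores.append(float(flat_grad(loss_i, params) @ proj @ g_val))

top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
print(top)  # indices of the 10 most influential training examples under this proxy
```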
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected into the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning. Experimental results show that our approach significantly activates the spatial reasoning ability of the MLLM, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.
Finding the right north-star metrics is highly critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focus on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists, that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1000 human gradings of proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community towards advancing robust mathematical reasoning and release it at https://imobench.github.io/.
Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct "plausibility gap": while Veo-3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real-world healthcare domains.
Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute corresponding actions as an embodied agent. Recent work integrates future images into the understanding-acting loop, yielding unified VLAs that jointly understand, generate, and act -- reading text and images and producing future images and actions. However, these models either rely on external experts for modality unification or treat image generation and action prediction as separate processes, limiting the benefits of direct synergy between these tasks. Our core philosophy is to optimize generation and action jointly through a synchronous denoising process, where the iterative refinement enables actions to evolve from initialization, under constant and sufficient visual guidance. We ground this philosophy in our proposed Unified Diffusion VLA and Joint Discrete Denoising Diffusion Process (JD3P), which is a joint diffusion process that integrates multiple modalities into a single denoising trajectory to serve as the key mechanism enabling understanding, generation, and acting to be intrinsically synergistic. Our model and theory are built on a unified tokenized space of all modalities and a hybrid attention mechanism. We further propose a two-stage training pipeline and several inference-time techniques that optimize performance and efficiency. Our approach achieves state-of-the-art performance on benchmarks such as CALVIN, LIBERO, and SimplerEnv with 4x faster inference than autoregressive methods, and we demonstrate its effectiveness through in-depth analysis and real-world evaluations. Our project page is available at https://irpn-eai.github.io/UD-VLA.github.io/.
The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from reasoning-driven generation paradigms. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework consisting of a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings. Our code, models, and datasets will be publicly available at https://github.com/XMUDeepLIT/UME-R1.
Graphical user interface (GUI) grounding is a key function of computer-use agents, which maps natural-language instructions to actionable screen regions. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive way to implement GUI grounding is to first select visual patches relevant to the instructions and then determine the precise click location within those patches. Based on the observation that general MLLMs have some native grounding capability nested within their attention, we propose GUI-AIMA, an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals are calculated adaptively for diverse user instructions by multi-head aggregation on simplified query-visual attention matrices. Besides, its coordinate-free manner can easily integrate a plug-and-play zoom-in stage. GUI-AIMA-3B was trained with only 85k screenshots, demonstrating exceptional data efficiency and verifying that light training can trigger the native grounding capability of MLLMs. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 58.6% on ScreenSpot-Pro and 62.2% on OSWorld-G. Project page: https://github.com/sjz5202/GUI-AIMA
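A minimal sketch of the coordinate-free, attention-based grounding idea: query-to-patch attention is aggregated across heads, the best patch is selected, and its center becomes the click point. The random attention tensor, grid size, and function name below are placeholders for an MLLM's internal attention, not GUI-AIMA's exact aggregation.

```python
# Illustrative patch selection from aggregated query-visual attention maps.
import torch

def pick_click_point(attn, grid_hw, image_hw):
    # attn: (heads, n_query_tokens, n_patches) -- simplified query-visual attention
    patch_scores = attn.mean(dim=(0, 1))                 # aggregate heads and query tokens
    best = int(patch_scores.argmax())
    gh, gw = grid_hw
    row, col = divmod(best, gw)
    ph, pw = image_hw[0] / gh, image_hw[1] / gw          # patch size in pixels
    return ((col + 0.5) * pw, (row + 0.5) * ph)          # (x, y) patch center

attn = torch.rand(16, 8, 14 * 14)    # 16 heads, 8 instruction tokens, 14x14 patch grid
x, y = pick_click_point(attn, grid_hw=(14, 14), image_hw=(1008, 1008))
print(round(x, 1), round(y, 1))      # click coordinates at the selected patch center
```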
Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall, their performance remains subpar on reasoning-intensive tasks, such as threat actor attribution and risk mitigation, with open-source models trailing even further behind. These findings highlight fundamental limitations in the reasoning capabilities of current LLMs and underscore the need for models explicitly tailored to CTI workflows and automation.
Natural Language Explanations (NLEs) describe how Large Language Models (LLMs) make decisions, drawing on both external Context Knowledge (CK) and Parametric Knowledge (PK) stored in model weights. Understanding their interaction is key to assessing the grounding of NLEs, yet it remains underexplored. Prior work has largely examined only single-step generation, typically the final answer, and has modelled PK and CK interaction only as a binary choice in a rank-1 subspace. This overlooks richer forms of interaction, such as complementary or supportive knowledge. We propose a novel rank-2 projection subspace that disentangles PK and CK contributions more accurately and use it for the first multi-step analysis of knowledge interactions across longer NLE sequences. Experiments on four QA datasets and three open-weight instruction-tuned LLMs show that diverse knowledge interactions are poorly represented in a rank-1 subspace but are effectively captured in our rank-2 formulation. Our multi-step analysis reveals that hallucinated NLEs align strongly with the PK direction, context-faithful ones balance PK and CK, and Chain-of-Thought prompting for NLEs shifts generated NLEs toward CK by reducing PK reliance. This work provides the first framework for systematic studies of multi-step knowledge interactions in LLMs through a richer rank-2 subspace disentanglement. Code and data: https://github.com/copenlu/pk-ck-knowledge-disentanglement.
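The rank-1 vs rank-2 contrast can be illustrated numerically: projecting a hidden state onto a single contrast direction yields one coefficient, whereas least-squares coefficients in a two-direction subspace recover separate PK and CK contributions. The directions below are random stand-ins, not the learned subspaces from the paper.

```python
# Toy comparison of a rank-1 contrast projection vs a rank-2 subspace decomposition.
import numpy as np

rng = np.random.default_rng(0)
d = 128
pk_dir = rng.normal(size=d)                               # parametric-knowledge direction
ck_dir = rng.normal(size=d)                               # context-knowledge direction
hidden = 0.7 * pk_dir + 0.3 * ck_dir + 0.05 * rng.normal(size=d)

# rank-1: project onto a single (PK - CK) contrast direction -> one scalar
contrast = (pk_dir - ck_dir) / np.linalg.norm(pk_dir - ck_dir)
rank1_coeff = hidden @ contrast

# rank-2: least-squares coefficients in the subspace spanned by the PK and CK directions
basis = np.stack([pk_dir, ck_dir], axis=1)                # (d, 2)
rank2_coeffs, *_ = np.linalg.lstsq(basis, hidden, rcond=None)

print(round(float(rank1_coeff), 3))          # single score hides the mixture
print(np.round(rank2_coeffs, 3))             # ~[0.7, 0.3]: separate PK and CK weights
```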
In the retrieval domain, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
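As an illustration of serializing a video candidate into a compact image grid (in the spirit of the S-Grid described above), the sketch below tiles a few uniformly sampled frames into one image that could be placed in a VLM prompt alongside retriever metadata; frame loading is stubbed with synthetic PIL images, and the layout parameters are assumptions.

```python
# Minimal sketch of tiling sampled video frames into a single grid image for a VLM prompt.
from PIL import Image

def make_grid(frames, cols=3, tile=(224, 224)):
    rows = (len(frames) + cols - 1) // cols
    grid = Image.new("RGB", (cols * tile[0], rows * tile[1]))
    for i, frame in enumerate(frames):
        r, c = divmod(i, cols)
        grid.paste(frame.resize(tile), (c * tile[0], r * tile[1]))
    return grid

# stand-in frames; in practice these would be decoded uniformly from the candidate video
frames = [Image.new("RGB", (640, 360), color=(40 * i, 80, 120)) for i in range(6)]
grid = make_grid(frames)
print(grid.size)   # (672, 448): a 3x2 grid of 224x224 tiles
grid.save("candidate_grid.jpg")
```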