Daily curated AI research papers and translations
Reasoning is the fundamental cognitive process underpinning inference, problem solving, and decision making. While large language models (LLMs) exhibit strong reasoning in closed settings, they still struggle in open, dynamic environments. Agentic reasoning marks a paradigm shift by recasting LLMs as autonomous agents that plan, act, and learn through continual interaction. This survey organizes agentic-reasoning research along three complementary dimensions. First, a three-level architecture characterizes environmental dynamics: foundational agentic reasoning establishes core single-agent capabilities in stable environments, including planning, tool use, and search; self-evolving agentic reasoning studies how agents refine these capabilities through feedback, memory, and adaptation; and collective multi-agent reasoning extends intelligence to collaborative settings involving cooperation, knowledge sharing, and shared goals. Within each level, we distinguish in-context reasoning, which scales test-time interaction through structured orchestration, from post-training reasoning, which optimizes behavior via reinforcement learning and supervised fine-tuning. We further review representative agentic-reasoning frameworks and benchmarks across real-world application domains, including science, robotics, healthcare, autonomous research, and mathematics. The survey consolidates agentic-reasoning methods into a unified roadmap connecting thought and action, and identifies open challenges and future directions, including personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance mechanisms for real-world deployment.
Deep research agents (DRAs) generate richly cited reports through multi-step retrieval and synthesis, yet existing benchmarks target text-only settings or short-form multimodal QA and fail to cover end-to-end multimodal evidence use. We present MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-designed tasks spanning 21 domains, each providing an image-text bundle to assess multimodal understanding and citation-grounded report generation. Compared with prior frameworks, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, requiring models to link visual artifacts to source claims and to keep narrative, citations, and visual references consistent. We further design a unified, interpretable evaluation pipeline: a formula-LLM adaptive evaluation (FLAE) of report quality, a trusted retrieval calibration evaluation (TRACE) of citation-anchored evidence alignment, and a multimodal support calibration integrity check (MOSAIC) of image-text consistency. Each component yields fine-grained signals that support error diagnosis beyond a single aggregate score. Experiments on 25 frontier models reveal systematic trade-offs among generation quality, citation compliance, and multimodal grounding, showing that fluent text generation does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
Video generation models have substantially advanced embodied intelligence, opening new possibilities for generating diverse robot data that captures perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that faithfully reflect real-world robot interactions remains challenging, and the lack of standardized benchmarks limits fair comparison and research progress. To address this, we present RBench, a comprehensive robotics benchmark that evaluates robot-oriented video generation across five task domains and four distinct robot embodiments. The benchmark assesses both task-level correctness and visual realism through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluating 25 representative models reveals clear deficiencies in generating physically realistic robot behaviors. Moreover, RBench achieves a Spearman correlation of 0.96 with human evaluation, validating its effectiveness. While RBench provides the lens needed to expose these flaws, achieving physical realism requires going beyond evaluation to address the severe shortage of high-quality training data. Motivated by this insight, we propose a refined four-stage data pipeline to construct RoVid-X, the largest open-source dataset for robot video generation to date, comprising 4 million annotated video clips covering thousands of tasks with comprehensive physical-property annotations. Together, this evaluation-and-data ecosystem lays a solid foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
Writing an effective rebuttal is a high-stakes task whose demands go well beyond linguistic fluency, as it requires precisely aligning reviewer intent with paper details. Existing solutions typically treat it as direct text generation and suffer from hallucination, missed criticisms, and a lack of verifiable grounding. To address these limitations, we propose RebuttalAgent, the first multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. Our system decomposes complex feedback into atomic concerns, dynamically builds a hybrid context that fuses compressed summaries with high-fidelity text, and integrates an autonomous, on-demand external search module for concerns that require citing external literature. By producing a reviewable response plan before drafting, RebuttalAgent ensures that every argument is explicitly anchored to internal or external evidence. We validate the approach on the newly constructed RebuttalBench, showing that our pipeline outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable aid for peer review. Code will be released publicly.
Reinforcement learning (RL) is central to post-training, especially for agentic models that require specialized reasoning behaviors. In this setting, model merging offers a practical mechanism for consolidating multiple RL-trained agents from different tasks into a single general-purpose model. However, existing merging methods are designed for supervised fine-tuning (SFT) and perform poorly at preserving the task-specific capabilities of RL-trained agentic models. The root cause is a task-vector mismatch between RL and SFT: on-policy RL produces task vectors that are highly sparse and heterogeneous, whereas SFT-style merging implicitly assumes dense, globally comparable task vectors. Under this mismatch, standard global averaging attenuates the non-overlapping task vectors that encode critical task-specific behaviors in RL, diluting parameter updates. To address this, we propose Reinforced Agent Merging (RAM), a distribution-aware merging framework designed for RL-trained agentic models. RAM separates shared parameter updates from task-specific unique updates, averaging the shared components while selectively preserving and rescaling the unique ones to counteract update dilution. Experiments across multiple agentic domains and model architectures show that RAM not only surpasses existing merging baselines but also unlocks synergy among agents, outperforming the individual domain-specialized agents.
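A toy sketch of the distribution-aware idea described above, not the paper's exact algorithm: because RL task vectors are sparse, coordinates updated by several tasks are averaged while coordinates touched by only one task are kept and rescaled. The function name and the rescale factor `alpha` are illustrative assumptions.

```python
import numpy as np

def merge_task_vectors(base, task_vectors, alpha=1.0):
    """Merge sparse task vectors: average shared coords, preserve unique ones."""
    deltas = [tv - base for tv in task_vectors]
    nonzero = np.stack([np.abs(d) > 1e-12 for d in deltas])  # sparsity masks
    count = nonzero.sum(axis=0)            # how many tasks touch each coordinate
    shared, unique = count > 1, count == 1
    stacked_sum = np.stack(deltas).sum(axis=0)
    merged = np.zeros_like(base)
    # shared coordinates: plain average over the tasks that touch them
    merged[shared] = stacked_sum[shared] / count[shared]
    # unique coordinates: keep the single task's update, rescaled by alpha
    merged[unique] = alpha * stacked_sum[unique]
    return base + merged

base = np.zeros(4)
t1 = np.array([1.0, 0.0, 2.0, 0.0])  # sparse task vector from task 1
t2 = np.array([3.0, 4.0, 0.0, 0.0])  # sparse task vector from task 2
merged = merge_task_vectors(base, [t1, t2])
print(merged)  # coord 0 is shared (averaged); coords 1-2 are unique (kept)
```

Global averaging would instead halve the unique coordinates, which is precisely the dilution effect the abstract describes.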
GutenOCR is a family of grounded OCR front-end models fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B. These single-checkpoint vision-language models expose reading, detection, and localization through a unified prompt-based interface. Trained on business documents, scientific literature, and synthetic grounding data, the models support full-page and localized reading, provide line- and paragraph-level bounding boxes, and answer conditional "Where is X?" queries. We propose a grounded OCR evaluation protocol and show that, on 10.5K held-out business and scientific documents, GutenOCR-7B more than doubles the composite grounded-OCR score of its base model Qwen2.5-VL-7B (from 0.40 to 0.82). On the Fox and OmniDocBench v1.5 benchmarks, the approach substantially improves region- and line-level OCR and text-detection recall, while revealing performance trade-offs in page linearization, color-guided OCR, and formula-dense layouts.
Chain-of-thought (CoT) prompting has achieved remarkable success in unlocking the reasoning abilities of large language models. Although CoT prompting improves reasoning performance, its verbosity incurs substantial computational overhead. Existing work also tends to focus solely on outcome alignment, lacking supervision of the intermediate reasoning process, which leaves latent reasoning chains opaque to analysis. To address these challenges, we propose Rendering of Thought (RoT), the first framework that materializes reasoning chains by rendering textual reasoning steps as images, making latent logic explicit and traceable. Specifically, we use the visual encoder of existing vision-language models as a semantic anchor to align visual embeddings with the text space. This design enables a plug-and-play implementation with no additional pre-training cost. Extensive experiments on mathematical and logical reasoning benchmarks show that, compared with explicit CoT, our method achieves 3-4x token compression and significant inference speedups while remaining competitive with other methods, validating the feasibility of this paradigm. Code is available at: https://github.com/TencentBAC/RoT
Document extraction is central to digital workflows, yet existing vision-language models are heavily biased toward high-resource languages. Thai is poorly served by current open-source models due to its complex non-Latin script, the absence of explicit word boundaries, and the prevalence of highly unstructured real-world documents. This paper presents Typhoon OCR, an open vision-language document-extraction model built for Thai and English. The model is fine-tuned from a vision-language backbone on a Thai-centric training dataset constructed through a multi-stage pipeline that combines traditional OCR, VLM-based reconstruction, and curated synthetic data. Typhoon OCR serves as a unified framework for text transcription, layout reconstruction, and document-level structural consistency. Our latest release, Typhoon OCR V1.5, is a lightweight, inference-efficient model designed to reduce reliance on metadata and simplify deployment. Comprehensive evaluation across diverse Thai documents, including financial statements, government forms, books, infographics, and handwritten documents, shows that Typhoon OCR matches or surpasses large proprietary frontier models at a significantly lower computational cost. The results demonstrate that open vision-language OCR models can deliver accurate text extraction and layout reconstruction for Thai, rivaling proprietary systems while remaining lightweight and deployable.
Large encoder-decoder models such as Whisper enable strong offline transcription but remain impractical for streaming applications due to high latency. Although pre-trained models are readily available, Thai automatic speech recognition (ASR) is still dominated by such offline architectures, leaving a critical gap in efficient streaming solutions. We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model designed for low-latency Thai ASR. We show that rigorous text normalization can match the effect of model scaling: compared with Whisper Large-v3, our small model achieves comparable accuracy at a 45x reduction in compute cost. Our normalization pipeline resolves systematic ambiguities in Thai transcription, including context-dependent number verbalization and the repetition mark "ไม้ยมก", establishing consistent training targets. We also propose a two-stage curriculum for adapting to Isan (the northeastern dialect), achieving dialect transfer while preserving Central Thai performance. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-annotated dataset whose transcriptions follow established Thai linguistic conventions, providing the research community with a standardized evaluation framework.
Agentic systems have recently become the dominant paradigm for formal theorem proving, achieving strong performance by coordinating multiple models and tools. However, existing approaches often rely on task-specific pipelines and trained formal provers, limiting their flexibility and reproducibility. In this paper, we propose a paradigm that directly uses a general coding agent as a formal math reasoner. This paradigm is motivated by three observations: (1) a general coding agent provides a natural interface for diverse reasoning tasks beyond proving; (2) performance can be improved simply by replacing the underlying base model, without training; and (3) MCP enables flexible extension and autonomous invocation of specialized tools, avoiding complex pipeline design. Based on this paradigm, we introduce Numina-Lean-Agent, which combines Claude Code with Numina-Lean-MCP to enable autonomous interaction with Lean, retrieval of relevant theorems, informal proving, and auxiliary reasoning tools. Using Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all problems in Putnam 2025 (12/12), matching the best closed-source system. Beyond benchmark evaluation, we further demonstrate its generality by interacting with mathematicians to successfully formalize the Brascamp-Lieb theorem. We release Numina-Lean-Agent and all solutions at https://github.com/project-numina/numina-lean-agent.
LLM-based financial agents are increasingly applied to investment analysis, risk assessment, and automated decision making. These agents can plan, invoke tools, and operate on mutable state, but in high-stakes, heavily regulated financial environments these capabilities also introduce new safety risks. Existing safety evaluations, however, focus mainly on content compliance at the language-model level or on abstract agent scenarios, failing to capture execution-level risks that arise in realistic workflows and state-changing operations. To fill this gap, we present FinVault, the first execution-safety benchmark for financial agents. FinVault comprises 31 sandboxed scenarios grounded in regulatory cases (equipped with writable state databases and explicit compliance constraints), 107 realistic vulnerability types, and 963 test cases, systematically covering prompt injection, jailbreak attacks, finance-adapted attacks, and benign inputs for false-positive evaluation. Experiments show that existing defenses remain inadequate in realistic financial agent environments: state-of-the-art models suffer an average attack success rate of 50.0%, and even the most robust system retains a non-negligible rate (6.7%), highlighting the limited transferability of current safety measures and the urgent need for stronger finance-specific defenses. Code is available at https://github.com/aifinlab/FinVault.
Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma and https://huggingface.co/FlashLabs/Chroma-4B .
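The 1:2 interleaved text-audio token schedule mentioned above can be sketched as follows. This is a minimal illustration of the interleaving idea, not Chroma's actual tokenizer or vocabulary: each text token is followed by two audio tokens, so audio can begin streaming while text is still being generated.

```python
def interleave_1_to_2(text_tokens, audio_tokens):
    """Merge two token streams so each text token is followed by two
    audio tokens (a 1:2 schedule). Token values are placeholders."""
    assert len(audio_tokens) == 2 * len(text_tokens)
    out = []
    for i, t in enumerate(text_tokens):
        out.append(t)                              # one text token...
        out.extend(audio_tokens[2 * i : 2 * i + 2])  # ...then two audio tokens
    return out

stream = interleave_1_to_2(["T0", "T1"], ["A0", "A1", "A2", "A3"])
print(stream)  # ['T0', 'A0', 'A1', 'T1', 'A2', 'A3']
```

Because audio tokens appear throughout the sequence rather than after all text, the decoder can emit playable audio with sub-sequence latency, which is the property the abstract attributes to this schedule.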
Agentic AI is redefining retrieval, demanding multimodal reasoning beyond the traditional similarity-based paradigm. Composed image retrieval (CIR) exemplifies this shift: each query combines a reference image with a textual modification instruction, requiring compositional cross-modal understanding. Although embedding-based CIR methods have advanced, their perspective remains narrow, capturing only limited cross-modal cues and lacking semantic reasoning. To overcome these limitations, we propose XR, a training-free multi-agent system that recasts retrieval as a progressive collaborative reasoning process. The system coordinates three specialized agents: an imagination agent that synthesizes target representations via cross-modal generation, a similarity agent that performs coarse filtering through hybrid matching, and a questioning agent that verifies factual consistency through targeted reasoning for fine-grained filtering. Through progressive multi-agent collaboration, XR iteratively refines retrieval results to satisfy both the semantic and visual constraints of the query, improving over strong training-free and trained baselines by up to 38% on FashionIQ, CIRR, and CIRCO, with ablations confirming that each agent is indispensable. Code is available at: https://01yzzyu.github.io/xr.github.io/.
We introduce RoboBrain 2.5, a next-generation embodied AI foundation model that advances general perception, spatial reasoning, and temporal modeling through extensive training on high-quality spatiotemporal supervision. Building upon its predecessor, RoboBrain 2.5 introduces two major capability upgrades. Specifically, it unlocks Precise 3D Spatial Reasoning by shifting from 2D pixel-relative grounding to depth-aware coordinate prediction and absolute metric constraint comprehension, generating complete 3D manipulation traces as ordered keypoint sequences under physical constraints. Complementing this spatial precision, the model establishes Dense Temporal Value Estimation, which provides dense, step-aware progress prediction and execution-state understanding across varying viewpoints, producing stable feedback signals for downstream learning. Together, these upgrades extend the framework toward more physically grounded and execution-aware embodied intelligence for complex, fine-grained manipulation. The code and checkpoints are available on the project website: https://superrobobrain.github.io
We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a ``silent failure'' because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.
Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
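One plausible formalization of the phoneme shift rate (PSR) described above (the paper's exact definition may differ): among phonemes that an accent rule would change, PSR is the fraction whose synthesized realization follows the rule's output rather than the base form. The function name and the aligned-sequence representation are illustrative assumptions.

```python
def phoneme_shift_rate(base, ruled, realized):
    """base, ruled, realized: aligned phoneme sequences.
    Only positions the rule actually changes count toward PSR."""
    targeted = [(r, z) for b, r, z in zip(base, ruled, realized) if r != b]
    if not targeted:
        return 0.0
    # a position "shifts" when the realization matches the rule's output
    shifted = sum(1 for r, z in targeted if z == r)
    return shifted / len(targeted)

# Flapping example: American English realizes intervocalic /t/ as a flap [ɾ].
base     = ["b", "ʌ", "t", "ɚ"]   # "butter", unflapped base form
ruled    = ["b", "ʌ", "ɾ", "ɚ"]   # after applying the flapping rule
realized = ["b", "ʌ", "ɾ", "ɚ"]   # what the TTS actually produced
print(phoneme_shift_rate(base, ruled, realized))  # 1.0: rule preserved
```

Under this reading, a PSR near 1 means the embedding preserved the rule-based transformation, while a low PSR means the embedding attenuated or overrode it, which matches the entanglement the abstract reports.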
Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representations (INRs), which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at https://github.com/tiktok/huvr.
Large Language Models have demonstrated profound utility in the medical domain. However, their application to autonomous Electronic Health Records (EHRs) navigation remains constrained by a reliance on curated inputs and simplified retrieval tasks. To bridge the gap between idealized experimental settings and realistic clinical environments, we present AgentEHR. This benchmark challenges agents to execute complex decision-making tasks, such as diagnosis and treatment planning, requiring long-range interactive reasoning directly within raw and high-noise databases. In tackling these tasks, we identify that existing summarization methods inevitably suffer from critical information loss and fractured reasoning continuity. To address this, we propose RetroSum, a novel framework that unifies a retrospective summarization mechanism with an evolving experience strategy. By dynamically re-evaluating interaction history, the retrospective mechanism prevents long-context information loss and ensures unbroken logical coherence. Additionally, the evolving strategy bridges the domain gap by retrieving accumulated experience from a memory bank. Extensive empirical evaluations demonstrate that RetroSum achieves performance gains of up to 29.16% over competitive baselines, while significantly decreasing total interaction errors by up to 92.3%.
Large language models exhibit striking sensitivity to prompt structure, yet the underlying mechanisms remain poorly understood. We investigate a canonical case in depth: in multiple-choice question answering, placing the context before the question and options (CQO) consistently outperforms the reverse ordering (QOC) by more than 14 percentage points, a phenomenon that holds across models and datasets. Through systematic architectural analysis, we identify causal attention as the root cause: in QOC prompts, the causal mask prevents option tokens from attending to the context, creating an information bottleneck in which the context is invisible to the options.
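The bottleneck described above can be illustrated directly with a lower-triangular attention mask: token i may attend only to positions j <= i, so whether option tokens can "see" the context is determined purely by segment order. Segment sizes below are made up for illustration.

```python
import numpy as np

def causal_visibility(order, sizes):
    """For each segment pair (src, dst), report whether any src token can
    attend to any dst token under a lower-triangular (causal) mask."""
    n = sum(sizes[s] for s in order)
    mask = np.tril(np.ones((n, n), dtype=bool))  # mask[i, j]: i attends to j
    spans, start = {}, 0
    for seg in order:
        spans[seg] = slice(start, start + sizes[seg])
        start += sizes[seg]
    return {(s, d): bool(mask[spans[s], spans[d]].any())
            for s in order for d in order}

sizes = {"C": 5, "Q": 3, "O": 4}  # context, question, options (toy lengths)
cqo = causal_visibility(["C", "Q", "O"], sizes)
qoc = causal_visibility(["Q", "O", "C"], sizes)
print(cqo[("O", "C")])  # True: under CQO, option tokens attend to the context
print(qoc[("O", "C")])  # False: under QOC, the causal mask blocks it
```

Because attention only looks backwards, moving the context after the options removes every option-to-context attention edge, which is exactly the "context invisible to options" bottleneck the abstract identifies.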
This work advances autonomous robot exploration by fusing agent-level semantic reasoning with fast local control. We propose FARE, a hierarchical autonomous exploration framework that couples a large language model (LLM) for global reasoning with a reinforcement learning (RL) policy for local decision making. FARE follows a fast-slow thinking paradigm: the slow-thinking LLM module parses a concise textual description of the unknown environment, generates an agent-level exploration strategy, and grounds it as a sequence of global waypoints via a topological graph; the module also applies modularity-based pruning to reduce redundant graph structure and improve reasoning efficiency. Guided by the LLM-generated global waypoints, the fast-thinking RL module executes exploration from local observations, with a reward term that reinforces waypoint adherence, yielding coherent and robust closed-loop behavior. This architecture decouples semantic reasoning from geometric decision making, letting each module operate at its appropriate spatiotemporal scale. In challenging simulated environments, experiments show that FARE significantly improves exploration efficiency over state-of-the-art baselines. We further deploy FARE on hardware, validating its effectiveness in a large, complex 200 m x 130 m building environment.
Modern CI/CD pipelines integrating agent-generated code exhibit a structural failure in responsibility attribution. Decisions are executed through formally correct approval processes, yet no entity possesses both the authority to approve those decisions and the epistemic capacity to meaningfully understand their basis. We define this condition as responsibility vacuum: a state in which decisions occur, but responsibility cannot be attributed because authority and verification capacity do not coincide. We show that this is not a process deviation or technical defect, but a structural property of deployments where decision generation throughput exceeds bounded human verification capacity. We identify a scaling limit under standard deployment assumptions, including parallel agent generation, CI-based validation, and individualized human approval gates. Beyond a throughput threshold, verification ceases to function as a decision criterion and is replaced by ritualized approval based on proxy signals. Personalized responsibility becomes structurally unattainable in this regime. We further characterize a CI amplification dynamic, whereby increasing automated validation coverage raises proxy signal density without restoring human capacity. Under fixed time and attention constraints, this accelerates cognitive offloading in the broad sense and widens the gap between formal approval and epistemic understanding. Additional automation therefore amplifies, rather than mitigates, the responsibility vacuum. We conclude that unless organizations explicitly redesign decision boundaries or reassign responsibility away from individual decisions toward batch- or system-level ownership, responsibility vacuum remains an invisible but persistent failure mode in scaled agent deployments.
The Korteweg-de Vries (KdV) equation, a foundational model of nonlinear wave physics, describes the balance between dispersive spreading and nonlinear steepening that gives rise to solitons. This paper introduces sangkuriang, an open-source Python library that solves the equation using Fourier pseudospectral spatial discretization combined with adaptive high-order time integration. The implementation uses just-in-time compilation for computational efficiency while remaining accessible for teaching. Validation covers scenarios of increasing complexity, including single-soliton propagation, symmetric two-wave configurations, overtaking collisions between waves of different amplitude, and three-body interactions. Conservation of the classical invariants is monitored throughout the simulations, with deviations remaining small across all test cases. Measured soliton speeds closely match theoretical predictions from the amplitude-speed relation characteristic of integrable systems. Complementary diagnostics combining information theory and recurrence analysis confirm that the computed solutions retain the regular phase-space structure expected of fully integrable dynamics. The solver writes results in standard scientific data formats compatible with common analysis tools and can render visualizations of spatiotemporal wave evolution. Balancing numerical accuracy with practical usability on modest computational resources, sangkuriang provides a suitable platform for classroom demonstrations of nonlinear wave phenomena and exploratory studies of soliton dynamics.
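A minimal Fourier pseudospectral KdV integrator in the spirit of the solver described above. This is a sketch, not sangkuriang's implementation: it uses fixed-step RK4 rather than adaptive high-order stepping and omits JIT compilation. Spatial derivatives are computed spectrally, and conservation of the first invariant (the integral of u) serves as a quick sanity check.

```python
import numpy as np

# KdV: u_t + 6 u u_x + u_xxx = 0 on a periodic domain of length L.
N, L = 128, 40.0
x = np.linspace(0, L, N, endpoint=False)
k = 2j * np.pi * np.fft.fftfreq(N, d=L / N)  # spectral derivative factors

def rhs(u):
    """Right-hand side -6 u u_x - u_xxx via FFT-based derivatives."""
    u_hat = np.fft.fft(u)
    u_x = np.real(np.fft.ifft(k * u_hat))
    u_xxx = np.real(np.fft.ifft(k ** 3 * u_hat))
    return -6.0 * u * u_x - u_xxx

def rk4_step(u, dt):
    k1 = rhs(u)
    k2 = rhs(u + 0.5 * dt * k1)
    k3 = rhs(u + 0.5 * dt * k2)
    k4 = rhs(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Single soliton of speed c: u = (c/2) sech^2(sqrt(c)/2 * (x - x0)).
c, x0 = 1.0, L / 4
u = 0.5 * c / np.cosh(0.5 * np.sqrt(c) * (x - x0)) ** 2
mass0 = u.sum() * (L / N)          # invariant: integral of u over the domain
dt = 5e-4                          # small enough for RK4 with k_max^3 stiffness
for _ in range(200):
    u = rk4_step(u, dt)
print(abs(u.sum() * (L / N) - mass0))  # mass drift stays near machine precision
```

The time step must resolve the dispersive term's stiffness (scaling as the cube of the largest wavenumber), which is why production solvers like the one described use adaptive integration instead of a fixed dt.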
Web AI agents such as ChatGPT Agent and GenSpark are increasingly used for routine web tasks, yet they still rely on text-based prompts, lack proactive detection of user intent, and do not support interactive data analysis and decision making. We introduce WebSeek, a mixed-initiative browser extension that lets users discover and extract information from web pages and then flexibly construct, transform, and refine concrete data artifacts (such as tables, lists, and visualizations) on an interactive canvas. Within this environment, users can perform analytical operations, including data transformations such as joining tables or creating visualizations, while the built-in AI both proactively offers context-aware guidance and automation and responds to explicit user requests. An exploratory user study (N=15) using WebSeek as a probe revealed participants' diverse analytical strategies and highlighted their need for transparency and control in human-AI collaboration.
Although much research has examined how AI explanations support decisions in complex information-retrieval tasks such as fact checking, the role of evidence has received little attention. In this study, we systematically varied the explanation type, the AI's stated prediction certainty, and the correctness of the system's advice for lay participants, who were asked to assess the veracity of claims and of the AI system's predictions. The setup gave participants easy access to the underlying evidence at any time. We found that participants consistently relied on evidence to verify AI claims across all conditions. When natural-language explanations were provided, evidence use declined, yet participants still turned to evidence when explanations seemed insufficient or flawed. Qualitative data show that participants tried to infer the reliability of evidence sources even though source identities were deliberately withheld. The results confirm that evidence is a key element in how people assess the reliability of information from AI systems and, combined with natural-language explanations, provides important decision support. Further research is needed on how evidence should be presented and how people use it in practice.
We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. Project page is available at https://motion3-to-4.github.io/.
While large language models (LLMs) have been shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning applications, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset of open-ended reasoning queries with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization to jointly improve logical correctness and language stability. Across thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs. The code and dataset are available at https://cure-med.github.io/