AI research papers selected daily, with translations
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite considerable enthusiasm, a growing number of negative results on downstream tasks calls into question whether SAEs recover meaningful features. To investigate this directly, we perform two complementary evaluations. In a synthetic setting with known ground-truth features, we show that SAEs recover only 9% of the true features despite reaching 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully trained SAEs on interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Taken together, these results suggest that SAEs in their current form do not reliably decompose models' internal mechanisms.
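To make the random-direction baseline concrete, here is a minimal PyTorch sketch (all names and sizes are illustrative, not the paper's implementation): the decoder directions are frozen random unit vectors, so only the encoder and per-feature scales are trained, and any interpretability or probing score such a model reaches cannot be credited to learned feature directions.

```python
import torch
import torch.nn as nn

class RandomDirectionSAE(nn.Module):
    """Baseline SAE: decoder directions fixed to random unit vectors.

    Only the encoder and per-feature scales are trained, so reconstruction
    quality cannot be attributed to learned feature directions.
    (Illustrative sketch; sizes and naming are assumptions.)
    """

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        # Frozen random decoder directions, normalized to unit norm.
        directions = torch.randn(n_features, d_model)
        directions = directions / directions.norm(dim=-1, keepdim=True)
        self.register_buffer("decoder_directions", directions)
        # Learnable per-feature scale so magnitudes can still adapt.
        self.feature_scale = nn.Parameter(torch.ones(n_features))
        self.decoder_bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, activations: torch.Tensor):
        codes = torch.relu(self.encoder(activations))           # sparse codes
        scaled = codes * self.feature_scale                      # per-feature scale
        recon = scaled @ self.decoder_directions + self.decoder_bias
        return recon, codes

# Usage: train with the usual SAE objective (MSE + L1 sparsity) and compare
# interpretability / probing scores against a fully trained SAE.
sae = RandomDirectionSAE(d_model=768, n_features=4096)
x = torch.randn(4, 768)
recon, codes = sae(x)
loss = torch.nn.functional.mse_loss(recon, x) + 1e-3 * codes.abs().mean()
```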
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise the average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (from +4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
We present GLM-5, a next-generation foundation model designed to transition the paradigm from vibe coding to agentic engineering. Building on the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that dramatically improves post-training efficiency by decoupling generation from training. In addition, we propose new asynchronous agentic RL algorithms that further improve RL quality, allowing the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. More critically, GLM-5 demonstrates unprecedented capability on real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.
As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Recently, Moltbook has come to approximate a plausible future scenario in which autonomous agents participate in an open-ended, continuously evolving online society. We present the first large-scale systemic diagnosis of this AI agent society. Beyond static observation, we introduce a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Our analysis reveals a system in dynamic balance in Moltbook: while global semantic averages stabilize rapidly, individual agents retain high diversity and persistent lexical turnover, defying homogenization. However, agents exhibit strong individual inertia and minimal adaptive response to interaction partners, preventing mutual influence and consensus. Consequently, influence remains transient with no persistent supernodes, and the society fails to develop stable collective influence anchors due to the absence of shared social memory. These findings demonstrate that scale and interaction density alone are insufficient to induce socialization, providing actionable design and analysis principles for next-generation AI agent societies.
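Two of the diagnostics can be illustrated with a short sketch; the definitions below (vocabulary turnover between consecutive time windows and cosine drift of the population-mean embedding) are plausible stand-ins for the paper's metrics, not its exact formulas.

```python
import numpy as np

def lexical_turnover(prev_vocab: set, curr_vocab: set) -> float:
    """Fraction of the current window's vocabulary not seen in the previous
    window. Illustrative definition, not the paper's exact metric."""
    if not curr_vocab:
        return 0.0
    return 1.0 - len(prev_vocab & curr_vocab) / len(curr_vocab)

def semantic_stabilization(mean_embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity of consecutive population-mean embeddings; values
    approaching 1.0 indicate the global semantic average has stabilized."""
    a, b = mean_embeddings[:-1], mean_embeddings[1:]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

# Toy usage with per-window data.
windows = [{"agents", "post", "reply"}, {"agents", "vote", "thread"}]
print(lexical_turnover(windows[0], windows[1]))      # turnover between windows
means = np.random.randn(5, 64).cumsum(axis=0)        # fake drifting window means
print(semantic_stabilization(means))
```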
We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability--reliability gap. The agent surpasses the repository's provided baselines in just 1 of 15 evaluations (6.7%), by a margin of 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.
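A minimal sketch of a combined objective of this kind, assuming a cosine-distance distillation term toward a teacher embedding plus an in-batch InfoNCE contrastive term; the mixing weight, temperature, and loss forms are illustrative, not the actual jina-embeddings-v5 recipe.

```python
import torch
import torch.nn.functional as F

def combined_loss(student_emb, teacher_emb, pos_emb, temperature=0.05, alpha=0.5):
    """Combine embedding distillation with an in-batch contrastive (InfoNCE) loss.

    student_emb: (B, d) student embeddings of the queries
    teacher_emb: (B, d) teacher embeddings of the same queries (distillation target)
    pos_emb:     (B, d) student embeddings of the positive passages
    """
    # Distillation term: pull student embeddings toward the teacher's.
    distill = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()

    # In-batch contrastive term: each query's positive is the matching row.
    q = F.normalize(student_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    contrastive = F.cross_entropy(logits, labels)

    return alpha * distill + (1.0 - alpha) * contrastive

# Toy usage
B, d = 8, 256
loss = combined_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d))
```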
The Platonic Representation Hypothesis suggests that representations from neural networks are converging to a common statistical model of reality. We show that the existing metrics used to measure representational similarity are confounded by network scale: increasing model depth or width can systematically inflate representational similarity scores. To correct these effects, we introduce a permutation-based null-calibration framework that transforms any representational similarity metric into a calibrated score with statistical guarantees. We revisit the Platonic Representation Hypothesis with our calibration framework, which reveals a nuanced picture: the apparent convergence reported by global spectral measures largely disappears after calibration, while local neighborhood similarity, but not local distances, retains significant agreement across different modalities. Based on these findings, we propose the Aristotelian Representation Hypothesis: representations in neural networks are converging to shared local neighborhood relationships.
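A minimal sketch of the calibration idea with linear CKA as the example metric: permuting the row correspondence between the two representation matrices preserves each network's internal geometry while destroying the shared-input structure, yielding a null distribution against which the observed score can be reported. The z-score/p-value summary below is one illustrative calibration, not necessarily the paper's exact transform.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices with matched rows."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    xty = X.T @ Y
    return (xty ** 2).sum() / (np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y))

def calibrated_similarity(X, Y, metric=linear_cka, n_perm=1000, seed=0):
    """Permutation-based null calibration of a representational similarity metric.

    Rows of Y are permuted to break the input correspondence while preserving
    each network's internal geometry; the observed score is summarized as a
    z-score and permutation p-value against that null.
    """
    rng = np.random.default_rng(seed)
    observed = metric(X, Y)
    null = np.array([metric(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)])
    z = (observed - null.mean()) / (null.std() + 1e-12)
    p = (1 + (null >= observed).sum()) / (n_perm + 1)
    return observed, z, p

# Toy usage: two "networks" seeing the same 500 inputs.
X = np.random.randn(500, 128)
Y = X @ np.random.randn(128, 96) + 0.1 * np.random.randn(500, 96)
print(calibrated_similarity(X, Y, n_perm=200))
```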
Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). However, enforcing a single shared subspace can degrade accuracy even at moderate compression. Sparse dictionary learning provides a more flexible union-of-subspaces representation, but existing approaches often rely on costly iterative dictionary and coefficient updates. We propose COMPOT (Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers), a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. COMPOT employs orthogonal dictionaries that enable closed-form Procrustes updates for the dictionary and analytical single-step sparse coding for the coefficients, eliminating iterative optimization. To handle heterogeneous layer sensitivity under a global compression budget, COMPOT further introduces a one-shot dynamic allocation strategy that adaptively redistributes layer-wise compression rates. Extensive experiments across diverse architectures and tasks show that COMPOT consistently delivers a superior quality-compression trade-off over strong low-rank and sparse baselines, while remaining fully compatible with post-training quantization for extreme compression. Code is available at https://github.com/mts-ai/COMPOT.
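The two closed-form steps can be sketched in a few lines of NumPy: an orthogonal Procrustes update for the dictionary and a hard-thresholded projection for the coefficients. Calibration weighting and the dynamic rate allocation are omitted here; the sizes and sparsity level are illustrative, not COMPOT's actual recipe.

```python
import numpy as np

def procrustes_dictionary(W, C):
    """Closed-form update of an orthogonal dictionary D minimizing ||W - D C||_F.

    Orthogonal Procrustes: with W C^T = U S V^T, the minimizer over matrices
    with orthonormal columns is D = U V^T.
    """
    U, _, Vt = np.linalg.svd(W @ C.T, full_matrices=False)
    return U @ Vt

def sparse_codes(D, W, k):
    """Single-step sparse coding for an orthogonal dictionary: project onto the
    dictionary and keep the k largest-magnitude coefficients per column."""
    C = D.T @ W
    idx = np.argpartition(np.abs(C), -k, axis=0)[:-k]   # indices of small entries
    np.put_along_axis(C, idx, 0.0, axis=0)
    return C

# Sketch of one factorization pass on a single weight matrix.
d_out, d_in, n_atoms, k = 64, 128, 64, 8
W = np.random.randn(d_out, d_in)                        # weight matrix to compress
D = np.linalg.qr(np.random.randn(d_out, n_atoms))[0]    # orthonormal init
C = sparse_codes(D, W, k)
D = procrustes_dictionary(W, C)
C = sparse_codes(D, W, k)
print("relative error:", np.linalg.norm(W - D @ C) / np.linalg.norm(W))
```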
Current research in multimodal models faces a key challenge: enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify its likely primary cause as a conflict between generation and understanding objectives, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework, which re-frames single-step generation as a multi-step "generate-understand-regenerate" process. By explicitly leveraging the model's understanding capability during generation, we mitigate the optimization dilemma, achieving stronger generation results and improved understanding on tasks related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
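A hedged sketch of a Magma-style step: masked RMSProp in which the per-coordinate keep probability is nudged by the sign agreement between the momentum buffer and the current gradient. The exact modulation rule and hyperparameters are assumptions for illustration, not the paper's algorithm.

```python
import torch

@torch.no_grad()
def magma_like_step(param, grad, state, lr=1e-3, beta=0.99, mu=0.9,
                    mask_prob=0.5, eps=1e-8):
    """One illustrative masked-RMSProp update modulated by momentum-gradient
    alignment (an assumption for illustration, not the paper's algorithm)."""
    state.setdefault("v", torch.zeros_like(param))   # RMSProp second moment
    state.setdefault("m", torch.zeros_like(param))   # momentum buffer
    state["v"].mul_(beta).addcmul_(grad, grad, value=1 - beta)
    state["m"].mul_(mu).add_(grad, alpha=1 - mu)

    # Alignment in [0, 1]: 1 where momentum and gradient agree in sign.
    align = (torch.sign(state["m"]) * torch.sign(grad) + 1) / 2
    # Keep more coordinates where momentum and gradient agree.
    keep_prob = mask_prob * (0.5 + 0.5 * align)
    mask = torch.bernoulli(keep_prob)

    update = grad / (state["v"].sqrt() + eps)
    param.add_(-lr * mask * update)

# Toy usage on a single parameter tensor.
p, st = torch.randn(10), {}
for _ in range(3):
    magma_like_step(p, torch.randn_like(p), st)
```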
Large Language Models (LLMs) are changing the coding paradigm, giving rise to what is known as vibe coding, yet synthesizing algorithmically sophisticated and robust code remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability: less capable models achieve greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test-time compute (the LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human-like non-parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present Panini, which realizes this by representing documents as Generative Semantic Workspaces (GSW) -- an entity- and event-aware network of question-answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning-grounded inference chains on the network. Given a query, Panini only traverses the continually-updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, Panini achieves the highest average performance, 5%-7% higher than other competitive baselines, while using 2-30x fewer answer-context tokens, supporting fully open-source pipelines, and reducing unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time -- as achieved by the GSW framework -- yields both efficiency and reliability gains at read time. Code is available at https://github.com/roychowdhuryresearch/gsw-memory.
Reinforcement Learning (RL) has significantly improved the reasoning of large language models, but existing RL fine-tuning methods rely heavily on heuristic techniques, such as entropy regularization and reweighting, to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of per-token policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we show that training instability is driven by a small fraction of tokens, roughly 0.01%, which we call *spurious tokens*. When these tokens appear in correct responses, they contribute little to the reasoning outcome yet inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refinement, which selectively masks such updates and renormalizes the loss over valid tokens. On six mathematical reasoning benchmarks using the Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy, and JustRL.
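A minimal sketch of the masking-and-renormalization idea: tokens flagged as spurious (here, very low probability at low-entropy positions inside rewarded responses; the thresholds and flagging rule are illustrative assumptions) are excluded from the per-token policy-gradient loss, which is then renormalized over the remaining valid tokens.

```python
import torch

def stapo_style_loss(logprobs, entropies, advantages, response_mask,
                     prob_threshold=1e-4, entropy_threshold=0.05):
    """Policy-gradient loss that masks 'spurious' tokens and renormalizes.

    logprobs, entropies, advantages, response_mask: (B, T) tensors.
    The flagging rule and thresholds below are illustrative assumptions,
    not STAPO's exact criterion.
    """
    probs = logprobs.exp()
    spurious = (probs < prob_threshold) & (entropies < entropy_threshold) \
               & (advantages > 0)               # low-prob tokens in rewarded responses
    valid = response_mask.bool() & ~spurious

    per_token = -advantages * logprobs          # REINFORCE-style objective
    per_token = per_token * valid.float()
    # Renormalize over valid tokens only, so masking does not shrink the loss scale.
    return per_token.sum() / valid.float().sum().clamp(min=1.0)

# Toy usage
B, T = 2, 6
loss = stapo_style_loss(
    logprobs=torch.randn(B, T).clamp(max=0),    # log-probabilities (<= 0)
    entropies=torch.rand(B, T),
    advantages=torch.ones(B, T),
    response_mask=torch.ones(B, T),
)
```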
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.
Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid progress, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift arises from the reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We present ViewRope, a geometry-aware encoding that injects camera ray directions directly into the self-attention layers of video transformers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. In addition, we propose Geometry-Aware Sparse Cross-Frame Attention, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present ViewBench, a diagnostic toolkit that measures loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational cost.
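The geometric signal can be sketched as follows: per-patch camera ray directions are computed from the intrinsics and camera-to-world rotation, and relative ray geometry then informs attention. For simplicity the sketch turns ray agreement into an additive attention bias; ViewRope itself injects ray geometry as a rotary-style encoding inside self-attention, so this is an illustration of the signal, not the paper's encoding.

```python
import torch

def patch_ray_directions(K, R, H, W, patch=16):
    """Unit ray directions (world frame) for each patch center, given camera
    intrinsics K (3x3) and camera-to-world rotation R (3x3)."""
    ys = torch.arange(patch // 2, H, patch, dtype=torch.float32)
    xs = torch.arange(patch // 2, W, patch, dtype=torch.float32)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pix = torch.stack([gx, gy, torch.ones_like(gx)], dim=-1).reshape(-1, 3)
    rays = pix @ torch.linalg.inv(K).T @ R.T      # back-project, rotate to world
    return rays / rays.norm(dim=-1, keepdim=True)

def ray_attention_bias(rays_q, rays_k, scale=4.0):
    """Additive attention bias from relative ray geometry: tokens whose camera
    rays point in similar world directions attend more strongly (simplified
    stand-in for a rotary-style ray encoding)."""
    return scale * (rays_q @ rays_k.T)            # cosine of the ray angle

# Toy usage: two 256x256 frames whose cameras differ by a small yaw rotation.
K = torch.tensor([[128.0, 0, 128], [0, 128.0, 128], [0, 0, 1]])
R0 = torch.eye(3)
R1 = torch.matrix_exp(torch.tensor([[0, 0, 0.3], [0, 0, 0], [-0.3, 0, 0]]))
rays0 = patch_ray_directions(K, R0, 256, 256)
rays1 = patch_ray_directions(K, R1, 256, 256)
bias = ray_attention_bias(rays0, rays1)           # added to cross-frame attention logits
```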
Although large language models (LLMs) demonstrate expert-level medical knowledge, aligning their open-ended responses with clinicians' fine-grained preferences remains a challenge. Existing methods often rely on general objectives or unreliable automatic evaluators that are weakly grounded in professional guidelines. We propose a two-stage framework to address this gap. First, we present HealthRubrics, a dataset of 7,034 physician-verified preference examples in which clinicians refine LLM-drafted rubrics to meet rigorous medical standards. Second, we distill these rubrics into HealthPrinciples: 119 broadly reusable, clinically grounded principles organized by clinical dimension, enabling scalable supervision beyond manual annotation. We use HealthPrinciples for (1) offline alignment by synthesizing rubrics for unlabeled queries and (2) as an inference-time tool for guided self-revision. A 30B-parameter model that activates only 3B parameters at inference, trained with our framework, reaches 33.4% on HealthBench-Hard, outperforming much larger models such as Deepseek-R1 and o3 and establishing a resource-efficient reference point for clinical alignment.
For foundation model deployment, practitioners increasingly need prescriptive scaling laws: given a pretraining compute budget, what downstream accuracy is achievable under contemporary post-training practices, and how stable is that correlation as the field evolves? Using large-scale observational evaluations with 5k observational and 2k newly sampled model-performance records, we estimate capability frontiers (high conditional quantiles of benchmark scores as a function of log pretraining FLOPs) via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate temporal reliability by fitting the model on earlier model generations and evaluating on later releases. Across diverse tasks, the estimated frontiers are mostly stable, with the exception of mathematical reasoning, which exhibits a steadily advancing frontier over time. We then extend our approach to analyze task-dependent saturation and to investigate contamination-related shifts on mathematical reasoning tasks. Finally, we introduce an efficient algorithm that recovers near-complete data frontiers using roughly 20% of the evaluation budget. Together, our work releases Proteus 2k, the most recent model-performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability frontiers shift over time.
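A small sketch of the frontier fit: a monotone, saturating sigmoid in log-FLOPs fitted to a high conditional quantile via the pinball loss. The optimizer, smoothing, and initialization below are simplifications of the paper's smoothed quantile regression, and the data are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

def pinball(residual, tau):
    """Quantile (pinball) loss for quantile level tau."""
    return np.where(residual >= 0, tau * residual, (tau - 1) * residual)

def fit_frontier(log_flops, scores, tau=0.95):
    """Fit q(x) = lo + (hi - lo) * sigmoid(a (x - b)) with a > 0 (monotone,
    saturating) to a high conditional quantile of score vs. log FLOPs."""
    def predict(params, x):
        lo, hi, log_a, b = params
        a = np.exp(log_a)                         # a > 0 enforces monotonicity
        return lo + (hi - lo) / (1.0 + np.exp(-a * (x - b)))

    def objective(params):
        return pinball(scores - predict(params, log_flops), tau).mean()

    x0 = np.array([scores.min(), scores.max(), 0.0, np.median(log_flops)])
    res = minimize(objective, x0, method="Nelder-Mead")
    return res.x, lambda x: predict(res.x, x)

# Toy usage with synthetic model evaluations.
log_flops = np.random.uniform(20, 26, 300)
scores = 1 / (1 + np.exp(-(log_flops - 23))) - 0.2 * np.random.rand(300)
params, frontier = fit_frontier(log_flops, scores, tau=0.9)
print(frontier(np.array([21.0, 24.0])))           # expected frontier scores
```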
Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule conditioning during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.
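The schedule-shaped initialization can be sketched in a few lines: each step of the new action chunk starts denoising from a convex mixture of the known (overlapping) actions and fresh noise, with a per-step weight that carries more action information early in the chunk. The linear-decay schedule and sizes below are illustrative choices, not Legato's exact schedule.

```python
import torch

def schedule_shaped_init(prev_actions, noise, weights):
    """Initialize the denoising state as a per-step mixture of known actions
    and noise: x0[h] = w[h] * prev_actions[h] + (1 - w[h]) * noise[h].

    prev_actions, noise: (H, action_dim); weights: (H,) in [0, 1], typically
    near 1 for steps overlapping the previously executed chunk and decaying
    toward 0 for later steps.
    """
    w = weights.unsqueeze(-1)
    return w * prev_actions + (1.0 - w) * noise

# Toy usage: a chunk of 8 steps where the first 3 overlap the previous chunk.
H, action_dim, overlap = 8, 7, 3
weights = torch.zeros(H)
weights[:overlap] = torch.linspace(1.0, 0.5, overlap)   # partial action information
x0 = schedule_shaped_init(torch.randn(H, action_dim), torch.randn(H, action_dim), weights)
# x0 then seeds the flow-based denoising loop of the action-chunked policy.
```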
World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint-embedding prediction from image patches to object-centric representations. By applying object-level masking that requires inferring an object's state from other objects, C-JEPA induces latent interventions with counterfactual effects and avoids shortcut solutions, making reasoning about interactions essential. Empirically, C-JEPA yields consistent improvements on visual question answering, with an absolute improvement of roughly 20% on counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning, using only 1% of the total input latent features required by patch-based world models while achieving comparable performance. Finally, we provide a formal analysis showing that object-level masking induces a causal inductive bias through latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
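A minimal sketch of object-level masked joint-embedding prediction: a random subset of object slots is replaced by a learned mask token, and a predictor must recover the target embedding of each masked object from the visible ones, so the task cannot be solved without using inter-object relations. Sizes and the single transformer layer are illustrative, not C-JEPA's configuration.

```python
import torch
import torch.nn as nn

class ObjectMaskedJEPA(nn.Module):
    """Toy object-level masked joint-embedding predictor (illustrative sketch)."""

    def __init__(self, slot_dim=64, n_heads=4):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(slot_dim))
        self.predictor = nn.TransformerEncoderLayer(
            d_model=slot_dim, nhead=n_heads, batch_first=True)

    def forward(self, context_slots, target_slots, mask):
        # Replace masked object slots with a learned mask token.
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(context_slots),
                        context_slots)
        pred = self.predictor(x)
        # Loss only on masked objects: infer an object's state from the others.
        diff = (pred - target_slots.detach()) ** 2
        return diff[mask].mean()

# Toy usage: a batch of 2 scenes with 5 object slots each.
B, N, D = 2, 5, 64
model = ObjectMaskedJEPA(slot_dim=D)
mask = torch.rand(B, N) < 0.4                     # mask roughly 40% of objects
loss = model(torch.randn(B, N, D), torch.randn(B, N, D), mask)
```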
Efficient long-context processing remains a key challenge for contemporary large language models (LLMs), especially in resource-constrained settings. Soft-compression architectures promise to extend the effective context length by replacing long token sequences with smaller sets of learned compressed tokens. However, the limits of compressibility, and when compression begins to discard task-relevant content, remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and we propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-independent saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited ability to detect overflow. Lightweight probing classifiers over the xRAG representations of both the query and the context detect overflow with an average AUC-ROC of 0.72 on the HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results move from query-independent diagnostics toward query-aware detectors, enabling low-cost pre-LLM filtering to mitigate compression-induced errors.
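The query-aware probe is essentially a lightweight classifier over compressed representations; a minimal sketch with stand-in features is below. Real xRAG embeddings of the context and query would replace the random arrays, and on random data the AUC is naturally near chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Sketch of an overflow probe: a linear classifier over the concatenated
# compressed-context and query representations, labeled 1 when the compressed
# context no longer suffices to answer the query. Feature construction and
# dimensions are illustrative assumptions.
n, d = 2000, 256
context_emb = np.random.randn(n, d)     # stand-in for xRAG compressed-context embeddings
query_emb = np.random.randn(n, d)       # stand-in for query representations
overflow = np.random.randint(0, 2, n)   # 1 = answer lost under compression

X = np.concatenate([context_emb, query_emb], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, overflow, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC-ROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```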
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas
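The hub-and-spoke idea can be sketched with one encoder/decoder adapter per model family into a shared latent space, which is what brings pairwise alignment from O(N^2) translators down to O(N) adapters. The linear adapters and hidden sizes below are illustrative assumptions, not the paper's Universal Visual Codec.

```python
import torch
import torch.nn as nn

class UniversalVisualCodecSketch(nn.Module):
    """Toy hub-and-spoke codec: each model family gets one encoder into a
    shared latent space and one decoder out of it (illustrative sketch)."""

    def __init__(self, family_dims: dict, shared_dim: int = 1024):
        super().__init__()
        self.to_hub = nn.ModuleDict(
            {name: nn.Linear(d, shared_dim) for name, d in family_dims.items()})
        self.from_hub = nn.ModuleDict(
            {name: nn.Linear(shared_dim, d) for name, d in family_dims.items()})

    def transmit(self, sender: str, receiver: str, hidden_states: torch.Tensor):
        """Map the sender's reasoning trace into the shared space, then into
        the receiver's visual-embedding space (to be injected as visual tokens)."""
        shared = self.to_hub[sender](hidden_states)
        return self.from_hub[receiver](shared)

# Toy usage with two heterogeneous families (hidden sizes are made up).
codec = UniversalVisualCodecSketch({"qwen_vl": 3584, "gemma": 2304})
trace = torch.randn(1, 16, 3584)                          # sender's latent reasoning trace
visual_tokens = codec.transmit("qwen_vl", "gemma", trace) # fed to the receiver's vision pathway
```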
Clawdbot is a self-hosted, tool-using personal AI agent with a broad action space spanning from local execution to web-mediated workflows, which raises heightened safety and security concerns under ambiguity and adversarial steering. We present a trajectory-centric evaluation of Clawdbot across six risk dimensions. Our test suite samples and lightly adapts scenarios from prior agent-safety benchmarks (including ATBench and LPS-Bench) and supplements them with manually designed cases tailored to Clawdbot's tool surface. We record complete interaction trajectories (messages, actions, tool-call arguments/outputs) and assess safety using both an automated trajectory judge (AgentDoG-Qwen3-4B) and human review. Across 34 canonical cases, we find a non-uniform safety profile: performance is generally consistent on reliability-focused tasks, while most failures arise under underspecified intent, open-ended goals, or benign-looking jailbreak prompts, where small misinterpretations can escalate into higher-impact tool actions. We complement the aggregate results with representative case studies and summarize the common characteristics of these cases, analyzing the safety vulnerabilities and typical failure modes that Clawdbot tends to trigger in practice.
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7--10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30--40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified
Large language models (LLMs) continue to struggle with knowledge-intensive questions that require up-to-date information and multi-hop reasoning. Augmenting LLMs with hybrid external knowledge, such as unstructured text and structured knowledge graphs, offers a promising alternative to costly continual pretraining. As such, reliable evaluation of their retrieval and reasoning capabilities becomes critical. However, many existing benchmarks increasingly overlap with LLM pretraining data, which means answers or supporting knowledge may already be encoded in model parameters, making it difficult to distinguish genuine retrieval and reasoning from parametric recall. We introduce HybridRAG-Bench, a framework for constructing benchmarks to evaluate retrieval-intensive, multi-hop reasoning over hybrid knowledge. HybridRAG-Bench automatically couples unstructured text and structured knowledge graph representations derived from recent scientific literature on arXiv, and generates knowledge-intensive question-answer pairs grounded in explicit reasoning paths. The framework supports flexible domain and time-frame selection, enabling contamination-aware and customizable evaluation as models and knowledge evolve. Experiments across three domains (artificial intelligence, governance and policy, and bioinformatics) demonstrate that HybridRAG-Bench rewards genuine retrieval and reasoning rather than parametric recall, offering a principled testbed for evaluating hybrid knowledge-augmented reasoning systems. We release our code and data at github.com/junhongmit/HybridRAG-Bench.