Search: performance benchmarking

Paper page - LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

…Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy--cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales…

May 11, 2026

Paper page - Audio-Visual Intelligence in Large Foundation Models

…A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos (2026) AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video…

May 8, 2026

Paper page - The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

…An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech (2026) Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation (2026) LASE: Language-Adversarial Speaker…

May 6, 2026

Paper page - VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

…Across multiple benchmarks and real-robot evaluation, VisualThink-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime…

Jun 1, 2026

Paper page - The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

…Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. Benchmark is publicly available at: https://github.com/ant…

Jun 4, 2026

Paper page - WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

…WorldMemArena is a new benchmark evaluating the multimodal memory of long-horizon agents using a four-stage Action-World Interaction Loop and multi-session tasks for detailed performance diagnostics. This is an…

May 29, 2026

Paper page - Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

…Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from…

Apr 30, 2026

Paper page - X2SAM: Any Segmentation in Images and Videos

…With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability…

May 6, 2026

Paper page - Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

…Yilun Zhao, Tingyu Song. Abstract: Researchers introduce BRIGHT-Pro, an expanded expert-annotated benchmark for reasoning-intensive retrieval, and RTriever-Synth, an aspect-decomposed synthetic corpus, to improve retriever performance through agentic…

May 7, 2026

Paper page - Counting as a minimal probe of language model reliability

…AI-generated summary: Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects…

May 5, 2026

Top stories

Paper page - Benchmarking Visual State Tracking in Multimodal Video Understanding

Paper page - SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

Paper page - MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

Paper page - 3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code