Search: AI performance claims

Paper page - ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

… The assurance layer includes a three-stage process for checking whether experimental claims are supported by evidence: integrity verification, result-to-claim mapping, and claim auditing that cross-checks manuscript statements against the claim ledger and raw evidence, as well as a five-pass scie… …

May 6, 2026

Paper page - MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

arxiv:2605.08678. Published on May 9; submitted by Bohan22 on May 11. Authors: Bohan Lyu, Jiaru Zhang, Qixin Xu, Junlin Yang, and others. Abstract: Current AI agents struggle to invent gen… …

May 12, 2026

Paper page - Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding

… With a new formalization, we present multiple lines of evidence to support our position: Firstly, we review recent empirical practices and emerging methodologies that demonstrate the substantial performance gains achievable through deliberate language representation design, even without modifying m… …

May 12, 2026

Paper page - Can LLMs Introspect? A Reality Check

… Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. …

May 27, 2026

Paper page - PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

… The following papers were recommended by the Semantic Scholar API: Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Validation (2026); MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation (2026); Beyond Rating: A Comprehensive Evaluation and Benchm… …

May 6, 2026

Paper page - The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

… AI-generated summary: On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient λ > 1, the student can lift past the teacher in domain, but past a threshold λ the same step violates the output contract on structured-output tasks. …

May 14, 2026

Paper page - Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

… AI-generated summary: Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. …

May 8, 2026

Paper page - Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

arxiv:2605.09708. Published on May 10; submitted by Victor Gallego on May 12. Author: Víctor Gallego. Abstract: A benchmark for optimizing scientific computing kernels on Apple Silicon is paired with an… …

May 12, 2026

Paper page - EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

… We really appreciate your interest and feedback ^ ^. the four-way taxonomy of recognition errors and the diagnostic LLM detector that ties those errors to downstream grading is the standout part for me. it's refreshing to see upstream recognition and autograding evaluated together, and the claim th… …

May 8, 2026

Paper page - Let ViT Speak: Generative Language-Image Pre-training

arxiv:2605.00809. Published on May 1; submitted by taesiri on May 4. #3 Paper of the day. ByteDance. Authors: Zilong Huang and others. Abstract: GenLIP is a minimalist generative pretraining framework for Vision Transformers that directly … …

May 4, 2026
