Search: Model performance claims

Paper page - ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

… The performance of agent systems built on LLMs depends on both the model weights and the harness around them, which governs what information to store, retrieve, and present to the model. …

May 6, 2026

Paper page - Can LLMs Introspect? A Reality Check

… Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. …

May 27, 2026

Paper page - Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding

arXiv:2605.09271. Published on May 10; submitted by zhiqin yang on May 12. Abstract: Language representation design significantly impacts large language model performance and interna… …

May 12, 2026

Paper page - MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

… arXiv:2605.08678 …

May 12, 2026

Paper page - Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

… The following papers were recommended by the Semantic Scholar API: MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization (2026); MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities (2026); Enhance-then-Balance Modality Collaboration for Robust M… …

May 8, 2026

Paper page - PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

… Extensive experiments across various LLMs reveal critical insights into model performance, including differences between proprietary and open-source models, as well as task asymmetries between examiner analysis and applicant-side rebuttal. …

May 6, 2026

Paper page - Let ViT Speak: Generative Language-Image Pre-training

… This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. …

May 4, 2026

Paper page - RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

arXiv:2605.04523. Published on May 6; submitted by Ivan Bondarenko on May 8. Novosibirsk State University. Authors: Ivan Bondarenko, Roman Derunets, …, Ivan Chern… …

May 8, 2026

Paper page - Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

arXiv:2605.09708. Published on May 10; submitted by Victor Gallego on May 12. Author: Víctor Gallego. Abstract: A benchmark for optimizing scientific computing kernels on Apple Silicon is paired with an… …

May 12, 2026

Paper page - EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

… We really appreciate your interest and feedback ^ ^. The four-way taxonomy of recognition errors and the diagnostic LLM detector that ties those errors to downstream grading is the standout part for me. It's refreshing to see upstream recognition and autograding evaluated together, and the claim th… …

May 8, 2026
