Search: GPT-5

Paper page - RewardHarness: Self-Evolving Agentic Post-Training

…Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward…

May 14, 2026

Paper page - IntentGrasp: A Comprehensive Benchmark for Intent Understanding

…Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60…

May 11, 2026

Paper page - Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

…On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock by 40% relative to an always-on GPT-4o judge, at a 5.1…

Jun 9, 2026

Paper page - BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

…Fine-tuning a 4-billion-parameter LLM on BioTool yields substantial improvements in biomedical tool-calling performance, outperforming cutting-edge commercial LLMs such as GPT-5.1. Furthermore, human expert evaluations demonstrate…

May 8, 2026

Paper page - Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

…An 8B model trained with DCRL hits 70.5 F1 and beats GPT-5-mini on ReasonMatch-Bench—nice evidence that geometric supervision + RL can unlock spatial reasoning without CoT labels. This…

Jun 4, 2026

Paper page - IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

…safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard -- GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5…

May 13, 2026

Paper page - AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

…Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3×…

May 12, 2026

Paper page - AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

…On five tasks under a shared GPT-5.5 backbone, AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction that reduces lower-wall Cf RMSE against DNS by 7.89% on…

May 15, 2026

Paper page - InterleaveThinker: Reinforcing Agentic Interleaved Generation

…On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX…

Jun 12, 2026

Paper page - Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

…In pairwise speaker-listener experiments with GPT-OSS-120b and Qwen3.5, we show these languages can be learned in-context from a short description alone, with oversight-evasion grammars no harder…

Jun 1, 2026
