Showing top 4 results for "GPT-5.5 transition"

Paper page - REPOT: Recoverable Program-of-Thought via Checkpoint Repair

… RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2, +5.4]), is within sampling noise on GPT-medium and Claude, an… …

May 29, 2026

Paper page - τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

… Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap … …

Jun 11, 2026

Paper page - GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

… We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. …

Jun 1, 2026

Paper page - Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

… On WebArena, SGDR improves over strong online skill-learning baselines across five domains, achieving 37.5% success rate with GPT-4.1 and 24.3% with Qwen3-4B, while also reducing the average number of steps. …

Jun 10, 2026
