Search

Showing top 63 results for "GPT-5"

GPT-5

Paper page - Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

…We report matched single-model sweeps of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on M1 Pro: in-distribution self-speedups span 1.00× to 10.7×…

May 12, 2026

Paper page - K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

…On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through…

Jun 2, 2026

Paper page - Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

…Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that…

May 12, 2026

Paper page - τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

…Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep…

Jun 11, 2026

Easily Build and Share ROCm Kernels with Hugging Face

…https://huggingface.co/kernels-community/gpt-oss-metal-kernels

Nov 17, 2025 · Abdennacer Badaoui

Paper page - SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

…Proposes a standardized GPT-5-mini judge protocol scoring trajectories on start-location consistency, goal satisfaction, obstacle avoidance, and trajectory efficiency, showing GPT-5-mini leads among tested models but still degrades…

May 13, 2026

Paper page - Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

…On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open…

May 12, 2026

Paper page - The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

…Across models the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of…

Jun 2, 2026

Paper page - Pruning and Distilling Mixture-of-Experts into Dense Language Models

…We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS…

Jun 9, 2026

Paper page - DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

…Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4…

Jun 9, 2026
