Showing top 114 results for "Benchmarks and reliability"

Paper page - Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

… Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. …

May 14, 2026

Paper page - Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

… However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. …

May 14, 2026

H100 vs GB200 NVL72 Training Benchmarks - Power, TCO, and Reliability Analysis, Software Improvement Over Time

Joules per Token, TCO Per Million Tokens, MFU, Tokens Per US Annual Household Energy Usage, DeepSeek 670B, GB200 Unreliability, Backplane Downtime. Frontier model training has pushed GPUs a… …

Aug 20, 2025 · Dylan Patel

Intel Core Ultra 9 290HX Plus Benchmark Leak Shows Slightly Higher Single And MT Performance Vs Ultra 9 285HX

… According to the leaked benchmarks, the 290HX Plus is delivering 3,153 points in single-core and 21,720 points in multi-core tests. …

Mar 2, 2026 · Sarfraz Khan

Discussions and forums

r/LocalLLaMA · u/Glittering_Focus1538 · 1w ago

I built a coding agent that gets 87% on benchmarks with a 4B parameter model, here's how

I was frustrated that every coding agent (OpenCode, Cursor, Claude Code) assumes you're running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart. I find that often tool calls …

Hacker News · u/zambelli · 1w ago

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments. I built Forge, an open-source reliability layer for self-hosted LLM tool-calling. What it does: - Adds domain-and-tool-agnostic guardrails (retry nudges, step e…

Paper page - An Empirical Study of Automating Agent Evaluation

… Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably…

May 14, 2026

Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench

… To answer this, the research community has built several benchmarks. …

Apr 29, 2026

Intel Core Ultra X7 358H Comes Out Noticeably Slower Than Ultra 7 265H On PassMark

Once again, the Core Ultra X7 358H was benchmarked, but this time we have a much more reliable benchmark platform than Geekbench. Core Ultra X7 358H is 15% Slower Than Core Ultra…

Nov 14, 2025 · Sarfraz Khan

Eval awareness in Claude Opus 4.6’s BrowseComp performance

… This finding raises questions about whether static benchmarks remain reliable when run in web-enabled environments. …

Mar 6, 2026

Mastering Agentic Techniques: AI Agent Evaluation | NVIDIA Technical Blog

… While model and agent evaluation are inextricably linked, their technical benchmarks and metrics for success are fundamentally different. …

May 19, 2026 · Edward Li

Paper page - RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

… Existing ICU benchmarks typically treat historical clinician actions as ground truth. …

May 14, 2026
