Search

Showing top 114 results for "real-world evaluation"

A New Approach for Evaluating AI Model Fairness

…in the world of AI. You can listen to the full conversation here. This conversation has been edited and condensed for brevity and clarity. An Alternative Way to Evaluate Model Fairness Katherine…

· Hosted by Katherine Druckman

Windows 11's new driver allow-list could break your old hardware, and Microsoft won't tell you what's on it

…Instead of having vendors apply or consulting end users, Microsoft built it internally, using what it describes as "billions of driver load signals and real-world usage data" gathered across Windows 11…

May 27, 2026 · Ty Sherback

Introducing Claude Opus 4.5

…Claude Opus 4.5 is state-of-the-art on tests of real-world software engineering: Opus 4.5 is available today on our apps, our API, and on all three major…

Nov 24, 2025

Eval awareness in Claude Opus 4.6’s BrowseComp performance

…accomplish a task, and how difficult it will be to constrain its behavior in the real world, particularly on complex, compute-intensive, long-running tasks, which increase the likelihood of an agent…

Mar 6, 2026

Discussions and forums

r/netsec · u/Fickle-Box1433 · 1w ago

I evaluated 5 LLM agents on patching real-world CVEs. Here is what I found.

I built an independent benchmark with 20 real CVEs across 15 CWE categories, 5 models (3 OpenAI, 2 Poolside Laguna), three prompt conditions: full advisory, behavioral description only, and location only (file and functi…

Hacker News · u/deepakakkil · 3w ago

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects such as need for advanced reasoning, tool calling, working…

Hacker News · u/adnan9999 · 2w ago

Show HN: Unsiloed AI – #1 on olmOCR-Bench

Most of the document parsers fail on real world challenges like complex tables, handwritten documents, historical document scans, equations, multi-column layouts, complex reading order, etc. We built Unsiloed Parser to h…

r/Games · u/Turbostrider27 · 3w ago

LEGO Batman: Legacy of the Dark Knight Review Thread

Game Information Game Title: LEGO Batman: Legacy of the Dark Knight Platforms: Nintendo Switch 2 (May 22, 2026) PlayStation 5 (May 22, 2026) Xbox Series X/S (May 22, 2026) PC (May 22, 2026) Trailer: Developer: Review Agg…

r/Android · u/MishaalRahman · 4w ago

New features, emojis, & security improvements: Here’s everything new coming to Android!

Hi Reddit, We just wrapped up The Android Show | I/O Edition, and a core theme of the show was how we’re making your phone more helpful so that you can spend less time looking at it and more time living your life. To mak…

HEAL: A framework for health equity assessment of machine learning performance

…AI and health equity, and may provide a useful evaluation framework not only during model development, but during pre-implementation and real-world monitoring stages, e.g., in the form of health…

Mar 15, 2024

NVIDIA Launches Earth-2 Family of Open Models — the World’s First Fully Open, Accelerated Set of Models and Tools for AI Weather

…Weather Forecasting AI weather tool provider Brightband — a member of the NVIDIA Inception program’s Sustainable Futures initiative — is running Earth-2 Medium Range to issue real-world global forecasts daily. “The…

Jan 26, 2026 · Mike Pritchard
