Search

Showing top 114 results for "real-world evaluation"

Donating our open-source alignment tool

…appear realistic, a model can often deduce from various artificialities in the setup that it’s actually part of a test. And if the model is aware it’s being evaluated, the…

May 7, 2026

Paper page - ModelLens: Finding the Best for Your Task from Myriads of Models

…unified framework that recommends models in real-world scenarios by learning from public leaderboard data to rank unseen models on unseen datasets without requiring costly evaluations. AI-generated summary The open-source…

May 11, 2026

Rethinking AI TCO: Why Cost per Token Is the Only Metric That Matters

…The denominator is everything beneath the surface, which represents key factors that determine real-world token output. Accurately evaluating AI infrastructure starts with asking what lies beneath. Surface-level inquiry: What is…

Apr 15, 2026 · Shruti Koparkar

Paper page - Nexus : An Agentic Framework for Time Series Forecasting

…While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their…

May 18, 2026

Discussions and forums

r/netsec · u/Fickle-Box1433 · 1w ago

I evaluated 5 LLM agents on patching real-world CVEs. Here is what I found.

I built an independent benchmark with 20 real CVEs across 15 CWE categories, 5 models (3 OpenAI, 2 Poolside Laguna), three prompt conditions: full advisory, behavioral description only, and location only (file and functi…

Hacker News · u/deepakakkil · 3w ago

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects such as need for advanced reasoning, tool calling, working…

Hacker News · u/adnan9999 · 1w ago

Show HN: Unsiloed AI – #1 on olmOCR-Bench

Most document parsers fail on real-world challenges such as complex tables, handwritten documents, historical document scans, equations, multi-column layouts, complex reading order, etc. We built Unsiloed Parser to h…

r/Games · u/Turbostrider27 · 2w ago

LEGO Batman: Legacy of the Dark Knight Review Thread

Game Information Game Title: LEGO Batman: Legacy of the Dark Knight Platforms: Nintendo Switch 2 (May 22, 2026) PlayStation 5 (May 22, 2026) Xbox Series X/S (May 22, 2026) PC (May 22, 2026) Trailer: Developer: Review Agg…

r/Android · u/MishaalRahman · 3w ago

New features, emojis, & security improvements: Here’s everything new coming to Android!

Hi Reddit, We just wrapped up The Android Show | I/O Edition, and a core theme of the show was how we’re making your phone more helpful so that you can spend less time looking at it and more time living your life. To mak…

Can AI Chatbots Reason Like Doctors?

…in progress,” Tordjman says. “There’s no perfect way to evaluate LLMs in clinical reasoning.” Testing Medical AI in the Real World For the Science study, the researchers tested the OpenAI model…

May 13, 2026 · Greg Uyeno

Google says Android with Chrome is ‘fastest mobile platform for web browsing’

…Google cites two aspects to evaluating web performance, starting with responsiveness. Speedometer “simulates real-world user actions… to measure interaction latency” and is used by all major browser engines. High scores translate…

Mar 25, 2026 · Abner Li

Harness design for long-running application development

…But for the parts of the build that were still at the edge of the generator’s capabilities, the evaluator continued to give real lift. The practical implication is that the evaluator…

Mar 24, 2026

Paper page - Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

…A Unified Multimodal Agent for World-Grounded Image Synthesis (2026) Latent World Models for Automated Driving: A Unified Taxonomy, Evaluation Framework, and Open Challenges (2026) From Visual Synthesis to Interactive Worlds: Toward…

May 1, 2026

NVIDIA NeMo Retriever

…in retrieval technology—the foundation that powers world‑class intelligent document processing. World-Class Information-Retrieval Performance Nemotron accelerates multimodal document extraction and real-time retrieval with lower costs and higher accuracy…

Paper page - RLDX-1 Technical Report

…Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g. π_{0.5} and GR00T N1.6) across both simulation benchmarks and real-world tasks that require…