Showing top 116 results for "real-world evaluation"

Top stories

huggingface.co › papers › 2602.00095

Paper page - EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

… Unpacking Multimodal Error Analysis in Handwritten Math 2026 CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing 2026 EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content 2026 Unveiling Fine-Grained Visual Traces: Evaluati… …

May 8, 2026

Discussions and forums

r/netsec · u/Fickle-Box1433 · 4d ago

I evaluated 5 LLM agents on patching real-world CVEs. Here is what I found.

I built an independent benchmark with 20 real CVEs across 15 CWE categories, 5 models (3 OpenAI, 2 Poolside Laguna), three prompt conditions: full advisory, behavioral description only, and location only (file and functi…

Hacker News · u/deepakakkil · 2w ago

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects such as need for advanced reasoning, tool calling, working…

Hacker News · u/adnan9999 · 1w ago

Show HN: Unsiloed AI – #1 on olmOCR-Bench

Most of the document parsers fail on real world challenges like complex tables, handwritten documents, historical document scans, equations, multi-column layouts, complex reading order, etc. We built Unsiloed Parser to h…

r/Games · u/Turbostrider27 · 2w ago

LEGO Batman: Legacy of the Dark Knight Review Thread

Game Information Game Title: LEGO Batman: Legacy of the Dark Knight Platforms: Nintendo Switch 2 (May 22, 2026) PlayStation 5 (May 22, 2026) Xbox Series X/S (May 22, 2026) PC (May 22, 2026) Trailer: Developer: Review Agg…

r/Android · u/MishaalRahman · 3w ago

New features, emojis, & security improvements: Here’s everything new coming to Android!

Hi Reddit, We just wrapped up The Android Show | I/O Edition, and a core theme of the show was how we’re making your phone more helpful so that you can spend less time looking at it and more time living your life. To mak…