Showing top 117 results for "model-by-model evaluation"

People also ask

What’s the difference between evaluating an AI model and evaluating an AI agent? 

While model and agent evaluation are inextricably linked, their technical benchmarks and metrics for success are fundamentally different.

Mastering Agentic Techniques: AI Agent Evaluation | NVIDIA Technical Blog

Discussions and forums

Hacker News · u/linzhiqiu · 1w ago

Show HN: VQAScore – open eval metric/reward model, now for text-to-video

Two years ago we released VQAScore: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It became a go-to evaluation metric and reward model for image generation, replacing CLIPScore across the field …

Hacker News · u/deepakakkil · May 15, 2026

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long-horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects such as the need for advanced reasoning, tool calling, working…

Hacker News · u/dhavalt · 2h ago

Show HN: AptSelect – A local LLM client for parallel testing and evaluation

I built AptSelect to stop writing throwaway scripts every time I needed to test how different LLMs handle specific instructions and prompt edge cases. What it does: Parallel Execution: Send a single prompt to OpenAI, Anthr…

Hacker News · u/jrhizor · 4w ago

Show HN: Elmo (Open Source AEO)

I'm excited to announce Elmo, an MIT-licensed, open source AEO/GEO tool. We help you scrape ChatGPT/Google AI Mode/etc using web scrapers like BrightData/Olostep/etc, evaluate prompts against the OpenAI/Anthropic/Mistral …

Hacker News · u/JohannaAlmeida · Apr 7, 2026

Hybrid Attention

TLDR: Forked PyTorch and Triton internals. Changed attention so it's a linear first layer, quadratic middle layer, linear last layer. Inference got much faster with a low perplexity hit in tests. Full attention O(n²): 17.9…
