Search

Showing top 117 results for "model-by-model evaluation"

Paper page - Model-Based Quality Assessment for Massively Multilingual Parallel Data

arxiv:2606.00285 · Published on May 29 · Submitted by Shaoxiong on Jun 2 · MaLA-LM · Authors: Zihao Li, … · Abstract: Multilingual parallel-data…

Jun 2, 2026

Paper page - Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

…for Evaluating Agent Values · Published on May 11 · Submitted by Haoran Ye on May 13 · Peking University · Authors: Haonan Dong, … · Abstract: Autonomous agents exhibit distinct value systems from underlying language models, requiring…

May 13, 2026

Teaching Claude why

…data by sampling the model on each of the prompts and filtering down to cases where the assistant chose not to take the honeypot. Despite very closely matching the evaluation distribution, we…

May 8, 2026

Paper page - Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

…We evaluate 13 frontier models under a unified protocol. Key findings: The top model (Claude Opus 4.6) passes only 66.7% of tasks, and no model reaches 70%. Local workspace repair…

May 1, 2026

Discussions and forums

Hacker News · u/linzhiqiu · 1w ago

Show HN: VQAScore – open eval metric/reward model, now for text-to-video

Two years ago we released VQAScore: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It became a go-to evaluation metric and reward model for image generation, replacing CLIPScore across the field …

Hacker News · u/deepakakkil · May 15, 2026

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long-horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects, such as the need for advanced reasoning, tool calling, working…

Hacker News · u/dhavalt · 2h ago

Show HN: AptSelect – A local LLM client for parallel testing and evaluation

I built AptSelect to stop writing throwaway scripts every time I needed to test how different LLMs handle specific instructions and prompt edge cases. What it does: Parallel Execution: Send a single prompt to OpenAI, Anthr…

Hacker News · u/jrhizor · 4w ago

Show HN: Elmo (Open Source AEO)

I'm excited to announce Elmo, an MIT-licensed, open source AEO/GEO tool. We help you scrape ChatGPT/Google AI Mode/etc. using web scrapers like BrightData/Olostep/etc., evaluate prompts against the OpenAI/Anthropic/Mistral …

Hacker News · u/JohannaAlmeida · Apr 7, 2026

Hybrid Attention

TL;DR: Forked PyTorch and Triton internals. Changed attention so it's a linear first layer, a quadratic middle layer, and a linear last layer. Inference got much faster with a low perplexity hit in tests. Full attention O(n²): 17.9…

Paper page - Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

…Do VLMs Know When Not to Answer Spatial Questions (and Why)? · Published on May 28 · Submitted by Yue Zhang on Jun 1 · Abstract: Vision-language models exhibit overconfidence in spatial reasoning…

Jun 1, 2026

Paper page - Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

…Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports…

May 13, 2026

Paper page - NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

…foundation generative world model trained from the Cosmos diffusion model, enables real-time action-conditioned video generation for autonomous driving policy evaluation in complex, unseen scenarios. Generated by Qwen/Qwen2.5-Coder…

Jun 3, 2026

Paper page - SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

…is introduced to systematically evaluate speech modification capabilities across atomic and compositional tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Instruction-guided speech editing requires a model to modify specified speech…

Jun 5, 2026

Paper page - ModelLens: Finding the Best for Your Task from Myriads of Models

…of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model-dataset-metric tuples, ModelLens ranks unseen models on…

May 11, 2026

Paper page - QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

…Social deduction games have become a popular testbed for evaluating reasoning, deception, coordination, and belief modeling in large language models. However, most environments evaluate agents primarily through game outcomes like win rates…

May 27, 2026
