When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
…Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks (2026)
AuditRepairBench: A Paired-Execution Trace Corpus for Evaluator-Channel Ranking Instability in Agent Repair (2026)
ValueBlindBench: Agreement-Gated Stress Testing of…