Search: publishing agreement

Paper page - The First Token Knows: Single-Decode Confidence for Hallucination Detection

… 0.793 semantic agreement vs. 0.791 surface-form self-consistency. Across Llama-3.1-8B, Mistral-7B-v0.3, and Qwen2.5-7B on PopQA and TriviaQA (n=1000 each). Ensembling ϕ_first with semantic agreement adds only +0.02 AUROC: first-token confidence already carries most of the signal. Feedback welcome. …

May 7, 2026

Paper page - Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

… We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Kr… …

Jun 2, 2026

Paper page - Unsupervised Skill Discovery for Agentic Data Analysis

… For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. …

Jun 5, 2026

Paper page - MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

… System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline. …

May 7, 2026

Paper page - When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

… Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. …

May 8, 2026

Paper page - BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

… Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. …

Jun 10, 2026

Paper page - Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

… We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and w… …

May 15, 2026

Paper page - UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

… UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. …

May 11, 2026

Paper page - VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

… We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. …