Search: Safety and reliability

Paper page - Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

… However, their safety has received relatively limited attention. …

May 11, 2026

Paper page - Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

… The following papers were recommended by the Semantic Scholar API: Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning (2026); SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models (2026); Risk-Adjusted H… …

May 1, 2026

Paper page - RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

… We further introduce ICU-Evo to study structured-memory agents that improve long-horizon reasoning but do not fully eliminate safety failures. …

May 14, 2026

Paper page - MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

… This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. …

May 7, 2026

Paper page - Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

… The key finding: capital-agent reliability is an operating-layer problem. The largest reliability gains came from prompt compilation, typed controls, policy validation, execution guards, memory semantics, and full instruction-to-settlement observability. …

Apr 30, 2026

Paper page - Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

… For questions or model-evaluation requests, contact guijin.son@snu.ac.kr. I wonder whether the class imbalance is too large for objective evaluation. The way Soohak treats refusal as a first-class signal is a clever move, highlighting a real frontier beyond pure… …

May 12, 2026

Paper page - Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

… The following papers were recommended by the Semantic Scholar API: Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents (2026); ClawEnvKit: Automatic Environment Generation for Claw-Like Agents (2026); One-Eval: An Agentic System for Automated and Traceable LLM Evaluation (2026); GTA-2: Benchmarking… …

May 1, 2026

Welcome to Inference Providers on the Hub 🔥

… And it's entirely plausible that the availability of such resources en masse has been the single most important prerequisite, as well as mandatory catalyst, for the continual evolution/democratization of open source ML specifically, whether in regards to utility or literacy or safety or any other o… …

Jun 27, 2025 · Burkay Gur
