Paper page - Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
… However, their safety has received relatively limited attention. …
… The following papers were recommended by the Semantic Scholar API: Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning (2026); SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models (2026); Risk-Adjusted H… …
… We further introduce ICU-Evo to study structured-memory agents that improve long-horizon reasoning but do not fully eliminate safety failures. …
… This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. …
… The key finding: capital-agent reliability is an operating-layer problem. The largest reliability gains came from prompt compilation, typed controls, policy validation, execution guards, memory semantics, and full instruction-to-settlement observability. …
… For questions or model-evaluation requests, contact guijin.son@snu.ac.kr. I wonder whether the class imbalance is too large for an objective evaluation. The way soohak treats refusal as a first-class signal is a clever move, highlighting a real frontier beyond pure… …
… The following papers were recommended by the Semantic Scholar API: Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents (2026); ClawEnvKit: Automatic Environment Generation for Claw-Like Agents (2026); One-Eval: An Agentic System for Automated and Traceable LLM Evaluation (2026); GTA-2: Benchmarking… …
… And it's entirely plausible that the availability of such resources en masse has been the single most important prerequisite, as well as mandatory catalyst, for the continual evolution/democratization of open source ML specifically, whether in regards to utility or literacy or safety or any other o… …