Paper page - MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills
… MedSkillAudit isn't a benchmark for ranking; it's a governance tool: structured feedback, actionable optimization guidance, pre-deployment gating. …
… A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. …
… Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is constrained by strict computational budgets. …
… To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. …
… We study a 21-day deployment of 3,505 user-funded agents trading real ETH onchain. …
… These results show that safety behavior is not stable under ordinary downstream adaptation, an important finding for anyone fine-tuning models, and they raise critical questions about governance and deployment practices centered on base-model evaluations. …
… This report describes ARIS (Auto-Research-in-sleep), an open-source research harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. …