Search

Showing top 3 results for "offline capability questions"

Demystifying evals for AI agents

… Capability vs. regression evals: Capability or “quality” evals ask, “What can this agent do well?” They should start at a low pass rate, targeting tasks the agent struggles with and giving teams a hill to climb. …

Jan 9, 2026

Scaling Managed Agents: Decoupling the brain from the hands

… Our only window in was the WebSocket event stream, but that couldn’t tell us where failures arose, which meant that a bug in the harness, a packet drop in the event stream, or a container going offline all presented the same. …

Apr 8, 2026

Assessing Claude Mythos Preview’s cybersecurity capabilities

… The process's credential is now mostly a copy of init_cred: it has real uid 0, filesystem uid 0, and the full capability set, including CAP_SETUID, the capability that lets a process change its own user IDs arbitrarily. …

Apr 7, 2026
