Demystifying evals for AI agents
… Capability vs. regression evals
Capability or “quality” evals ask, “What can this agent do well?” They should start at a low pass rate, targeting tasks the agent struggles with and giving teams a hill to climb. …
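To make "start at a low pass rate" concrete, here is a minimal Python sketch that computes per-task pass rates over repeated trials and flags the tasks an agent currently struggles with. The task names, the `TaskResult` type, and the 0.3 threshold are all hypothetical, chosen for illustration only; the excerpt does not specify any of them.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of running one eval task several times (hypothetical shape)."""
    task_id: str
    passes: int
    trials: int

    @property
    def pass_rate(self) -> float:
        # Guard against division by zero when a task has not been run yet.
        return self.passes / self.trials if self.trials else 0.0


# Assumption: "low" means at or below 30%. Pick a value that suits your suite.
LOW_PASS_RATE = 0.3


def capability_candidates(results, threshold=LOW_PASS_RATE):
    """Return ids of tasks whose pass rate is low enough to leave a hill to climb."""
    return [r.task_id for r in results if r.pass_rate <= threshold]


# Made-up results, for illustration only.
results = [
    TaskResult("refactor-module", passes=1, trials=10),
    TaskResult("fix-typo", passes=10, trials=10),
    TaskResult("multi-step-debug", passes=0, trials=10),
]
candidates = capability_candidates(results)
print(candidates)
```

Tasks the agent already passes every time (here, the made-up `fix-typo`) give no signal about improvement, which is why they are filtered out of the capability set in this sketch.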
… Our only window in was the WebSocket event stream, but it couldn’t tell us where failures arose, so a bug in the harness, a packet drop in the event stream, or a container going offline all looked identical. …
… The process's credential is now mostly a copy of init_cred: it has real uid 0, filesystem uid 0, and the full capability set, including CAP_SETUID, the capability that lets a process change its own user IDs arbitrarily. …
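The credential state described above can be observed from user space on Linux through `/proc/<pid>/status`. The following is a minimal Python sketch, not the method used in the excerpt: it parses the `Uid` and `CapEff` fields and tests the CAP_SETUID bit (bit 7 in `linux/capability.h`). The helper names and both sample status texts are hypothetical, and the exact "full" capability mask varies by kernel version.

```python
CAP_SETUID = 7  # bit index of CAP_SETUID in the kernel's capability masks


def parse_status(text):
    """Turn 'Key:\\tvalue' lines from /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe_cred(status_text):
    """Summarise the uids and the CAP_SETUID bit from a status dump."""
    fields = parse_status(status_text)
    # The Uid line lists real, effective, saved-set, and filesystem uids.
    real, effective, saved, fs = (int(x) for x in fields["Uid"].split())
    cap_eff = int(fields["CapEff"], 16)  # effective capability set, in hex
    return {
        "real_uid": real,
        "fs_uid": fs,
        "has_cap_setuid": bool((cap_eff >> CAP_SETUID) & 1),
    }


def read_self_status():
    """Read the current process's status file (Linux only)."""
    with open("/proc/self/status") as f:
        return f.read()


# Illustrative samples only, not captured from a real exploit run.
ROOT_LIKE = "Name:\tsh\nUid:\t0\t0\t0\t0\nCapEff:\t000001ffffffffff\n"
UNPRIVILEGED = "Name:\tsh\nUid:\t1000\t1000\t1000\t1000\nCapEff:\t0000000000000000\n"

root_view = describe_cred(ROOT_LIKE)
user_view = describe_cred(UNPRIVILEGED)
print(root_view)
print(user_view)
```

On a Linux machine, `describe_cred(read_self_status())` reports the same fields for the running interpreter.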