anthropic.com › engineering
Demystifying evals for AI agents
…latency, token usage, cost per task, and error rates can be tracked on a static bank of tasks. Evals can also become the highest-bandwidth communication channel between product and research teams…
Jan 9, 2026
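
As a rough illustration of the excerpt's point about tracking latency, token usage, cost per task, and error rates on a static bank of tasks, here is a minimal sketch. It is not taken from the article: the `TaskRun` record, the `summarize` helper, and the per-token prices are all hypothetical names and placeholder values chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskRun:
    """One agent run on one task from a fixed task bank (hypothetical schema)."""
    task_id: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    errored: bool


# Placeholder prices in USD per token; assumptions for illustration only.
PRICE_PER_INPUT_TOKEN = 1e-6
PRICE_PER_OUTPUT_TOKEN = 2e-6


def cost_usd(run: TaskRun) -> float:
    """Estimate the cost of a single run from its token counts."""
    return (run.input_tokens * PRICE_PER_INPUT_TOKEN
            + run.output_tokens * PRICE_PER_OUTPUT_TOKEN)


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate the four metrics named in the excerpt over a set of runs."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    return {
        "mean_latency_s": mean(r.latency_s for r in runs),
        "mean_tokens": mean(r.input_tokens + r.output_tokens for r in runs),
        "mean_cost_usd": mean(cost_usd(r) for r in runs),
        "error_rate": sum(r.errored for r in runs) / len(runs),
    }


if __name__ == "__main__":
    runs = [
        TaskRun("task-a", latency_s=1.0, input_tokens=100, output_tokens=50, errored=False),
        TaskRun("task-b", latency_s=3.0, input_tokens=300, output_tokens=150, errored=True),
    ]
    for name, value in summarize(runs).items():
        print(f"{name}: {value}")
```

Because the task bank is static, re-running the same tasks after a model or prompt change and comparing the two summaries gives a like-for-like view of how these metrics moved.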