Demystifying evals for AI agents
…Here's what's worked across a range of agent architectures and use cases in real-world deployment. The structure of an evaluation An evaluation (“eval”) is a test for an AI…
…In this case, the agent understands the user's goal, and is genuinely trying to help, but takes initiative beyond what the user would approve. For example, it uses a credential it…
…Finally, we’re continuing to expand our partner ecosystem with new connectors and an MCP app, so the agents draw on the data financial professionals already use. Connectors give Claude governed, real…
…In that experiment, AI’s interaction with the real world was mediated by human labor. In this robodog experiment, we took a natural next step and used robots instead of people to…
…We’re eager to build even longer-horizon, real-world tasks that push model research capabilities, and to hear creative ideas from others. Send us your interesting benchmarks, innovative uses of AI…
…of real-world software engineering: Opus 4.5 is available today on our apps, our API, and on all three major cloud platforms. If you’re a developer, simply use claude-opus…
…Both hands-on testing and evals show Claude Opus 4.6 is a meaningful improvement for design systems and large codebases, use cases that drive enormous enterprise value. It also one-shotted…
…risk, observed exposure, that combines theoretical LLM capability and real-world usage data, weighting automated (rather than augmentative) and work-related uses more heavily. AI is far from reaching its theoretical capability…
…Tailored onboarding, training, and best practices for rapid value realization. Financial institutions require the highest standards of data protection. By default, your data is not used for training our generative models, maintaining…
…Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the…