Demystifying evals for AI agents
… As shown in the illustrative YAML file below, one could evaluate this agent using both graders and metrics.

task:
  id: "fix-auth-bypass_1"
  desc: "Fix authentication bypass when password field is empty and ..."
  graders:
    - type: deterministic_tests
      required: test_empty_pw_rejected.py, test_null_pw_rej… …
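To make the idea of a deterministic test grader concrete, here is a minimal Python sketch of how a task with such a grader might be represented and scored. It is not the article's implementation: the class names `Task` and `DeterministicTestsGrader`, the `grade` and `evaluate` methods, and the test file names `test_a.py` and `test_b.py` are all hypothetical, and the sketch assumes test outcomes have already been collected into a mapping from file name to pass/fail.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DeterministicTestsGrader:
    """Hypothetical grader: passes only if every required test passed."""

    required: list[str]

    def grade(self, results: dict[str, bool]) -> bool:
        # A required test that is missing from the results counts as a failure.
        return all(results.get(name, False) for name in self.required)


@dataclass
class Task:
    """Hypothetical task record mirroring the id/desc/graders fields."""

    id: str
    desc: str
    graders: list[DeterministicTestsGrader] = field(default_factory=list)

    def evaluate(self, results: dict[str, bool]) -> bool:
        # The task passes only when every attached grader passes.
        # Note: a task with no graders passes vacuously.
        return all(grader.grade(results) for grader in self.graders)


# Illustrative values only; none of these names come from the article.
task = Task(
    id="example-task",
    desc="Hypothetical task used only for illustration",
    graders=[DeterministicTestsGrader(required=["test_a.py", "test_b.py"])],
)

if __name__ == "__main__":
    print(task.evaluate({"test_a.py": True, "test_b.py": True}))
    print(task.evaluate({"test_a.py": True}))
```

Treating a missing test result as a failure is a deliberate choice in this sketch: it keeps an agent from passing by deleting or skipping a required test.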