A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
…need to execute, then synthesize realistic tasks around them. The result: models that look strong on existing benchmarks face a much tougher and broader test.