RewardHarness: Self-Evolving Agentic Post-Training
… This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. …