Models That Know How Evaluations Are Designed Score Safer
… Evaluating this fine-tuned model on six safety benchmarks, we find that it is significantly safer than both the base model and the control model. This behavioral shift persists even when the analysis is restricted to responses that lack explicit verbalization of evaluation awareness. …
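The restricted analysis described above can be illustrated with a minimal sketch. It is not the authors' code. The record fields (`model`, `is_safe`, `verbalizes_eval`) and the function name `safe_rate` are hypothetical, chosen only for illustration, and the excerpt does not say how safety or verbalization was actually labeled.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Response:
    # All fields are hypothetical illustrations, not the paper's schema.
    model: str             # e.g. "fine_tuned", "base", "control"
    is_safe: bool          # safety judgment assigned by some benchmark grader
    verbalizes_eval: bool  # True if the response explicitly mentions being evaluated


def safe_rate(
    responses: Iterable[Response],
    model: str,
    exclude_verbalized: bool = False,
) -> Optional[float]:
    """Fraction of safe responses for one model.

    With exclude_verbalized=True, responses that explicitly verbalize
    evaluation awareness are dropped before the rate is computed.
    Returns None if no responses remain after filtering.
    """
    kept = [
        r
        for r in responses
        if r.model == model and not (exclude_verbalized and r.verbalizes_eval)
    ]
    if not kept:
        return None
    return sum(r.is_safe for r in kept) / len(kept)
```

Comparing `safe_rate(rs, "fine_tuned", exclude_verbalized=True)` against `safe_rate(rs, "base", exclude_verbalized=True)` is one way to check whether a gap remains once verbalized responses are removed. A real analysis would also need a significance test, which this sketch omits.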