From shortcuts to sabotage: natural emergent misalignment from reward hacking
… Misaligned models sabotaging safety research is one of the risks we’re most concerned about—we predict that AI models will themselves perform a lot of AI safety research in the near future, and we want to be assured that the results are trustworthy. …