From shortcuts to sabotage: natural emergent misalignment from reward hacking
… Even though the model was never trained or instructed to engage in any misaligned behaviors, those behaviors nonetheless emerged as a side effect of the model learning to reward hack. …