Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
… For example, an attacker might embed a harmful query as a series of functions scattered throughout a codebase, then instruct the model to extract and respond to the hidden message. …