Adversarial Attacks on Aligned Language Models
I decided to ask a certain popular language model how to build an explosive from everyday items (for no particular reason), but it wouldn't give me a plausible answer. What is happening here?