Research Interests
- Safety of Large Language Models (LLMs)
- Efficient attacks for bypassing the safety alignment of LLMs
Papers
(* denotes equal contribution)
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
Jason Vega, Junsheng Huang*, Gaokai Zhang*, Hangoo Kang*, Minjia Zhang, Gagandeep Singh
arXiv, 2024 (to appear); under peer review
We show that low-resource, unsophisticated attackers, i.e., "stochastic monkeys", can significantly improve their chances of bypassing the safety alignment of SoTA LLMs with just 25 random augmentations per prompt.
Paper
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega*, Isha Chaudhary*, Changming Xu*, Gagandeep Singh
ICLR 2024, Tiny Papers
We investigate the fragility of SoTA open-source LLMs under simple, optimization-free attacks we refer to as priming attacks (now known as prefilling attacks), which are easy to execute and effectively bypass the alignment obtained from safety training.
Paper Code Website
Other
- I grew up in the Bay Area 🌉 and will always be a Californian 🐻 at heart.
- Outside of research, I enjoy:
  - Playing the violin 🎻 in the UIUC Philharmonia Orchestra
  - Going for a run 🏃
  - Watching films and shows 🎥