Research Interests
- Trustworthy AI
  - Safety/security of generative models (e.g., jailbreaking large language models)
  - Certified and empirical robustness of discriminative models (e.g., for image classification, natural language processing)
  - Interpretability of neural networks (e.g., explaining classifier predictions)
 
Papers
(* denotes equal contribution)
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
Jason Vega, Junsheng Huang*, Gaokai Zhang*, Hangoo Kang*, Minjia Zhang, Gagandeep Singh
arXiv, 2024; under peer review
We show that low-resource, unsophisticated attackers, i.e., "stochastic monkeys," can significantly improve their chances of bypassing the safety alignment of SOTA LLMs with just 25 random augmentations per prompt.
Paper Code
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega*, Isha Chaudhary*, Changming Xu*, Gagandeep Singh
ICLR 2024, Tiny Papers
We investigate the fragility of SOTA open-source LLMs under simple, optimization-free attacks we refer to as priming attacks (now known as prefilling attacks), which are easy to execute and effectively bypass alignment from safety training.
Paper Code Website

Other
- I grew up in the Bay Area 🌉 and will always be a Californian 🐻 at heart.
- Outside of research, I enjoy:
  - Playing the violin 🎻 in the UIUC Philharmonia Orchestra
  - Going for a run 🏃 (I'm running the Illinois Race Weekend Half Marathon in April 2025!)
  - Watching films and shows 🎥
 
