Research Interests
- Trustworthy AI
  - Safety/security of generative models (e.g., jailbreaking large language models)
  - Certified and empirical robustness of discriminative models (e.g., for image classification, natural language processing)
  - Interpretability of neural networks (e.g., explaining classifier predictions)
Papers
(* denotes equal contribution)
[NEW!] Matching Ranks Over Probability Yields Truly Deep Safety Alignment
Jason Vega, Gagandeep Singh
Under review, 2025
We find that an SFT-based data augmentation approach to deep safety alignment still exhibits safety vulnerabilities against a more general form of the prefilling attack, which we call the Rank-Assisted Prefilling (RAP) attack. We then propose a novel perspective on achieving deep safety alignment that yields a simple fix based on attention regularization.
Paper Code
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
Jason Vega, Junsheng Huang*, Gaokai Zhang*, Hangoo Kang*, Minjia Zhang, Gagandeep Singh
arXiv, 2024
We show that low-resource, unsophisticated attackers, i.e., stochastic monkeys, can significantly improve their chances of bypassing the safety alignment of state-of-the-art LLMs with just 25 random augmentations per prompt.
Paper Code
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega*, Isha Chaudhary*, Changming Xu*, Gagandeep Singh
ICLR 2024, Tiny Papers
We investigate the fragility of state-of-the-art open-source LLMs under simple, optimization-free attacks we refer to as priming attacks (now known as prefilling attacks), which are easy to execute and effectively bypass alignment from safety training.
Paper Code Website
Other
- I grew up in the Bay Area 🌉 and will always be a Californian 🐻 at heart.
- Outside of research, I enjoy:
  - Playing the violin 🎻 in the UIUC Philharmonia Orchestra
  - Going for a run 🏃 (I'm running the Illinois Race Weekend Marathon in April 2026!)
  - Watching films and shows 🎥