Sameer Singh

Sameer Singh
4224 Donald Bren Hall
University of California
Irvine, CA 92697-3435

Dr. Sameer Singh is an Associate Professor of Computer Science at the University of California, Irvine (UCI). He works primarily on the robustness and interpretability of machine learning algorithms, along with models that reason with text and structure for natural language processing. Sameer was a postdoctoral researcher at the University of Washington and received his PhD from the University of Massachusetts, Amherst, during which he interned at Microsoft Research, Google Research, and Yahoo! Labs. He has received the NSF CAREER award, the UCI ICS Mid-Career Excellence in Research award, and the Hellman and the Noyce Faculty Fellowships, and was selected as a DARPA Riser. His group has received funding from the Allen Institute for AI, Amazon, NSF, DARPA, Adobe Research, Hasso Plattner Institute, NEC, Base 11, and FICO. Sameer has published extensively at machine learning and natural language processing venues, with paper awards at KDD 2016, ACL 2018, EMNLP 2019, AKBC 2020, and ACL 2020.

CV (as of 2020)


Univ of California
Irvine CA
University of California, Irvine
Assistant Professor
2016 - current

Univ of Washington
Seattle WA
Postdoctoral Researcher
2013 - 2016


PhD (CS)
Univ of Massachusetts
Amherst MA

Vanderbilt University
Nashville TN

BEng (EE)
University of Delhi
New Delhi

High School
Sardar Patel Vidyalaya
New Delhi


Microsoft Research
Cambridge UK


Research Intern
Summer 2012

Google Research
Mountain View CA
Research Intern
Summer 2010

Yahoo! Labs
Sunnyvale CA
Research Intern
Summer 2009

Pittsburgh PA
Research Intern
Summer, Fall 2007

Selected Recent Publications

  • Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh. Calibrate Before Use: Improving Few-shot Performance of Language Models. International Conference on Machine Learning (ICML). 2021 Conference
    [ PDF, ArXiV, ICML Page, Video/Slides, BibTex ]
      author = {Tony Z. Zhao and Eric Wallace and Shi Feng and Dan Klein and Sameer Singh},
      title = { {Calibrate Before Use: Improving Few-shot Performance of Language Models} },
      booktitle = {International Conference on Machine Learning (ICML)},
      pages = {12697-12706},
      year = {2021}
  • Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, Sameer Singh. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. Empirical Methods in Natural Language Processing (EMNLP). 2020 Conference
    [ PDF, Website, ACL Anthology, Abstract, BibTex ]
    The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fill-in-the-blanks problems (e.g., cloze tests) is a natural approach for gauging such knowledge, however, its usage is limited by the manual effort and guesswork required to write suitable prompts. To address this, we develop AutoPrompt, an automated method to create prompts for a diverse set of tasks, based on a gradient-guided search. Using AutoPrompt, we show that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with recent state-of-the-art supervised models. We also show that our prompts elicit more accurate factual knowledge from MLMs than the manually created prompts on the LAMA benchmark, and that MLMs can be used as relation extractors more effectively than supervised relation extraction models. These results demonstrate that automatically generated prompts are a viable parameter-free alternative to existing probing methods, and as pretrained LMs become more sophisticated and capable, potentially a replacement for finetuning.
      author = {Taylor Shin and Yasaman Razeghi and Robert L. Logan IV and Eric Wallace and Sameer Singh},
      title = { {AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts} },
      booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
      pages = {4222-4235},
      year = {2020}
  • Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh. Beyond Accuracy: Behavioral Testing of NLP models with CheckList. Association for Computational Linguistics (ACL). 2020 Conference
    Best Paper Award
    [ PDF, Code, ACL Anthology, Video+Slides, ArXiV, Abstract, BibTex ]
    Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
      author = {Marco Tulio Ribeiro and Tongshuang Wu and Carlos Guestrin and Sameer Singh},
      title = { {Beyond Accuracy: Behavioral Testing of NLP models with CheckList} },
      booktitle = {Association for Computational Linguistics (ACL)},
      pages = {4902-4912},
      year = {2020}
  • Pouya Pezeshkpour, Yifan Tian, Sameer Singh. Revisiting Evaluation of Knowledge Base Completion Models. Automated Knowledge Base Construction (AKBC). 2020 Conference
    Runner-up for Best Paper Award
    [ PDF, Yago3-TC Data, Video+Slides, OpenReview, AKBC Page, Abstract, BibTex ]
    Representing knowledge graphs (KGs) by learning embeddings for entities and relations has led to accurate models for existing KG completion benchmarks. However, due to the open-world assumption of existing KGs, evaluation of KG completion uses ranking metrics and triple classification with negative samples, and is thus unable to directly assess models on the goals of the task: completion. In this paper, we first study the shortcomings of these evaluation metrics. Specifically, we demonstrate that these metrics (1) are unreliable for estimating how calibrated the models are, (2) make strong assumptions that are often violated, and (3) do not sufficiently, and consistently, differentiate embedding methods from each other, or from simpler approaches. To address these issues, we gather a semi-complete KG referred to as YAGO3-TC, using a random subgraph from the test and validation data of YAGO3-10, which enables us to compute accurate triple classification accuracy on this data. Conducting thorough experiments on existing models, we provide new insights and directions for the KG completion research. Along with the dataset and the open source implementation of the models, we also provide a leaderboard for knowledge graph completion that consists of a hidden, and growing, test set, available at
      author = {Pouya Pezeshkpour and Yifan Tian and Sameer Singh},
      title = { {Revisiting Evaluation of Knowledge Base Completion Models} },
      booktitle = {Automated Knowledge Base Construction (AKBC)},
      year = {2020}
  • Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh. Universal Adversarial Triggers for Attacking and Analyzing NLP. Empirical Methods in Natural Language Processing (EMNLP). 2019 Conference
    [ PDF, arXiv, Blog post, Code, ACL Anthology, Abstract, BibTex ]
    Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of “why” questions in SQuAD to be answered “to kill american people”, and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models.
      author = {Eric Wallace and Shi Feng and Nikhil Kandpal and Matt Gardner and Sameer Singh},
      title = { {Universal Adversarial Triggers for Attacking and Analyzing NLP} },
      booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
      doi = {10.18653/v1/D19-1221},
      pages = {2153-2162},
      year = {2019}
  • Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, Sameer Singh. AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models. Demo at the Empirical Methods in Natural Language Processing (EMNLP). 2019 Demo
    Best Demonstration Paper Award.
    [ PDF, Project Page, ACL Anthology, ArXiv, Poster, Abstract, BibTex ]
    Neural NLP models are increasingly accurate but are imperfect and opaque---they break in counterintuitive ways and leave end users puzzled at their behavior. Model interpretation methods ameliorate this opacity by providing explanations for specific model predictions. Unfortunately, existing interpretation codebases make it difficult to apply these methods to new models and tasks, which hinders adoption for practitioners and burdens interpretability researchers. We introduce AllenNLP Interpret, a flexible framework for interpreting NLP models. The toolkit provides interpretation primitives (e.g., input gradients) for any AllenNLP model and task, a suite of built-in interpretation methods, and a library of front-end visualization components. We demonstrate the toolkit's flexibility and utility by implementing live demos for five interpretation methods (e.g., saliency maps and adversarial attacks) on a variety of models and tasks (e.g., masked language modeling using BERT and reading comprehension using BiDAF). These demos, alongside our code and tutorials, are available at
      author = {Eric Wallace and Jens Tuyls and Junlin Wang and Sanjay Subramanian and Matt Gardner and Sameer Singh},
      title = { {AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models} },
      booktitle = {Demo at the Empirical Methods in Natural Language Processing (EMNLP)},
      doi = {10.18653/v1/D19-3002},
      pages = {7-12},
      year = {2019}
  • Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2019 Conference
    [ PDF, Website, arXiv, Data, ACL Anthology, Leaderboard, Demo, Abstract, BibTex ]
    Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 55k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs, as they remove the paraphrase-and-entity-typing shortcuts available in prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literatures on this dataset and show that the best systems only achieve 38.4% F1 on our generalized accuracy metric, while expert human performance is 96%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.
      author = {Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
      title = { {DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs} },
      booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
      doi = {10.18653/v1/N19-1246},
      pages = {2368-2378},
      year = {2019}
  • Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. Semantically Equivalent Adversarial Rules for Debugging NLP models. Association for Computational Linguistics (ACL). 2018 Conference
    Honorable Mention for Best Paper.
    [ PDF, Appendix, Code, ACL Anthology, Video, Slides, Abstract, BibTex ]
    Complex machine learning models for NLP are often brittle, making different predictions for input instances that are extremely similar semantically. To automatically detect this behavior for individual instances, we present semantically equivalent adversaries (SEAs) - semantic-preserving perturbations that induce changes in the model’s predictions. We generalize these adversaries into semantically equivalent adversarial rules (SEARs) - simple, universal replacement rules that induce adversaries on many instances. We demonstrate the usefulness and flexibility of SEAs and SEARs by detecting bugs in black-box state-of-the-art models for three domains: machine comprehension, visual question-answering, and sentiment analysis. Via user studies, we demonstrate that we generate high-quality local adversaries for more instances than humans, and that SEARs induce four times as many mistakes as the bugs discovered by human experts. SEARs are also actionable: retraining models using data augmentation significantly reduces bugs, while maintaining accuracy.
      author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin},
      title = { {Semantically Equivalent Adversarial Rules for Debugging NLP models} },
      booktitle = {Association for Computational Linguistics (ACL)},
      doi = {10.18653/v1/P18-1079},
      pages = {856-865},
      year = {2018}
  • Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. Knowledge Discovery and Data Mining (KDD). 2016 Conference
    Audience Appreciation Award
    Also presented at the CHI 2016 Workshop on Human-Centred Machine Learning (HCML).
    [ PDF, arXiv, Code, Video, O'Reilly, Code (experiments), ACM Page, BibTex ]
      author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin},
      title = { {"Why Should I Trust You?": Explaining the Predictions of Any Classifier} },
      booktitle = {Knowledge Discovery and Data Mining (KDD)},
      month = {August},
      doi = {10.1145/2939672.2939778},
      pages = {1135-1144},
      year = {2016}