Professor
Department of Computer Science
Bren School of Information and Computer Sciences
(also affiliated with Linguistics and EECS)
Dr. Sameer Singh is a Professor of Computer Science at the University of California, Irvine (UCI). His research focuses primarily on the robustness and interpretability of machine learning algorithms, along with models that reason with text and structure for natural language processing. Sameer was a postdoctoral researcher at the University of Washington and received his PhD from the University of Massachusetts, Amherst, during which he interned at Microsoft Research, Google Research, and Yahoo! Labs. He has received the NSF CAREER award, was selected as a DARPA Riser, and has been awarded the UCI ICS Mid-Career Excellence in Research Award and the Hellman and Noyce Faculty Fellowships. His group has received funding from the Allen Institute for AI, Amazon, NSF, DARPA, Adobe Research, the Hasso Plattner Institute, NEC, Base 11, and FICO. Sameer has published extensively at machine learning and natural language processing venues, including paper awards at KDD 2016, ACL 2018, EMNLP 2019, AKBC 2020, and ACL 2020.
Selected Recent Publications
- Calibrate Before Use: Improving Few-shot Performance of Language Models.
Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh.
International Conference on Machine Learning (ICML).
2021
Conference
[ PDF, arXiv, ICML Page, Video/Slides, BibTex ]

@inproceedings{poisoning:icml21,
  author    = {Tony Z. Zhao and Eric Wallace and Shi Feng and Dan Klein and Sameer Singh},
  title     = { {Calibrate Before Use: Improving Few-shot Performance of Language Models} },
  booktitle = {International Conference on Machine Learning (ICML)},
  pages     = {12697-12706},
  year      = {2021}
}
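The contextual calibration the paper proposes is compact enough to sketch directly. Below is a minimal numpy illustration of the idea, assuming the paper's diagonal correction and a content-free input such as "N/A"; the probability values are invented for the example.

import numpy as np

def calibrate(p_content_free, p_test):
    # Undo the bias the prompt induces: W = diag(1 / p_cf), then renormalize.
    scores = p_test / p_content_free
    return scores / scores.sum()

# The prompt format biases the model toward label 0 on a content-free input.
p_cf = np.array([0.7, 0.2, 0.1])   # probabilities for "N/A" as the input
p_x  = np.array([0.5, 0.4, 0.1])   # probabilities for a real test input
print(calibrate(p_cf, p_x))        # ~[0.19, 0.54, 0.27]; label 1 now wins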
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, Sameer Singh.
Empirical Methods in Natural Language Processing (EMNLP).
2020
Conference
[ PDF, Website, ACL Anthology, Abstract, BibTex ]

The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fill-in-the-blanks problems (e.g., cloze tests) is a natural approach for gauging such knowledge; however, its usage is limited by the manual effort and guesswork required to write suitable prompts. To address this, we develop AutoPrompt, an automated method to create prompts for a diverse set of tasks, based on a gradient-guided search. Using AutoPrompt, we show that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with recent state-of-the-art supervised models. We also show that our prompts elicit more accurate factual knowledge from MLMs than the manually created prompts on the LAMA benchmark, and that MLMs can be used as relation extractors more effectively than supervised relation extraction models. These results demonstrate that automatically generated prompts are a viable parameter-free alternative to existing probing methods, and as pretrained LMs become more sophisticated and capable, potentially a replacement for finetuning.

@inproceedings{autoprompt:emnlp20,
  author    = {Taylor Shin and Yasaman Razeghi and Robert L. Logan IV and Eric Wallace and Sameer Singh},
  title     = { {AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts} },
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
  pages     = {4222-4235},
  year      = {2020}
}
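A rough sketch of the gradient-guided candidate selection at AutoPrompt's core (HotFlip-style first-order scoring): swapping a trigger token changes the loss by roughly the dot product of the new token's embedding with the loss gradient. The embedding matrix and gradient below are random stand-ins for a real LM's; the actual method re-scores the top candidates with forward passes.

import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 64                      # toy vocabulary size and embedding width
E = rng.standard_normal((V, d))      # stand-in for the LM's embedding matrix
grad = rng.standard_normal(d)        # stand-in for dLoss/d(trigger embedding)

def candidates(E, grad, k=5):
    # First-order estimate: loss change for token w is ~ E[w] @ grad + const,
    # so the best replacements are the k tokens minimizing E[w] @ grad.
    return np.argsort(E @ grad)[:k]

print(candidates(E, grad))           # candidate token ids for one trigger slot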
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh.
Association for Computational Linguistics (ACL).
2020
Conference
Best Paper Award
[ PDF, Code, ACL Anthology, Video+Slides, arXiv, Abstract, BibTex ]

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests and found almost three times as many bugs as users without it.

@inproceedings{checklist:acl20,
  author    = {Marco Tulio Ribeiro and Tongshuang Wu and Carlos Guestrin and Sameer Singh},
  title     = { {Beyond Accuracy: Behavioral Testing of NLP models with CheckList} },
  booktitle = {Association for Computational Linguistics (ACL)},
  pages     = {4902-4912},
  year      = {2020}
}
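A toy rendering of two CheckList test types, a Minimum Functionality Test generated from a template and an invariance test. The template, lexicon, and deliberately buggy stand-in model below are invented for illustration; they are not part of the CheckList library.

# Stand-in model with a planted bug: "movie" is spuriously negative.
def toy_model(text):
    if any(w in text for w in ("hate", "dread", "movie")):
        return "neg"
    return "pos"

template = "I {verb} the {thing}."
verbs = {"love": "pos", "enjoy": "pos", "hate": "neg", "dread": "neg"}
things = ["food", "service", "movie"]

# MFT: templated cases whose labels are known by construction.
mft_failures = []
for verb, label in verbs.items():
    for thing in things:
        case = template.format(verb=verb, thing=thing)
        if toy_model(case) != label:
            mft_failures.append(case)

# INV: a label-preserving edit (swap one neutral noun) must not flip predictions.
inv_failures = []
for verb in verbs:
    original = template.format(verb=verb, thing="food")
    perturbed = original.replace("food", "movie")
    if toy_model(original) != toy_model(perturbed):
        inv_failures.append((original, perturbed))

print(len(mft_failures), "MFT failures;", len(inv_failures), "INV failures")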
- Revisiting Evaluation of Knowledge Base Completion Models.
Pouya Pezeshkpour, Yifan Tian, Sameer Singh.
Automated Knowledge Base Construction (AKBC).
2020
Conference
Runner-up for Best Paper Award
[ PDF, Yago3-TC Data, Video+Slides, OpenReview, AKBC Page, Abstract, BibTex ]

Representing knowledge graphs (KGs) by learning embeddings for entities and relations has led to accurate models for existing KG completion benchmarks. However, due to the open-world assumption of existing KGs, evaluation of KG completion uses ranking metrics and triple classification with negative samples, and is thus unable to directly assess models on the goal of the task: completion. In this paper, we first study the shortcomings of these evaluation metrics. Specifically, we demonstrate that these metrics (1) are unreliable for estimating how calibrated the models are, (2) make strong assumptions that are often violated, and (3) do not sufficiently and consistently differentiate embedding methods from each other, or from simpler approaches. To address these issues, we gather a semi-complete KG, referred to as YAGO3-TC, using a random subgraph from the test and validation data of YAGO3-10, which enables us to compute triple classification accuracy directly on this data. Conducting thorough experiments on existing models, we provide new insights and directions for KG completion research. Along with the dataset and the open source implementation of the models, we also provide a leaderboard for knowledge graph completion that consists of a hidden, and growing, test set, available at https://pouyapez.github.io/yago3-tc/.

@inproceedings{kbeval:akbc20,
  author    = {Pouya Pezeshkpour and Yifan Tian and Sameer Singh},
  title     = { {Revisiting Evaluation of Knowledge Base Completion Models} },
  booktitle = {Automated Knowledge Base Construction (AKBC)},
  year      = {2020}
}
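A small numpy sketch contrasting the two evaluation modes the paper examines: rank-based metrics (MRR, Hits@k) versus thresholded triple classification, which is what YAGO3-TC makes directly measurable. The scores are random stand-ins for a trained embedding model's outputs, and the threshold is arbitrary.

import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal(50)     # stand-in scores for every candidate tail
true_tail = 7
scores[true_tail] += 2.0             # nudge the gold entity toward the top

# Ranking evaluation: position of the gold tail among all candidates.
rank = 1 + int(np.sum(scores > scores[true_tail]))
print("reciprocal rank:", 1.0 / rank, "| hits@10:", rank <= 10)

# Triple classification: threshold the raw score instead of ranking it.
# This is where calibration matters: an uncalibrated score can rank well
# yet classify true/false triples poorly.
threshold = 0.0
print("classified as true:", bool(scores[true_tail] > threshold))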
- Universal Adversarial Triggers for Attacking and Analyzing NLP.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh.
Empirical Methods in Natural Language Processing (EMNLP).
2019
Conference
[ PDF, arXiv, Blog post, Code, ACL Anthology, Abstract, BibTex ]

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of “why” questions in SQuAD to be answered “to kill american people”, and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models.

@inproceedings{trigger:emnlp19,
  author    = {Eric Wallace and Shi Feng and Nikhil Kandpal and Matt Gardner and Sameer Singh},
  title     = { {Universal Adversarial Triggers for Attacking and Analyzing NLP} },
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
  doi       = {10.18653/v1/D19-1221},
  pages     = {2153-2162},
  year      = {2019}
}
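The trigger search itself mirrors the gradient-guided token scoring sketched above for AutoPrompt; the sketch below shows the input-agnostic part instead: one fixed token sequence is concatenated to every input and the accuracy drop is measured. The toy classifier and data are invented; "zoning tapping fiennes" is a sentiment trigger reported in the paper.

# Evaluate a universal trigger: prepend the same tokens to every input.
def accuracy(model, data, trigger=""):
    hits = sum(model((trigger + " " + text).strip()) == label
               for text, label in data)
    return hits / len(data)

def toy_classifier(text):            # stand-in victim model
    return "neg" if ("awful" in text or "zoning tapping" in text) else "pos"

data = [("great movie", "pos"), ("awful plot", "neg"), ("loved it", "pos")]
print("clean accuracy:    ", accuracy(toy_classifier, data))
print("triggered accuracy:", accuracy(toy_classifier, data, "zoning tapping fiennes"))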
- AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models.
Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, Sameer Singh.
Demo at the Empirical Methods in Natural Language Processing (EMNLP).
2019
Demo
Best Demonstration Paper Award
[ PDF, Project Page, ACL Anthology, arXiv, Poster, Abstract, BibTex ]

Neural NLP models are increasingly accurate but are imperfect and opaque: they break in counterintuitive ways and leave end users puzzled at their behavior. Model interpretation methods ameliorate this opacity by providing explanations for specific model predictions. Unfortunately, existing interpretation codebases make it difficult to apply these methods to new models and tasks, which hinders adoption for practitioners and burdens interpretability researchers. We introduce AllenNLP Interpret, a flexible framework for interpreting NLP models. The toolkit provides interpretation primitives (e.g., input gradients) for any AllenNLP model and task, a suite of built-in interpretation methods, and a library of front-end visualization components. We demonstrate the toolkit's flexibility and utility by implementing live demos for five interpretation methods (e.g., saliency maps and adversarial attacks) on a variety of models and tasks (e.g., masked language modeling using BERT and reading comprehension using BiDAF). These demos, alongside our code and tutorials, are available at https://allennlp.org/interpret.

@inproceedings{interpret:emnlp19,
  author    = {Eric Wallace and Jens Tuyls and Junlin Wang and Sanjay Subramanian and Matt Gardner and Sameer Singh},
  title     = { {AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models} },
  booktitle = {Demo at the Empirical Methods in Natural Language Processing (EMNLP)},
  doi       = {10.18653/v1/D19-3002},
  pages     = {7-12},
  year      = {2019}
}
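A minimal illustration of the toolkit's basic interpretation primitive, input-gradient saliency, computed in closed form for a toy logistic-regression "model". AllenNLP Interpret obtains the same quantity for arbitrary models via backpropagation rather than closed-form derivatives; the weights and input here are random stand-ins.

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(5)           # stand-in trained weights
x = rng.standard_normal(5)           # one input vector (e.g. token features)

p = 1.0 / (1.0 + np.exp(-(w @ x)))   # model prediction, sigmoid(w @ x)
grad = p * (1.0 - p) * w             # exact dp/dx for this model
saliency = np.abs(grad * x)          # gradient-times-input attribution
print("most influential feature:", int(np.argmax(saliency)))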
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2019
Conference
[ PDF, Website, arXiv, Data, ACL Anthology, Leaderboard, Demo, Abstract, BibTex ]

Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 55k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs, as they remove the paraphrase-and-entity-typing shortcuts available in prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literatures on this dataset and show that the best systems only achieve 38.4% F1 on our generalized accuracy metric, while expert human performance is 96%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.

@inproceedings{drop:naacl19,
  author    = {Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
  title     = { {DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs} },
  booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  doi       = {10.18653/v1/N19-1246},
  pages     = {2368-2378},
  year      = {2019}
}
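For reference, a sketch of the bag-of-tokens F1 at the core of the reported metric; DROP's official scorer additionally normalizes numbers and handles multi-span answers, which this version omits.

from collections import Counter

def token_f1(prediction, gold):
    # Bag-of-tokens overlap between predicted and gold answers.
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("35 touchdowns", "35"))   # partial credit: ~0.67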
- Semantically Equivalent Adversarial Rules for Debugging NLP models.
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin.
Association for Computational Linguistics (ACL).
2018
Conference
Honorable Mention for Best Paper
[ PDF, Appendix, Code, ACL Anthology, Video, Slides, Abstract, BibTex ]

@inproceedings{sears:acl18,
  author    = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin},
  title     = { {Semantically Equivalent Adversarial Rules for Debugging NLP models} },
  booktitle = {Association for Computational Linguistics (ACL)},
  doi       = {10.18653/v1/P18-1079},
  pages     = {856-865},
  year      = {2018}
}
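A minimal sketch of how a semantically equivalent adversarial rule is used: apply a meaning-preserving rewrite and flag any prediction flip as a bug. The "What is → What's" rule is one reported in the paper; the second rule and the stand-in QA model are invented for illustration.

import re

rules = [(r"\bWhat is\b", "What's"), (r"\bcolor\b", "colour")]

def toy_qa_model(question):          # stand-in model with a brittle heuristic
    return "blue" if "What is" in question else "red"

question = "What is the color of the sky?"
for pattern, replacement in rules:
    perturbed = re.sub(pattern, replacement, question)
    if perturbed != question and toy_qa_model(perturbed) != toy_qa_model(question):
        print("bug found:", repr(question), "->", repr(perturbed))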
"Why Should I Trust You?": Explaining the Predictions of Any Classifier.
Knowledge Discovery and Data Mining (KDD).
2016
Conference
Audience Appreciation Award
Also presented at the CHI 2016 Workshop on Human-Centred Machine Learning (HCML).
[ PDF, arXiv, Code, Video, O'Reilly, Code (experiments), ACM Page, BibTex ]

@inproceedings{lime:kdd16,
  author    = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin},
  title     = { {"Why Should I Trust You?": Explaining the Predictions of Any Classifier} },
  booktitle = {Knowledge Discovery and Data Mining (KDD)},
  month     = {August},
  doi       = {10.1145/2939672.2939778},
  pages     = {1135-1144},
  year      = {2016}
}
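Because LIME is model-agnostic, its core fits in one short function: perturb the input, query the black box, and fit a proximity-weighted linear surrogate whose coefficients are the explanation. This sketch assumes a binary feature-mask perturbation space and omits the sparse (K-LASSO) feature-selection step the paper describes; the black-box classifier is a stand-in.

import numpy as np

def lime_weights(model, x, n_samples=2000, sigma=0.75, seed=0):
    # Sample on/off masks over features, query the black-box model, and fit
    # a weighted least-squares surrogate; weights favor samples near x.
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 2, size=(n_samples, len(x)))
    preds = np.array([model(x * z) for z in Z])
    distance = 1.0 - Z.mean(axis=1)              # fraction of features removed
    kernel = np.exp(-(distance ** 2) / sigma ** 2)
    A = Z * np.sqrt(kernel)[:, None]
    b = preds * np.sqrt(kernel)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef                                   # per-feature local importance

black_box = lambda v: float(v[0] + 2 * v[2] > 1)  # stand-in classifier
print(lime_weights(black_box, np.ones(4)).round(2))  # feature 2 dominates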