Honors
- Kavli Fellow, Kavli Frontiers of Science symposia, National Academy of Sciences (NAS) (2023)
- Distinguished Early-Career Faculty Award for Research, University of California, Irvine (2021)
- CAREER Award, National Science Foundation (NSF) (2021)
- Noyce Faculty Fellow, Noyce Institute at UCI (2020)
- Hellman Fellowship, Hellman Family Foundation (2020)
- Dean’s Mid-Career Award for Excellence in Research, University of California, Irvine (2020)
- DTEI Dean’s Honoree Award for Undergraduate Teaching, University of California, Irvine (2019)
- Outstanding Area Chair, Empirical Methods in Natural Language Processing (EMNLP) (2019)
- Dean’s Award for Excellence in Undergraduate Teaching, University of California, Irvine (2018)
- DARPA Riser, DARPA Wait, What? Event (2015)
- Yahoo! Key Scientific Challenges (KSC) Award in Machine Learning & Statistics, Yahoo! Research (2010-2011)
- Accomplishments in Search & Mining Award, UMass CS Dept and Yahoo! (2010-2011)
Paper Awards
- Dylan Slack, Satyapriya Krishna, Himabindu Lakkaraju, Sameer Singh. TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations. TSRML Workshop @ NeurIPS, 2022 (Workshop).
 Honorable Mention for Best Paper
 BibTeX: @inproceedings{talktomodel:tsrml22, author = {Dylan Slack and Satyapriya Krishna and Himabindu Lakkaraju and Sameer Singh}, title = {{TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations}}, booktitle = {TSRML Workshop @ NeurIPS}, year = {2022} }
- Robert L. Logan IV, Alexandre Passos, Sameer Singh, Ming-Wei Chang. FRUIT: Faithfully Reflecting Updated Information in Text. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022 (Conference).
 Best Task Paper Award
 Abstract: Textual knowledge bases such as Wikipedia require considerable effort to keep up to date and consistent. While automated writing assistants could potentially ease this burden, the problem of suggesting edits grounded in external knowledge has been under-explored. In this paper, we introduce the novel generation task of *faithfully reflecting updated information in text* (FRUIT) where the goal is to update an existing article given new evidence. We release the FRUIT-WIKI dataset, a collection of over 170K distantly supervised data produced from pairs of Wikipedia snapshots, along with our data generation pipeline and a gold evaluation set of 914 instances whose edits are guaranteed to be supported by the evidence. We provide benchmark results for popular generation systems as well as EDIT5 – a T5-based approach tailored to editing we introduce that establishes the state of the art. Our analysis shows that developing models that can update articles faithfully requires new capabilities for neural generation models, and opens doors to many new applications.
 BibTeX: @inproceedings{fruit:naacl22, author = {Robert L. Logan IV and Alexandre Passos and Sameer Singh and Ming-Wei Chang}, title = {{FRUIT: Faithfully Reflecting Updated Information in Text}}, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, doi = {10.18653/v1/2022.naacl-main.269}, pages = {3670-3686}, year = {2022} }
- Robert L. Logan IV, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, Sebastian Riedel. Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models. NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP), 2021 (Workshop).
 Best Poster Award
 BibTeX: @inproceedings{nullprompts:effnlp21, author = {Robert L. Logan IV and Ivana Balažević and Eric Wallace and Fabio Petroni and Sameer Singh and Sebastian Riedel}, title = {{Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models}}, booktitle = {NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP)}, year = {2021} }
- Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, Sameer Singh. COVIDLies: Detecting COVID-19 Misinformation on Social Media. EMNLP NLP Covid19 Workshop, 2020 (Workshop).
 Best Paper Award
 Abstract: The ongoing pandemic has heightened the need for developing tools to flag COVID-19-related misinformation on the internet, specifically on social media such as Twitter. However, due to novel language and the rapid change of information, existing misinformation detection datasets are not effective for evaluating systems designed to detect misinformation on this topic. Misinformation detection can be divided into two sub-tasks: (i) retrieval of misconceptions relevant to posts being checked for veracity, and (ii) stance detection to identify whether the posts Agree, Disagree, or express No Stance towards the retrieved misconceptions. To facilitate research on this task, we release COVIDLies (https://ucinlp.github.io/covid19), a dataset of 6761 expert-annotated tweets to evaluate the performance of misinformation detection systems on 86 different pieces of COVID-19 related misinformation. We evaluate existing NLP systems on this dataset, providing initial benchmarks and identifying key challenges for future models to improve upon.
 BibTeX: @inproceedings{covidlies:nlpcovid20, author = {Tamanna Hossain and Robert L. Logan IV and Arjuna Ugarte and Yoshitomo Matsubara and Sean Young and Sameer Singh}, title = {{COVIDLies: Detecting COVID-19 Misinformation on Social Media}}, booktitle = {EMNLP NLP Covid19 Workshop}, doi = {10.18653/v1/2020.nlpcovid19-2.11}, year = {2020} }
- Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh. Beyond Accuracy: Behavioral Testing of NLP models with CheckList. Association for Computational Linguistics (ACL), 2020 (Conference).
 Best Paper Award
 Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
 BibTeX: @inproceedings{checklist:acl20, author = {Marco Tulio Ribeiro and Tongshuang Wu and Carlos Guestrin and Sameer Singh}, title = {{Beyond Accuracy: Behavioral Testing of NLP models with CheckList}}, booktitle = {Association for Computational Linguistics (ACL)}, pages = {4902-4912}, year = {2020} }
- Pouya Pezeshkpour, Yifan Tian, Sameer Singh. Revisiting Evaluation of Knowledge Base Completion Models. Automated Knowledge Base Construction (AKBC), 2020 (Conference).
 Runner-up for Best Paper Award
 Abstract: Representing knowledge graphs (KGs) by learning embeddings for entities and relations has led to accurate models for existing KG completion benchmarks. However, due to the open-world assumption of existing KGs, evaluation of KG completion uses ranking metrics and triple classification with negative samples, and is thus unable to directly assess models on the goals of the task: completion. In this paper, we first study the shortcomings of these evaluation metrics. Specifically, we demonstrate that these metrics (1) are unreliable for estimating how calibrated the models are, (2) make strong assumptions that are often violated, and (3) do not sufficiently, and consistently, differentiate embedding methods from each other, or from simpler approaches. To address these issues, we gather a semi-complete KG referred as YAGO3-TC, using a random subgraph from the test and validation data of YAGO3-10, which enables us to compute accurate triple classification accuracy on this data. Conducting thorough experiments on existing models, we provide new insights and directions for the KG completion research. Along with the dataset and the open source implementation of the models, we also provide a leaderboard for knowledge graph completion that consists of a hidden, and growing, test set, available at https://pouyapez.github.io/yago3-tc/.
 BibTeX: @inproceedings{kbeval:akbc20, author = {Pouya Pezeshkpour and Yifan Tian and Sameer Singh}, title = {{Revisiting Evaluation of Knowledge Base Completion Models}}, booktitle = {Automated Knowledge Base Construction (AKBC)}, year = {2020} }
- Anthony Chen, Gabriel Stanovsky, Sameer Singh, Matt Gardner. Evaluating Question Answering Evaluation. Workshop on Machine Reading and Question Answering (MRQA), 2019 (Workshop).
 Best Paper Award
 BibTeX: @inproceedings{evalqa:mrqa19, author = {Anthony Chen and Gabriel Stanovsky and Sameer Singh and Matt Gardner}, title = {{Evaluating Question Answering Evaluation}}, booktitle = {Workshop on Machine Reading and Question Answering (MRQA)}, year = {2019} }
- Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, Sameer Singh. AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models. Demo at the Empirical Methods in Natural Language Processing (EMNLP), 2019 (Demo).
 Best Demonstration Paper Award
 Abstract: Neural NLP models are increasingly accurate but are imperfect and opaque; they break in counterintuitive ways and leave end users puzzled at their behavior. Model interpretation methods ameliorate this opacity by providing explanations for specific model predictions. Unfortunately, existing interpretation codebases make it difficult to apply these methods to new models and tasks, which hinders adoption for practitioners and burdens interpretability researchers. We introduce AllenNLP Interpret, a flexible framework for interpreting NLP models. The toolkit provides interpretation primitives (e.g., input gradients) for any AllenNLP model and task, a suite of built-in interpretation methods, and a library of front-end visualization components. We demonstrate the toolkit's flexibility and utility by implementing live demos for five interpretation methods (e.g., saliency maps and adversarial attacks) on a variety of models and tasks (e.g., masked language modeling using BERT and reading comprehension using BiDAF). These demos, alongside our code and tutorials, are available at https://allennlp.org/interpret.
 BibTeX: @inproceedings{interpret:emnlp19, author = {Eric Wallace and Jens Tuyls and Junlin Wang and Sanjay Subramanian and Matt Gardner and Sameer Singh}, title = {{AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models}}, booktitle = {Demo at the Empirical Methods in Natural Language Processing (EMNLP)}, doi = {10.18653/v1/D19-3002}, pages = {7-12}, year = {2019} }
- Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. Semantically Equivalent Adversarial Rules for Debugging NLP models. Association for Computational Linguistics (ACL), 2018 (Conference).
 Honorable Mention for Best Paper
 Abstract: Complex machine learning models for NLP are often brittle, making different predictions for input instances that are extremely similar semantically. To automatically detect this behavior for individual instances, we present semantically equivalent adversaries (SEAs) - semantic-preserving perturbations that induce changes in the model’s predictions. We generalize these adversaries into semantically equivalent adversarial rules (SEARs) - simple, universal replacement rules that induce adversaries on many instances. We demonstrate the usefulness and flexibility of SEAs and SEARs by detecting bugs in black-box state-of-the-art models for three domains: machine comprehension, visual question-answering, and sentiment analysis. Via user studies, we demonstrate that we generate high-quality local adversaries for more instances than humans, and that SEARs induce four times as many mistakes as the bugs discovered by human experts. SEARs are also actionable: retraining models using data augmentation significantly reduces bugs, while maintaining accuracy.
 BibTeX: @inproceedings{sears:acl18, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = {{Semantically Equivalent Adversarial Rules for Debugging NLP models}}, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/P18-1079}, pages = {856-865}, year = {2018} }
- Zhengli Zhao, Dheeru Dua, Sameer Singh. Generating Natural Adversarial Examples. NeurIPS Workshop on Machine Deception, 2017 (Workshop).
 Amazon Best Poster Award at the Southern California Machine Learning Symposium.
 Shorter version of the paper at ICLR 2018.
 Abstract: Due to their complex nature, it is hard to characterize the ways in which machine learning models can misbehave or be exploited when deployed. Recent work on adversarial examples, i.e. inputs with minor perturbations that result in substantially different model predictions, is helpful in evaluating the robustness of these models by exposing the adversarial scenarios where they fail. However, these malicious perturbations are often unnatural, not semantically meaningful, and not applicable to complicated domains such as language. In this paper, we propose a framework to generate natural and legible adversarial examples by searching in semantic space of dense and continuous data representation, utilizing the recent advances in generative adversarial networks. We present generated adversaries to demonstrate the potential of the proposed approach for black-box classifiers in a wide range of applications such as image classification, textual entailment, and machine translation. We include experiments to show that the generated adversaries are natural, legible to humans, and useful in evaluating and analyzing black-box classifiers.
 BibTeX: @inproceedings{natadv:mldecept17, author = {Zhengli Zhao and Dheeru Dua and Sameer Singh}, title = {{Generating Natural Adversarial Examples}}, booktitle = {NeurIPS Workshop on Machine Deception}, year = {2017} }
- Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. Model-Agnostic Interpretability of Machine Learning. ICML Workshop on Human Interpretability in Machine Learning (WHI), 2016 (Workshop).
 Best Paper Award
 BibTeX: @inproceedings{lime:whi16, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = {{Model-Agnostic Interpretability of Machine Learning}}, booktitle = {ICML Workshop on Human Interpretability in Machine Learning (WHI)}, month = {June}, year = {2016} }
- Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. Knowledge Discovery and Data Mining (KDD), 2016 (Conference).
 Audience Appreciation Award
 Also presented at the CHI 2016 Workshop on Human-Centred Machine Learning (HCML).
 BibTeX: @inproceedings{lime:kdd16, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = {{"Why Should I Trust You?": Explaining the Predictions of Any Classifier}}, booktitle = {Knowledge Discovery and Data Mining (KDD)}, month = {August}, doi = {10.1145/2939672.2939778}, pages = {1135-1144}, year = {2016} }
- Nitish Gupta, Sameer Singh. Collective Factorization for Relational Data: An Evaluation on the Yelp Datasets. Technical Report, Yelp Dataset Challenge, Round 4, 2015 (TechReport).
 Grand Prize Winner of Yelp Dataset Challenge Round 4
 BibTeX: @techreport{factordb:yelp15, author = {Nitish Gupta and Sameer Singh}, title = {{Collective Factorization for Relational Data: An Evaluation on the Yelp Datasets}}, institution = {Yelp Dataset Challenge, Round 4}, year = {2015} }
- Tim Rocktaschel, Sameer Singh, Matko Bosnjak, Sebastian Riedel. Low-dimensional Embeddings of Logic. ACL 2014 Workshop on Semantic Parsing (SP14), 2014 (Workshop).
 Exceptional Submission Award
 Also presented at StarAI 2014 with minor changes.
 BibTeX: @inproceedings{logic:sp14, author = {Tim Rocktaschel and Sameer Singh and Matko Bosnjak and Sebastian Riedel}, title = {{Low-dimensional Embeddings of Logic}}, booktitle = {ACL 2014 Workshop on Semantic Parsing (SP14)}, year = {2014} }
Grants
- CAREER: Detecting, Understanding, and Fixing Vulnerabilities in Natural Language Processing Models
 National Science Foundation (NSF)
 PI; $500,000 (2021-2026)
- REACT Team in the Reverse Engineering of Deceptions (RED) Program
 Defense Advanced Research Projects Agency (DARPA)
 Subcontract (led by Daniel Lowd from the University of Oregon) (2020-2023)
- EAGER: SaTC-EDU: Multi-Level Attack and Defense Simulation Environment for Artificial Intelligence Education and Research
 National Science Foundation (NSF)
 Co-PI (with PI Zhou Li and Co-PI Sergio Gago Masague); $300,000 (2020-2022)
- Collaborative Research: RI: Small: Post hoc Explanations in the Wild: Exposing Vulnerabilities and Ensuring Robustness
 National Science Foundation (NSF)
 PI (Collaborative with Himabindu Lakkaraju); $450,000 (total) (2020-2023)
- CCRI: ENS: Machine Learning Democratization via a Linked, Annotated Repository of Datasets
 National Science Foundation (NSF)
 PI (with Co-PIs Padhraic Smyth and Philip Papadopoulos); $1,800,000 (2019-2022)
- PLEDGES Team in the Learning with Limited Labels (LwLL) Program
 Defense Advanced Research Projects Agency (DARPA)
 Subcontract (led by Charles River Analytics) (2019-2021)
- MOWGLI Team in the Machine Common Sense (MCS) Program
 Defense Advanced Research Projects Agency (DARPA)
 Subcontract (led by USC/ISI) (2019-2023)
- ZotBot Team in the Alexa Socialbot Grand Challenge
 Amazon Inc.
 PI/Faculty Advisor; $250,000 (2019-2020)
- RI: Small: Modeling Multiple Modalities for Knowledge-Base Construction
 National Science Foundation (NSF)
 PI; $450,000 (2018-2021)
- CRII: RI: Explaining Decisions of Black-box Models via Input Perturbations
 National Science Foundation (NSF)
 PI; $175,000 (2018-2021)