Dipanjan Das


2020

pdf bib
Syntactic Data Augmentation Increases Robustness to Inference Heuristics
Junghyun Min | R. Thomas McCoy | Dipanjan Das | Emily Pitler | Tal Linzen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Pretrained neural models such as BERT, when fine-tuned to perform natural language inference (NLI), often show high accuracy on standard datasets, but display a surprising lack of sensitivity to word order on controlled challenge sets. We hypothesize that this issue is not primarily caused by the pretrained model’s limitations, but rather by the paucity of crowdsourced NLI examples that might convey the importance of syntactic structure at the fine-tuning stage. We explore several methods to augment standard training sets with syntactically informative examples, generated by applying syntactic transformations to sentences from the MNLI corpus. The best-performing augmentation method, subject/object inversion, improved BERT’s accuracy on controlled examples that diagnose sensitivity to word order from 0.28 to 0.73, without affecting performance on the MNLI test set. This improvement generalized beyond the particular construction used for data augmentation, suggesting that augmentation causes BERT to recruit abstract syntactic representations.

pdf bib
BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam | Dipanjan Das | Ankur Parikh
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgment. We propose BLEURT, a learned evaluation metric for English based on BERT. BLEURT can model human judgment with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG data set. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.

pdf bib
ToTTo: A Controlled Table-To-Text Generation Dataset
Ankur Parikh | Xuezhi Wang | Sebastian Gehrmann | Manaal Faruqui | Bhuwan Dhingra | Diyi Yang | Dipanjan Das
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. We present systematic analyses of our dataset and annotation process as well as results achieved by several state-of-the-art baselines. While usually fluent, existing methods often hallucinate phrases that are not supported by the table, suggesting that this dataset can serve as a useful research benchmark for high-precision conditional text generation.

2019

pdf bib
BERT Rediscovers the Classical NLP Pipeline
Ian Tenney | Dipanjan Das | Ellie Pavlick
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Pre-trained text encoders have rapidly advanced the state of the art on many NLP tasks. We focus on one such model, BERT, and aim to quantify where linguistic information is captured within the network. We find that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. Qualitative analysis reveals that the model can and often does adjust this pipeline dynamically, revising lower-level decisions on the basis of disambiguating information from higher-level representations.

pdf bib
Handling Divergent Reference Texts when Evaluating Table-to-Text Generation
Bhuwan Dhingra | Manaal Faruqui | Ankur Parikh | Ming-Wei Chang | Dipanjan Das | William Cohen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Automatically constructed datasets for generating text from semi-structured data (tables), such as WikiBio, often contain reference texts that diverge from the information in the corresponding semi-structured data. We show that metrics which rely solely on the reference texts, such as BLEU and ROUGE, show poor correlation with human judgments when those references diverge. We propose a new metric, PARENT, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall. Through a large scale human evaluation study of table-to-text models for WikiBio, we show that PARENT correlates with human judgments better than existing text generation metrics. We also adapt and evaluate the information extraction based evaluation proposed by Wiseman et al (2017), and show that PARENT has comparable correlation to it, while being easier to use. We show that PARENT is also applicable when the reference texts are elicited from humans using the data from the WebNLG challenge.

pdf bib
Text Generation with Exemplar-based Adaptive Decoding
Hao Peng | Ankur Parikh | Manaal Faruqui | Bhuwan Dhingra | Dipanjan Das
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a novel conditioned text generation model. It draws inspiration from traditional template-based text generation techniques, where the source provides the content (i.e., what to say), and the template influences how to say it. Building on the successful encoder-decoder paradigm, it first encodes the content representation from the given input text; to produce the output, it retrieves exemplar text from the training data as “soft templates,” which are then used to construct an exemplar-specific decoder. We evaluate the proposed model on abstractive text summarization and data-to-text generation. Empirical results show that this model achieves strong performance and outperforms comparable baselines.

2018

pdf bib
WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse
Manaal Faruqui | Ellie Pavlick | Ian Tenney | Dipanjan Das
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We release a corpus of 43 million atomic edits across 8 languages. These edits are mined from Wikipedia edit history and consist of instances in which a human editor has inserted a single contiguous phrase into, or deleted a single contiguous phrase from, an existing sentence. We use the collected data to show that the language generated during editing differs from the language that we observe in standard corpora, and that models trained on edits encode different aspects of semantics and discourse than models trained on raw text. We release the full corpus as a resource to aid ongoing research in semantics, discourse, and representation learning.

pdf bib
Learning To Split and Rephrase From Wikipedia Edit History
Jan A. Botha | Manaal Faruqui | John Alex | Jason Baldridge | Dipanjan Das
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia’s edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.

pdf bib
Identifying Well-formed Natural Language Questions
Manaal Faruqui | Dipanjan Das
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Understanding search queries is a hard problem as it involves dealing with “word salad” text ubiquitously issued by users. However, if a query resembles a well-formed question, a natural language processing pipeline is able to perform more accurate interpretation, thus reducing downstream compounding errors. Hence, identifying whether or not a query is well formed can enhance query understanding. Here, we introduce a new task of identifying a well-formed natural language question. We construct and release a dataset of 25,100 publicly available questions classified into well-formed and non-wellformed categories and report an accuracy of 70.7% on the test set. We also show that our classifier can be used to improve the performance of neural sequence-to-sequence models for generating questions for reading comprehension.

2017

pdf bib
Neural Paraphrase Identification of Questions with Noisy Pretraining
Gaurav Singh Tomar | Thyago Duque | Oscar Täckström | Jakob Uszkoreit | Dipanjan Das
Proceedings of the First Workshop on Subword and Character Level Models in NLP

We present a solution to the problem of paraphrase identification of questions. We focus on a recent dataset of question pairs annotated with binary paraphrase labels and show that a variant of the decomposable attention model (replacing the word embeddings of the decomposable attention model of Parikh et al. 2016 with character n-gram representations) results in accurate performance on this task, while being far simpler than many competing neural architectures. Furthermore, when the model is pretrained on a noisy dataset of automatically collected question paraphrases, it obtains the best reported performance on the dataset.

2016

pdf bib
A Decomposable Attention Model for Natural Language Inference
Ankur Parikh | Oscar Täckström | Dipanjan Das | Jakob Uszkoreit
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP
Dipanjan Das | Chris Dyer | Manaal Faruqui | Yulia Tsvetkov
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

pdf bib
Transforming Dependency Structures to Logical Forms for Semantic Parsing
Siva Reddy | Oscar Täckström | Michael Collins | Tom Kwiatkowski | Dipanjan Das | Mark Steedman | Mirella Lapata
Transactions of the Association for Computational Linguistics, Volume 4

The strongly typed syntax of grammar formalisms such as CCG, TAG, LFG and HPSG offers a synchronous framework for deriving syntactic structures and semantic logical forms. In contrast—partly due to the lack of a strong type system—dependency structures are easy to annotate and have become a widely used form of syntactic analysis for many languages. However, the lack of a type system makes a formal mechanism for deriving logical forms from dependency structures challenging. We address this by introducing a robust system based on the lambda calculus for deriving neo-Davidsonian logical forms from dependency trees. These logical forms are then used for semantic parsing of natural language to Freebase. Experiments on the Free917 and Web-Questions datasets show that our representation is superior to the original dependency trees and that it outperforms a CCG-based representation on this task. Compared to prior work, we obtain the strongest result to date on Free917 and competitive results on WebQuestions.

2015

pdf bib
Semantic Role Labeling with Neural Network Factors
Nicholas FitzGerald | Oscar Täckström | Kuzman Ganchev | Dipanjan Das
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Efficient Inference and Structured Learning for Semantic Role Labeling
Oscar Täckström | Kuzman Ganchev | Dipanjan Das
Transactions of the Association for Computational Linguistics, Volume 3

We present a dynamic programming algorithm for efficient constrained inference in semantic role labeling. The algorithm tractably captures a majority of the structural constraints examined by prior work in this area, which has resorted to either approximate methods or off-the-shelf integer linear programming solvers. In addition, it allows training a globally-normalized log-linear model with respect to constrained conditional likelihood. We show that the dynamic program is several times faster than an off-the-shelf integer linear programming solver, while reaching the same solution. Furthermore, we show that our structured model results in significant improvements over its local counterpart, achieving state-of-the-art results on both PropBank- and FrameNet-annotated corpora.

2014

pdf bib
Frame-Semantic Parsing
Dipanjan Das | Desai Chen | André F. T. Martins | Nathan Schneider | Noah A. Smith
Computational Linguistics, Volume 40, Issue 1 - March 2014

pdf bib
Semantic Frame Identification with Distributed Word Representations
Karl Moritz Hermann | Dipanjan Das | Jason Weston | Kuzman Ganchev
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Enhanced Search with Wildcards and Morphological Inflections in the Google Books Ngram Viewer
Jason Mann | David Zhang | Lu Yang | Dipanjan Das | Slav Petrov
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Statistical Models for Frame-Semantic Parsing
Dipanjan Das
Proceedings of Frame Semantics in NLP: A Workshop in Honor of Chuck Fillmore (1929-2014)

pdf bib
Learning Compact Lexicons for CCG Semantic Parsing
Yoav Artzi | Dipanjan Das | Slav Petrov
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

pdf bib
Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
Kuzman Ganchev | Dipanjan Das
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging
Oscar Täckström | Dipanjan Das | Slav Petrov | Ryan McDonald | Joakim Nivre
Transactions of the Association for Computational Linguistics, Volume 1

We consider the construction of part-of-speech taggers for resource-poor languages. Recently, manually constructed tag dictionaries from Wiktionary and dictionaries projected via bitext have been used as type constraints to overcome the scarcity of annotated data in this setting. In this paper, we show that additional token constraints can be projected from a resource-rich source language to a resource-poor target language via word-aligned bitext. We present several models to this end; in particular a partially observed conditional random field model, where coupled token and type constraints provide a partial signal for training. Averaged across eight previously studied Indo-European languages, our model achieves a 25% relative error reduction over the prior state of the art. We further present successful results on seven additional languages from different families, empirically demonstrating the applicability of coupled token and type constraints across a diverse set of languages.

pdf bib
Universal Dependency Annotation for Multilingual Parsing
Ryan McDonald | Joakim Nivre | Yvonne Quirmbach-Brundage | Yoav Goldberg | Dipanjan Das | Kuzman Ganchev | Keith Hall | Slav Petrov | Hao Zhang | Oscar Täckström | Claudia Bedini | Núria Bertomeu Castelló | Jungmee Lee
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
Graph-Based Lexicon Expansion with Sparsity-Inducing Penalties
Dipanjan Das | Noah A. Smith
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
An Exact Dual Decomposition Algorithm for Shallow Semantic Parsing with Constraints
Dipanjan Das | André F. T. Martins | Noah A. Smith
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
A Universal Part-of-Speech Tagset
Slav Petrov | Dipanjan Das | Ryan McDonald
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

To facilitate future research in unsupervised induction of syntactic structure and to standardize best-practices, we propose a tagset that consists of twelve universal part-of-speech categories. In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set. As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages. We highlight the use of this resource via three experiments, that (1) compare tagging accuracies across languages, (2) present an unsupervised grammar induction approach that does not use gold standard part-of-speech tags, and (3) use the universal tags to transfer dependency parsers between languages, achieving state-of-the-art results.

2011

pdf bib
Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
Shay B. Cohen | Dipanjan Das | Noah A. Smith
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
Dipanjan Das | Slav Petrov
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Semi-Supervised Frame-Semantic Parsing for Unknown Predicates
Dipanjan Das | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
Kevin Gimpel | Nathan Schneider | Brendan O’Connor | Dipanjan Das | Daniel Mills | Jacob Eisenstein | Michael Heilman | Dani Yogatama | Jeffrey Flanigan | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Movie Reviews and Revenues: An Experiment in Text Regression
Mahesh Joshi | Dipanjan Das | Kevin Gimpel | Noah A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Probabilistic Frame-Semantic Parsing
Dipanjan Das | Nathan Schneider | Desai Chen | Noah A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Distributed Asynchronous Online Learning for Natural Language Processing
Kevin Gimpel | Dipanjan Das | Noah A. Smith
Proceedings of the Fourteenth Conference on Computational Natural Language Learning

pdf bib
SEMAFOR: Frame Argument Resolution with Log-Linear Models
Desai Chen | Nathan Schneider | Dipanjan Das | Noah A. Smith
Proceedings of the 5th International Workshop on Semantic Evaluation

2009

pdf bib
Non-textual Event Summarization by Applying Machine Learning to Template-based Language Generation
Mohit Kumar | Dipanjan Das | Sachin Agarwal | Alexander Rudnicky
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

pdf bib
Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition
Dipanjan Das | Noah A. Smith
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf bib
Automatic Extraction of Briefing Templates
Dipanjan Das | Mohit Kumar | Alexander I. Rudnicky
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Stacking Dependency Parsers
André Filipe Torres Martins | Dipanjan Das | Noah A. Smith | Eric P. Xing
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing