Manaal Faruqui


2020

ToTTo: A Controlled Table-To-Text Generation Dataset
Ankur Parikh | Xuezhi Wang | Sebastian Gehrmann | Manaal Faruqui | Bhuwan Dhingra | Diyi Yang | Dipanjan Das
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. We present systematic analyses of our dataset and annotation process as well as results achieved by several state-of-the-art baselines. While usually fluent, existing methods often hallucinate phrases that are not supported by the table, suggesting that this dataset can serve as a useful research benchmark for high-precision conditional text generation.

Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics
Iryna Gurevych | Marianna Apidianaki | Manaal Faruqui
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

2019

Handling Divergent Reference Texts when Evaluating Table-to-Text Generation
Bhuwan Dhingra | Manaal Faruqui | Ankur Parikh | Ming-Wei Chang | Dipanjan Das | William Cohen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Automatically constructed datasets for generating text from semi-structured data (tables), such as WikiBio, often contain reference texts that diverge from the information in the corresponding semi-structured data. We show that metrics which rely solely on the reference texts, such as BLEU and ROUGE, show poor correlation with human judgments when those references diverge. We propose a new metric, PARENT, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall. Through a large-scale human evaluation study of table-to-text models for WikiBio, we show that PARENT correlates with human judgments better than existing text generation metrics. We also adapt and evaluate the information-extraction-based evaluation proposed by Wiseman et al. (2017), and show that PARENT has comparable correlation to it, while being easier to use. We show that PARENT is also applicable when the reference texts are elicited from humans using the data from the WebNLG challenge.

Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP
Haim Dubossarsky | Arya D. McCarthy | Edoardo Maria Ponti | Ivan Vulić | Ekaterina Vylomova | Yevgeni Berzak | Ryan Cotterell | Manaal Faruqui | Anna Korhonen | Roi Reichart
Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP

Text Generation with Exemplar-based Adaptive Decoding
Hao Peng | Ankur Parikh | Manaal Faruqui | Bhuwan Dhingra | Dipanjan Das
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a novel conditioned text generation model. It draws inspiration from traditional template-based text generation techniques, where the source provides the content (i.e., what to say), and the template influences how to say it. Building on the successful encoder-decoder paradigm, it first encodes the content representation from the given input text; to produce the output, it retrieves exemplar text from the training data as “soft templates,” which are then used to construct an exemplar-specific decoder. We evaluate the proposed model on abstractive text summarization and data-to-text generation. Empirical results show that this model achieves strong performance and outperforms comparable baselines.

2018

GHH at SemEval-2018 Task 10: Discovering Discriminative Attributes in Distributional Semantics
Mohammed Attia | Younes Samih | Manaal Faruqui | Wolfgang Maier
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes our system submission to the SemEval 2018 Task 10 on Capturing Discriminative Attributes. Given two concepts and an attribute, the task is to determine whether the attribute is semantically related to one concept and not the other. In this work we assume that discriminative attributes can be detected by discovering the association (or lack of association) between a pair of words. The hypothesis we test in this contribution is whether the semantic difference between two pairs of concepts can be treated in terms of measuring the distance between words in a vector space, or can simply be obtained as a by-product of word co-occurrence counts.

UniMorph 2.0: Universal Morphology
Christo Kirov | Ryan Cotterell | John Sylak-Glassman | Géraldine Walther | Ekaterina Vylomova | Patrick Xia | Manaal Faruqui | Sabrina J. Mielke | Arya McCarthy | Sandra Kübler | David Yarowsky | Jason Eisner | Mans Hulden
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse
Manaal Faruqui | Ellie Pavlick | Ian Tenney | Dipanjan Das
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We release a corpus of 43 million atomic edits across 8 languages. These edits are mined from Wikipedia edit history and consist of instances in which a human editor has inserted a single contiguous phrase into, or deleted a single contiguous phrase from, an existing sentence. We use the collected data to show that the language generated during editing differs from the language that we observe in standard corpora, and that models trained on edits encode different aspects of semantics and discourse than models trained on raw text. We release the full corpus as a resource to aid ongoing research in semantics, discourse, and representation learning.

Learning To Split and Rephrase From Wikipedia Edit History
Jan A. Botha | Manaal Faruqui | John Alex | Jason Baldridge | Dipanjan Das
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia’s edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.

Identifying Well-formed Natural Language Questions
Manaal Faruqui | Dipanjan Das
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Understanding search queries is a hard problem as it involves dealing with “word salad” text ubiquitously issued by users. However, if a query resembles a well-formed question, a natural language processing pipeline is able to perform more accurate interpretation, thus reducing downstream compounding errors. Hence, identifying whether or not a query is well formed can enhance query understanding. Here, we introduce a new task of identifying a well-formed natural language question. We construct and release a dataset of 25,100 publicly available questions classified into well-formed and non-wellformed categories and report an accuracy of 70.7% on the test set. We also show that our classifier can be used to improve the performance of neural sequence-to-sequence models for generating questions for reading comprehension.

Proceedings of the Second Workshop on Subword/Character LEvel Models
Manaal Faruqui | Hinrich Schütze | Isabel Trancoso | Yulia Tsvetkov | Yadollah Yaghoobzadeh
Proceedings of the Second Workshop on Subword/Character LEvel Models

2017

CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages
Ryan Cotterell | Christo Kirov | John Sylak-Glassman | Géraldine Walther | Ekaterina Vylomova | Patrick Xia | Manaal Faruqui | Sandra Kübler | David Yarowsky | Jason Eisner | Mans Hulden
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

Proceedings of the First Workshop on Subword and Character Level Models in NLP
Manaal Faruqui | Hinrich Schuetze | Isabel Trancoso | Yadollah Yaghoobzadeh
Proceedings of the First Workshop on Subword and Character Level Models in NLP

Cross-Lingual Word Representations: Induction and Evaluation
Manaal Faruqui | Anders Søgaard | Ivan Vulić
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

In the recent past, NLP as a field has seen tremendous utility of distributional word vector representations as features in downstream tasks. The fact that these word vectors can be trained on unlabeled monolingual corpora of a language makes them an inexpensive resource in NLP. With the increasing use of monolingual word vectors, there is a need for word vectors that can be used as efficiently across multiple languages as monolingually. Therefore, learning bilingual and multilingual word embeddings/vectors is currently an important research topic. These vectors offer an elegant and language-pair independent way to represent content across different languages. This tutorial aims to bring NLP researchers up to speed with the current techniques in cross-lingual word representation learning. We will first discuss how to induce cross-lingual word representations (covering both bilingual and multilingual ones) from various data types and resources (e.g., parallel data, comparable data, non-aligned monolingual data in different languages, dictionaries and thesauri, or even images and eye-tracking data). We will then discuss how to evaluate such representations, intrinsically and extrinsically. We will introduce researchers to state-of-the-art methods for constructing cross-lingual word representations and discuss their applicability in a broad range of downstream NLP applications. We will deliver a detailed survey of the current methods, discuss best training and evaluation practices and use-cases, and provide links to publicly available implementations, datasets, and pre-trained models.

2016

Morphological Inflection Generation Using Character Sequence to Sequence Learning
Manaal Faruqui | Yulia Tsvetkov | Graham Neubig | Chris Dyer
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning
Yulia Tsvetkov | Sunayana Sitaram | Manaal Faruqui | Guillaume Lample | Patrick Littell | David Mortensen | Alan W Black | Lori Levin | Chris Dyer
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning
Yulia Tsvetkov | Manaal Faruqui | Wang Ling | Brian MacWhinney | Chris Dyer
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-lingual Models of Word Embeddings: An Empirical Comparison
Shyam Upadhyay | Manaal Faruqui | Chris Dyer | Dan Roth
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP
Dipanjan Das | Chris Dyer | Manaal Faruqui | Yulia Tsvetkov
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

Problems With Evaluation of Word Embeddings Using Word Similarity Tasks
Manaal Faruqui | Yulia Tsvetkov | Pushpendre Rastogi | Chris Dyer
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

Correlation-based Intrinsic Evaluation of Word Vector Representations
Yulia Tsvetkov | Manaal Faruqui | Chris Dyer
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

Morpho-syntactic Lexicon Generation Using Graph-based Semi-supervised Learning
Manaal Faruqui | Ryan McDonald | Radu Soricut
Transactions of the Association for Computational Linguistics, Volume 4

Morpho-syntactic lexicons provide information about the morphological and syntactic roles of words in a language. Such lexicons are not available for all languages and even when available, their coverage can be limited. We present a graph-based semi-supervised learning method that uses the morphological, syntactic and semantic relations between words to automatically construct wide coverage lexicons from small seed sets. Our method is language-independent, and we show that we can expand a 1000 word seed lexicon to more than 100 times its size with high quality for 11 languages. In addition, the automatically created lexicons provide features that improve performance in two downstream tasks: morphological tagging and dependency parsing.

2015

Evaluation of Word Vector Representations by Subspace Alignment
Yulia Tsvetkov | Manaal Faruqui | Wang Ling | Guillaume Lample | Chris Dyer
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Multilingual Open Relation Extraction Using Cross-lingual Projection
Manaal Faruqui | Shankar Kumar
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Retrofitting Word Vectors to Semantic Lexicons
Manaal Faruqui | Jesse Dodge | Sujay Kumar Jauhar | Chris Dyer | Eduard Hovy | Noah A. Smith
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Sparse Overcomplete Word Vector Representations
Manaal Faruqui | Yulia Tsvetkov | Dani Yogatama | Chris Dyer | Noah A. Smith
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Non-distributional Word Vector Representations
Manaal Faruqui | Chris Dyer
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Community Evaluation and Exchange of Word Vectors at wordvectors.org
Manaal Faruqui | Chris Dyer
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Improving Vector Space Word Representations Using Multilingual Correlation
Manaal Faruqui | Chris Dyer
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

Augmenting English Adjective Senses with Supersenses
Yulia Tsvetkov | Nathan Schneider | Dirk Hovy | Archna Bhatia | Manaal Faruqui | Chris Dyer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We develop a supersense taxonomy for adjectives, based on that of GermaNet, and apply it to English adjectives in WordNet using human annotation and supervised classification. Results show that accuracy for automatic adjective type classification is high, but synsets are considerably more difficult to classify, even for trained human annotators. We release the manually annotated data, the classifier, and the induced supersense labeling of 12,304 WordNet adjective synsets.

2013

Identifying the L1 of non-native writers: the CMU-Haifa system
Yulia Tsvetkov | Naama Twitto | Nathan Schneider | Noam Ordan | Manaal Faruqui | Victor Chahuneau | Shuly Wintner | Chris Dyer
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

A Framework for (Under)specifying Dependency Syntax without Overloading Annotators
Nathan Schneider | Brendan O’Connor | Naomi Saphra | David Bamman | Manaal Faruqui | Noah A. Smith | Chris Dyer | Jason Baldridge
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

An Information Theoretic Approach to Bilingual Word Clustering
Manaal Faruqui | Chris Dyer
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

Towards a model of formal and informal address in English
Manaal Faruqui | Sebastian Padó
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Acquiring entailment pairs across languages and domains: A Data Analysis
Manaal Faruqui | Sebastian Padó
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

Soundex-based Translation Correction in Urdu–English Cross-Language Information Retrieval
Manaal Faruqui | Prasenjit Majumder | Sebastian Padó
Proceedings of the Fifth International Workshop On Cross Lingual Information Access

“I Thou Thee, Thou Traitor”: Predicting Formal vs. Informal Address in English Literature
Manaal Faruqui | Sebastian Padó
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies