Gerhard Weikum


2020

pdf bib
RedDust: a Large Reusable Dataset of Reddit User Traits
Anna Tigunova | Paramita Mirza | Andrew Yates | Gerhard Weikum
Proceedings of the 12th Language Resources and Evaluation Conference

Social media is a rich source of assertions about personal traits, such as “I am a doctor” or “my hobby is playing tennis”. Precisely identifying explicit assertions is difficult, though, because of the users’ highly varied vocabulary and language expressions. Identifying personal traits from implicit assertions like I’ve been at work treating patients all day is even more challenging. This paper presents RedDust, a large-scale annotated resource for user profiling for over 300k Reddit users across five attributes: profession, hobby, family status, age,and gender. We construct RedDust using a diverse set of high-precision patterns and demonstrate its use as a resource for developing learning models to deal with implicit assertions. RedDust consists of users’ personal traits, which are (attribute, value) pairs, along with users’ post ids, which may be used to retrieve the posts from a publicly available crawl or from the Reddit API. We discuss the construction of the resource and show interesting statistics and insights into the data. We also compare different classifiers, which can be learned from RedDust. To the best of our knowledge, RedDust is the first annotated language resource about Reddit users at large scale. We envision further use cases of RedDust for providing background knowledge about user traits, to enhance personalized search and recommendation as well as conversational agents.

pdf bib
CHARM: Inferring Personal Attributes from Conversations
Anna Tigunova | Andrew Yates | Paramita Mirza | Gerhard Weikum
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Personal knowledge about users’ professions, hobbies, favorite food, and travel preferences, among others, is a valuable asset for individualized AI, such as recommenders or chatbots. Conversations in social media, such as Reddit, are a rich source of data for inferring personal facts. Prior work developed supervised methods to extract this knowledge, but these approaches can not generalize beyond attribute values with ample labeled training samples. This paper overcomes this limitation by devising CHARM: a zero-shot learning method that creatively leverages keyword extraction and document retrieval in order to predict attribute values that were never seen during training. Experiments with large datasets from Reddit show the viability of CHARM for open-ended attributes, such as professions and hobbies.

pdf bib
ENTYFI: A System for Fine-grained Entity Typing in Fictional Texts
Cuong Xuan Chu | Simon Razniewski | Gerhard Weikum
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Fiction and fantasy are archetypes of long-tail domains that lack suitable NLP methodologies and tools. We present ENTYFI, a web-based system for fine-grained typing of entity mentions in fictional texts. It builds on 205 automatically induced high-quality type systems for popular fictional domains, and provides recommendations towards reference type systems for given input texts. Users can exploit the richness and diversity of these reference type systems for fine-grained supervised typing, in addition, they can choose among and combine four other typing modules: pre-trained real-world models, unsupervised dependency-based typing, knowledge base lookups, and constraint-based candidate consolidation. The demonstrator is available at: https://d5demos.mpi-inf.mpg.de/entyfi.

2019

pdf bib
Coverage of Information Extraction from Sentences and Paragraphs
Simon Razniewski | Nitisha Jain | Paramita Mirza | Gerhard Weikum
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Scalar implicatures are language features that imply the negation of stronger statements, e.g., “She was married twice” typically implicates that she was not married thrice. In this paper we discuss the importance of scalar implicatures in the context of textual information extraction. We investigate how textual features can be used to predict whether a given text segment mentions all objects standing in a certain relationship with a certain subject. Preliminary results on Wikipedia indicate that this prediction is feasible, and yields informative assessments.

pdf bib
STANCY: Stance Classification Based on Consistency Cues
Kashyap Popat | Subhabrata Mukherjee | Andrew Yates | Gerhard Weikum
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Controversial claims are abundant in online media and discussion forums. A better understanding of such claims requires analyzing them from different perspectives. Stance classification is a necessary step for inferring these perspectives in terms of supporting or opposing the claim. In this work, we present a neural network model for stance classification leveraging BERT representations and augmenting them with a novel consistency constraint. Experiments on the Perspectrum dataset, consisting of claims and users’ perspectives from various debate websites, demonstrate the effectiveness of our approach over state-of-the-art baselines.

pdf bib
Neural Relation Extraction for Knowledge Base Enrichment
Bayu Distiawan Trisedya | Gerhard Weikum | Jianzhong Qi | Rui Zhang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We study relation extraction for knowledge base (KB) enrichment. Specifically, we aim to extract entities and their relationships from sentences in the form of triples and map the elements of the extracted triples to an existing KB in an end-to-end manner. Previous studies focus on the extraction itself and rely on Named Entity Disambiguation (NED) to map triples into the KB space. This way, NED errors may cause extraction errors that affect the overall precision and recall.To address this problem, we propose an end-to-end relation extraction model for KB enrichment based on a neural encoder-decoder model. We collect high-quality training data by distant supervision with co-reference resolution and paraphrase detection. We propose an n-gram based attention model that captures multi-word entity names in a sentence. Our model employs jointly learned word and entity embeddings to support named entity disambiguation. Finally, our model uses a modified beam search and a triple classifier to help generate high-quality triples. Our model outperforms state-of-the-art baselines by 15.51% and 8.38% in terms of F1 score on two real-world datasets.

pdf bib
ComQA: A Community-sourced Dataset for Complex Factoid Question Answering with Paraphrase Clusters
Abdalghani Abujabal | Rishiraj Saha Roy | Mohamed Yahya | Gerhard Weikum
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

To bridge the gap between the capabilities of the state-of-the-art in factoid question answering (QA) and what users ask, we need large datasets of real user questions that capture the various question phenomena users are interested in, and the diverse ways in which these questions are formulated. We introduce ComQA, a large dataset of real user questions that exhibit different challenging aspects such as compositionality, temporal reasoning, and comparisons. ComQA questions come from the WikiAnswers community QA platform, which typically contains questions that are not satisfactorily answerable by existing search engine technology. Through a large crowdsourcing effort, we clean the question dataset, group questions into paraphrase clusters, and annotate clusters with their answers. ComQA contains 11,214 questions grouped into 4,834 paraphrase clusters. We detail the process of constructing ComQA, including the measures taken to ensure its high quality while making effective use of crowdsourcing. We also present an extensive analysis of the dataset and the results achieved by state-of-the-art systems on ComQA, demonstrating that our dataset can be a driver of future research on QA.

2018

pdf bib
DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning
Kashyap Popat | Subhabrata Mukherjee | Andrew Yates | Gerhard Weikum
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Misinformation such as fake news is one of the big challenges of our society. Research on automated fact-checking has proposed methods based on supervised learning, but these approaches do not consider external evidence apart from labeled training instances. Recent approaches counter this deficit by considering external sources related to a claim. However, these methods require substantial feature modeling and rich lexicons. This paper overcomes these limitations of prior work with an end-to-end model for evidence-aware credibility assessment of arbitrary textual claims, without any human intervention. It presents a neural network model that judiciously aggregates signals from external evidence articles, the language of these articles and the trustworthiness of their sources. It also derives informative features for generating user-comprehensible explanations that makes the neural network predictions transparent to the end-user. Experiments with four datasets and ablation studies show the strength of our method.

pdf bib
Facts That Matter
Marco Ponza | Luciano Del Corro | Gerhard Weikum
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This work introduces fact salience: The task of generating a machine-readable representation of the most prominent information in a text document as a set of facts. We also present SalIE, the first fact salience system. SalIE is unsupervised and knowledge agnostic, based on open information extraction to detect facts in natural language text, PageRank to determine their relevance, and clustering to promote diversity. We compare SalIE with several baselines (including positional, standard for saliency tasks), and in an extrinsic evaluation, with state-of-the-art automatic text summarizers. SalIE outperforms baselines and text summarizers showing that facts are an effective way to compress information.

pdf bib
A Study of the Importance of External Knowledge in the Named Entity Recognition Task
Dominic Seyler | Tatiana Dembelova | Luciano Del Corro | Johannes Hoffart | Gerhard Weikum
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this work, we discuss the importance of external knowledge for performing Named Entity Recognition (NER). We present a novel modular framework that divides the knowledge into four categories according to the depth of knowledge they convey. Each category consists of a set of features automatically generated from different information sources, such as a knowledge-base, a list of names, or document-specific semantic annotations. Further, we show the effects on performance when incrementally adding deeper knowledge and discuss effectiveness/efficiency trade-offs.

pdf bib
diaNED: Time-Aware Named Entity Disambiguation for Diachronic Corpora
Prabal Agarwal | Jannik Strötgen | Luciano del Corro | Johannes Hoffart | Gerhard Weikum
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Named Entity Disambiguation (NED) systems perform well on news articles and other texts covering a specific time interval. However, NED quality drops when inputs span long time periods like in archives or historic corpora. This paper presents the first time-aware method for NED that resolves ambiguities even when mention contexts give only few cues. The method is based on computing temporal signatures for entities and comparing these to the temporal contexts of input mentions. Our experiments show superior quality on a newly created diachronic corpus.

2017

pdf bib
Cardinal Virtues: Extracting Relation Cardinalities from Text
Paramita Mirza | Simon Razniewski | Fariz Darari | Gerhard Weikum
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Information extraction (IE) from text has largely focused on relations between individual entities, such as who has won which award. However, some facts are never fully mentioned, and no IE method has perfect recall. Thus, it is beneficial to also tap contents about the cardinalities of these relations, for example, how many awards someone has won. We introduce this novel problem of extracting cardinalities and discusses the specific challenges that set it apart from standard IE. We present a distant supervision method using conditional random fields. A preliminary evaluation results in precision between 3% and 55%, depending on the difficulty of relations.

pdf bib
WebChild 2.0 : Fine-Grained Commonsense Knowledge Distillation
Niket Tandon | Gerard de Melo | Gerhard Weikum
Proceedings of ACL 2017, System Demonstrations

pdf bib
Efficiency-aware Answering of Compositional Questions using Answer Type Prediction
David Ziegler | Abdalghani Abujabal | Rishiraj Saha Roy | Gerhard Weikum
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

This paper investigates the problem of answering compositional factoid questions over knowledge bases (KB) under efficiency constraints. The method, called TIPI, (i) decomposes compositional questions, (ii) predicts answer types for individual sub-questions, (iii) reasons over the compatibility of joint types, and finally, (iv) formulates compositional SPARQL queries respecting type constraints. TIPI’s answer type predictor is trained using distant supervision, and exploits lexical, syntactic and embedding-based features to compute context- and hierarchy-aware candidate answer types for an input question. Experiments on a recent benchmark show that TIPI results in state-of-the-art performance under the real-world assumption that only a single SPARQL query can be executed over the KB, and substantial reduction in the number of queries in the more general case.

pdf bib
QUINT: Interpretable Question Answering over Knowledge Bases
Abdalghani Abujabal | Rishiraj Saha Roy | Mohamed Yahya | Gerhard Weikum
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present QUINT, a live system for question answering over knowledge bases. QUINT automatically learns role-aligned utterance-query templates from user questions paired with their answers. When QUINT answers a question, it visualizes the complete derivation sequence from the natural language utterance to the final answer. The derivation provides an explanation of how the syntactic structure of the question was used to derive the structure of a SPARQL query, and how the phrases in the question were used to instantiate different parts of the query. When an answer seems unsatisfactory, the derivation provides valuable insights towards reformulating the question.

2016

pdf bib
POLY: Mining Relational Paraphrases from Multilingual Sentences
Adam Grycner | Gerhard Weikum
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
DeepLife: An Entity-aware Search, Analytics and Exploration Platform for Health and Life Sciences
Patrick Ernst | Amy Siu | Dragan Milchevski | Johannes Hoffart | Gerhard Weikum
Proceedings of ACL-2016 System Demonstrations

pdf bib
Know2Look: Commonsense Knowledge for Visual Search
Sreyasi Nag Chowdhury | Niket Tandon | Gerhard Weikum
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

pdf bib
Disambiguation of entities in MEDLINE abstracts by combining MeSH terms with knowledge
Amy Siu | Patrick Ernst | Gerhard Weikum
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

pdf bib
J-NERD: Joint Named Entity Recognition and Disambiguation with Rich Linguistic Features
Dat Ba Nguyen | Martin Theobald | Gerhard Weikum
Transactions of the Association for Computational Linguistics, Volume 4

Methods for Named Entity Recognition and Disambiguation (NERD) perform NER and NED in two separate stages. Therefore, NED may be penalized with respect to precision by NER false positives, and suffers in recall from NER false negatives. Conversely, NED does not fully exploit information computed by NER such as types of mentions. This paper presents J-NERD, a new approach to perform NER and NED jointly, by means of a probabilistic graphical model that captures mention spans, mention types, and the mapping of mentions to entities in a knowledge base. We present experiments with different kinds of texts from the CoNLL’03, ACE’05, and ClueWeb’09-FACC1 corpora. J-NERD consistently outperforms state-of-the-art competitors in end-to-end NERD precision, recall, and F1.

2015

pdf bib
C3EL: A Joint Model for Cross-Document Co-Reference Resolution and Entity Linking
Sourav Dutta | Gerhard Weikum
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
FINET: Context-Aware Fine-Grained Named Entity Typing
Luciano Del Corro | Abdalghani Abujabal | Rainer Gemulla | Gerhard Weikum
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
RELLY: Inferring Hypernym Relationships Between Relational Phrases
Adam Grycner | Gerhard Weikum | Jay Pujara | James Foulds | Lise Getoor
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
EDRAK: Entity-Centric Data Resource for Arabic Knowledge
Mohamed H. Gad-Elrab | Mohamed Amir Yosef | Gerhard Weikum
Proceedings of the Second Workshop on Arabic Natural Language Processing

pdf bib
Semantic Type Classification of Common Words in Biomedical Noun Phrases
Amy Siu | Gerhard Weikum
Proceedings of BioNLP 15

pdf bib
Cross-Document Co-Reference Resolution using Sample-Based Clustering with Knowledge Enrichment
Sourav Dutta | Gerhard Weikum
Transactions of the Association for Computational Linguistics, Volume 3

Identifying and linking named entities across information sources is the basis of knowledge acquisition and at the heart of Web search, recommendations, and analytics. An important problem in this context is cross-document co-reference resolution (CCR): computing equivalence classes of textual mentions denoting the same entity, within and across documents. Prior methods employ ranking, clustering, or probabilistic graphical models using syntactic features and distant features from knowledge bases. However, these methods exhibit limitations regarding run-time and robustness. This paper presents the CROCS framework for unsupervised CCR, improving the state of the art in two ways. First, we extend the way knowledge bases are harnessed, by constructing a notion of semantic summaries for intra-document co-reference chains using co-occurring entity mentions belonging to different chains. Second, we reduce the computational cost by a new algorithm that embeds sample-based bisection, using spectral clustering or graph partitioning, in a hierarchical clustering process. This allows scaling up CCR to large corpora. Experiments with three datasets show significant gains in output quality, compared to the best prior methods, and the run-time efficiency of CROCS.

2014

pdf bib
AIDArabic A Named-Entity Disambiguation Framework for Arabic Text
Mohamed Amir Yosef | Marc Spaniol | Gerhard Weikum
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

pdf bib
Senti-LSSVM: Sentiment-Oriented Multi-Relation Extraction with Latent Structural SVM
Lizhen Qu | Yi Zhang | Rui Wang | Lili Jiang | Rainer Gemulla | Gerhard Weikum
Transactions of the Association for Computational Linguistics, Volume 2

Extracting instances of sentiment-oriented relations from user-generated web documents is important for online marketing analysis. Unlike previous work, we formulate this extraction task as a structured prediction problem and design the corresponding inference as an integer linear program. Our latent structural SVM based model can learn from training corpora that do not contain explicit annotations of sentiment-bearing expressions, and it can simultaneously recognize instances of both binary (polarity) and ternary (comparative) relations with regard to entity mentions of interest. The empirical evaluation shows that our approach significantly outperforms state-of-the-art systems across domains (cameras and movies) and across genres (reviews and forum posts). The gold standard corpus that we built will also be a valuable resource for the community.

pdf bib
HARPY: Hypernyms and Alignment of Relational Paraphrases
Adam Grycner | Gerhard Weikum
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Werdy: Recognition and Disambiguation of Verbs and Verb Phrases with Syntactic and Semantic Pruning
Luciano Del Corro | Rainer Gemulla | Gerhard Weikum
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

pdf bib
Fine-grained Semantic Typing of Emerging Entities
Ndapandula Nakashole | Tomasz Tylenda | Gerhard Weikum
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text
Mohamed Amir Yosef | Sandro Bauer | Johannes Hoffart | Marc Spaniol | Gerhard Weikum
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

pdf bib
Coupling Label Propagation and Constraints for Temporal Fact Extraction
Yafang Wang | Maximilian Dylla | Marc Spaniol | Gerhard Weikum
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
UWN: A Large Multilingual Lexical Knowledge Base
Gerard de Melo | Gerhard Weikum
Proceedings of the ACL 2012 System Demonstrations

pdf bib
Real-time Population of Knowledge Bases: Opportunities and Challenges
Ndapandula Nakashole | Gerhard Weikum
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

pdf bib
HYENA: Hierarchical Type Classification for Entity Names
Mohamed Amir Yosef | Sandro Bauer | Johannes Hoffart | Marc Spaniol | Gerhard Weikum
Proceedings of COLING 2012: Posters

pdf bib
A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
Lizhen Qu | Rainer Gemulla | Gerhard Weikum
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Natural Language Questions for the Web of Data
Mohamed Yahya | Klaus Berberich | Shady Elbassuoni | Maya Ramanath | Volker Tresp | Gerhard Weikum
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
PATTY: A Taxonomy of Relational Patterns with Semantic Types
Ndapandula Nakashole | Gerhard Weikum | Fabian Suchanek
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

pdf bib
Robust Disambiguation of Named Entities in Text
Johannes Hoffart | Mohamed Amir Yosef | Ilaria Bordino | Hagen Fürstenau | Manfred Pinkal | Marc Spaniol | Bilyana Taneva | Stefan Thater | Gerhard Weikum
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

pdf bib
The Bag-of-Opinions Method for Review Rating Prediction from Sparse Text Patterns
Lizhen Qu | Georgiana Ifrim | Gerhard Weikum
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Untangling the Cross-Lingual Link Structure of Wikipedia
Gerard de Melo | Gerhard Weikum
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Providing Multilingual, Multimodal Answers to Lexical Database Queries
Gerard de Melo | Gerhard Weikum
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Language users are increasingly turning to electronic resources to address their lexical information needs, due to their convenience and their ability to simultaneously capture different facets of lexical knowledge in a single interface. In this paper, we discuss techniques to respond to a user's lexical queries by providing multilingual and multimodal information, and facilitating navigating along different types of links. To this end, structured information from sources like WordNet, Wikipedia, Wiktionary, as well as Web services is linked and integrated to provide a multi-faceted yet consistent response to user queries. The meanings of words in many different languages are characterized by mapping them to appropriate WordNet sense identifiers and adding multilingual gloss descriptions as well as example sentences. Relationships are derived from WordNet and Wiktionary to allow users to discover semantically related words, etymologically related words, alternative spellings, as well as misspellings. Last but not least, images, audio recordings, and geographical maps extracted from Wikipedia and Wiktionary allow for a multimodal experience.

2009

pdf bib
Extracting Sense-Disambiguated Example Sentences From Parallel Corpora
Gerard de Melo | Gerhard Weikum
Proceedings of the 1st Workshop on Definition Extraction

2008

pdf bib
Mapping Roget’s Thesaurus and WordNet to French
Gerard de Melo | Gerhard Weikum
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Roget’s Thesaurus and WordNet are very widely used lexical reference works. We describe an automatic mapping procedure that effectively produces French translations of the terms in these two resources. Our approach to the challenging task of disambiguation is based on structural statistics as well as measures of semantic relatedness that are utilized to learn a classification model for associations between entries in the thesaurus and French terms taken from bilingual dictionaries. By building and applying such models, we have produced French versions of Roget’s Thesaurus and WordNet with a considerable level of accuracy, which can be used for a variety of different purposes, by humans as well as in computational applications.

2006

pdf bib
LEILA: Learning to Extract Information by Linguistic Analysis
Fabian M. Suchanek | Georgiana Ifrim | Gerhard Weikum
Proceedings of the 2nd Workshop on Ontology Learning and Population: Bridging the Gap between Text and Knowledge