Boi Faltings


2020

pdf bib
HotelRec: a Novel Very Large-Scale Hotel Recommendation Dataset
Diego Antognini | Boi Faltings
Proceedings of the 12th Language Resources and Evaluation Conference

Today, recommender systems are an inevitable part of everyone’s daily digital routine and are present on most internet platforms. State-of-the-art deep learning-based models require a large number of data to achieve their best performance. Many datasets fulfilling this criterion have been proposed for multiple domains, such as Amazon products, restaurants, or beers. However, works and datasets in the hotel domain are limited: the largest hotel review dataset is below the million samples. Additionally, the hotel domain suffers from a higher data sparsity than traditional recommendation datasets and therefore, traditional collaborative-filtering approaches cannot be applied to such data. In this paper, we propose HotelRec, a very large-scale hotel recommendation dataset, based on TripAdvisor, containing 50 million reviews. To the best of our knowledge, HotelRec is the largest publicly available dataset in the hotel domain (50M versus 0.9M) and additionally, the largest recommendation dataset in a single domain and with textual reviews (50M versus 22M). We release HotelRec for further research: https://github.com/Diego999/HotelRec.

pdf bib
GameWikiSum: a Novel Large Multi-Document Summarization Dataset
Diego Antognini | Boi Faltings
Proceedings of the 12th Language Resources and Evaluation Conference

Today’s research progress in the field of multi-document summarization is obstructed by the small number of available datasets. Since the acquisition of reference summaries is costly, existing datasets contain only hundreds of samples at most, resulting in heavy reliance on hand-crafted features or necessitating additional, manually annotated data. The lack of large corpora therefore hinders the development of sophisticated models. Additionally, most publicly available multi-document summarization corpora are in the news domain, and no analogous dataset exists in the video game domain. In this paper, we propose GameWikiSum, a new domain-specific dataset for multi-document summarization, which is one hundred times larger than commonly used datasets, and in another domain than news. Input documents consist of long professional video game reviews as well as references of their gameplay sections in Wikipedia pages. We analyze the proposed dataset and show that both abstractive and extractive models can be trained on it. We release GameWikiSum for further research: https://github.com/Diego999/GameWikiSum.

pdf bib
Continual Learning for Natural Language Generation in Task-oriented Dialog Systems
Fei Mi | Liangwei Chen | Mengjie Zhao | Minlie Huang | Boi Faltings
Findings of the Association for Computational Linguistics: EMNLP 2020

Natural language generation (NLG) is an essential component of task-oriented dialog systems. Despite the recent success of neural approaches for NLG, they are typically developed in an offline manner for particular domains. To better fit real-life applications where new data come in a stream, we study NLG in a “continual learning” setting to expand its knowledge to new domains or functionalities incrementally. The major challenge towards this goal is catastrophic forgetting, meaning that a continually trained model tends to forget the knowledge it has learned before. To this end, we propose a method called ARPER (Adaptively Regularized Prioritized Exemplar Replay) by replaying prioritized historical exemplars, together with an adaptive regularization technique based on Elastic Weight Consolidation. Extensive experiments to continually learn new domains and intents are conducted on MultiWoZ-2.0 to benchmark ARPER with a wide range of techniques. Empirical results demonstrate that ARPER significantly outperforms other methods by effectively mitigating the detrimental catastrophic forgetting issue.

2019

pdf bib
Learning to Create Sentence Semantic Relation Graphs for Multi-Document Summarization
Diego Antognini | Boi Faltings
Proceedings of the 2nd Workshop on New Frontiers in Summarization

Linking facts across documents is a challenging task, as the language used to express the same information in a sentence can vary significantly, which complicates the task of multi-document summarization. Consequently, existing approaches heavily rely on hand-crafted features, which are domain-dependent and hard to craft, or additional annotated data, which is costly to gather. To overcome these limitations, we present a novel method, which makes use of two types of sentence embeddings: universal embeddings, which are trained on a large unrelated corpus, and domain-specific embeddings, which are learned during training. To this end, we develop SemSentSum, a fully data-driven model able to leverage both types of sentence embeddings by building a sentence semantic relation graph. SemSentSum achieves competitive results on two types of summary, consisting of 665 bytes and 100 words. Unlike other state-of-the-art models, neither hand-crafted features nor additional annotated data are necessary, and the method is easily adaptable for other tasks. To our knowledge, we are the first to use multiple sentence embeddings for the task of multi-document summarization.

2012

pdf bib
Sentiment Analysis Using a Novel Human Computation Game
Claudiu-Cristian Musat | Alireza Ghasemi | Boi Faltings
Proceedings of the 3rd Workshop on the People’s Web Meets NLP: Collaboratively Constructed Semantic Resources and their Applications to NLP