Claudiu Musat

Also published as: Claudiu-Cristian Musat


2020

pdf bib
A Swiss German Dictionary: Variation in Speech and Writing
Larissa Schmidt | Lucy Linder | Sandra Djambazovska | Alexandros Lazaridis | Tanja Samardžić | Claudiu Musat
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce a dictionary containing normalized forms of common words in various Swiss German dialects into High German. As Swiss German is, for now, a predominantly spoken language, there is a significant variation in the written forms, even between speakers of the same dialect. To alleviate the uncertainty associated with this diversity, we complement the pairs of Swiss German - High German words with the Swiss German phonetic transcriptions (SAMPA). This dictionary becomes thus the first resource to combine large-scale spontaneous translation with phonetic transcriptions. Moreover, we control for the regional distribution and insure the equal representation of the major Swiss dialects. The coupling of the phonetic and written Swiss German forms is powerful. We show that they are sufficient to train a Transformer-based phoneme to grapheme model that generates credible novel Swiss German writings. In addition, we show that the inverse mapping - from graphemes to phonemes - can be modeled with a transformer trained with the novel dictionary. This generation of pronunciations for previously unknown words is key in training extensible automated speech recognition (ASR) systems, which are key beneficiaries of this dictionary.

2019

pdf bib
Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes
Noémien Kocher | Christian Scuito | Lorenzo Tarantino | Alexandros Lazaridis | Andreas Fischer | Claudiu Musat
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

In sequence modeling tasks the token order matters, but this information can be partially lost due to the discretization of the sequence into data points. In this paper, we study the imbalance between the way certain token pairs are included in data points and others are not. We denote this a token order imbalance (TOI) and we link the partial sequence information loss to a diminished performance of the system as a whole, both in text and speech processing tasks. We then provide a mechanism to leverage the full token order information—Alleviated TOI—by iteratively overlapping the token composition of data points. For recurrent networks, we use prime numbers for the batch size to avoid redundancies when building batches from overlapped data points. The proposed method achieved state of the art performance in both text and speech related tasks.

2018

pdf bib
Machine Translation of Low-Resource Spoken Dialects: Strategies for Normalizing Swiss German
Pierre-Edouard Honnet | Andrei Popescu-Belis | Claudiu Musat | Michael Baeriswyl
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Churn Intent Detection in Multilingual Chatbot Conversations and Social Media
Christian Abbet | Meryem M’hamdi | Athanasios Giannakopoulos | Robert West | Andreea Hossmann | Michael Baeriswyl | Claudiu Musat
Proceedings of the 22nd Conference on Computational Natural Language Learning

We propose a new method to detect when users express the intent to leave a service, also known as churn. While previous work focuses solely on social media, we show that this intent can be detected in chatbot conversations. As companies increasingly rely on chatbots they need an overview of potentially churny users. To this end, we crowdsource and publish a dataset of churn intent expressions in chatbot interactions in German and English. We show that classifiers trained on social media data can detect the same intent in the context of chatbots. We introduce a classification architecture that outperforms existing work on churn intent detection in social media. Moreover, we show that, using bilingual word embeddings, a system trained on combined English and German data outperforms monolingual approaches. As the only existing dataset is in English, we crowdsource and publish a novel dataset of German tweets. We thus underline the universal aspect of the problem, as examples of churn intent in English help us identify churn in German tweets and chatbot conversations.

pdf bib
Simple Unsupervised Keyphrase Extraction using Sentence Embeddings
Kamil Bennani-Smires | Claudiu Musat | Andreea Hossmann | Michael Baeriswyl | Martin Jaggi
Proceedings of the 22nd Conference on Computational Natural Language Learning

Keyphrase extraction is the task of automatically selecting a small set of phrases that best describe a given free text document. Supervised keyphrase extraction requires large amounts of labeled training data and generalizes very poorly outside the domain of the training data. At the same time, unsupervised systems have poor accuracy, and often do not generalize well, as they require the input document to belong to a larger corpus also given as input. Addressing these drawbacks, in this paper, we tackle keyphrase extraction from single documents with EmbedRank: a novel unsupervised method, that leverages sentence embeddings. EmbedRank achieves higher F-scores than graph-based state of the art systems on standard datasets and is suitable for real-time processing of large amounts of Web data. With EmbedRank, we also explicitly increase coverage and diversity among the selected keyphrases by introducing an embedding-based maximal marginal relevance (MMR) for new phrases. A user study including over 200 votes showed that, although reducing the phrases’ semantic overlap leads to no gains in F-score, our high diversity selection is preferred by humans.

pdf bib
Embedding Individual Table Columns for Resilient SQL Chatbots
Bojan Petrovski | Ignacio Aguado | Andreea Hossmann | Michael Baeriswyl | Claudiu Musat
Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI

Most of the world’s data is stored in relational databases. Accessing these requires specialized knowledge of the Structured Query Language (SQL), putting them out of the reach of many people. A recent research thread in Natural Language Processing (NLP) aims to alleviate this problem by automatically translating natural language questions into SQL queries. While the proposed solutions are a great start, they lack robustness and do not easily generalize: the methods require high quality descriptions of the database table columns, and the most widely used training dataset, WikiSQL, is heavily biased towards using those descriptions as part of the questions. In this work, we propose solutions to both problems: we entirely eliminate the need for column descriptions, by relying solely on their contents, and we augment the WikiSQL dataset by paraphrasing column names to reduce bias. We show that the accuracy of existing methods drops when trained on our augmented, column-agnostic dataset, and that our own method reaches state of the art accuracy, while relying on column contents only.

2017

pdf bib
Unsupervised Aspect Term Extraction with B-LSTM & CRF using Automatically Labelled Datasets
Athanasios Giannakopoulos | Claudiu Musat | Andreea Hossmann | Michael Baeriswyl
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Aspect Term Extraction (ATE) identifies opinionated aspect terms in texts and is one of the tasks in the SemEval Aspect Based Sentiment Analysis (ABSA) contest. The small amount of available datasets for supervised ATE and the costly human annotation for aspect term labelling give rise to the need for unsupervised ATE. In this paper, we introduce an architecture that achieves top-ranking performance for supervised ATE. Moreover, it can be used efficiently as feature extractor and classifier for unsupervised ATE. Our second contribution is a method to automatically construct datasets for ATE. We train a classifier on our automatically labelled datasets and evaluate it on the human annotated SemEval ABSA test sets. Compared to a strong rule-based baseline, we obtain a dramatically higher F-score and attain precision values above 80%. Our unsupervised method beats the supervised ABSA baseline from SemEval, while preserving high precision scores.

2013

pdf bib
Fine-Grained Emotion Recognition in Olympic Tweets Based on Human Computation
Valentina Sintsova | Claudiu Musat | Pearl Pu
Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

2012

pdf bib
Sentiment Analysis Using a Novel Human Computation Game
Claudiu-Cristian Musat | Alireza Ghasemi | Boi Faltings
Proceedings of the 3rd Workshop on the People’s Web Meets NLP: Collaboratively Constructed Semantic Resources and their Applications to NLP