Claudiu Cristian Musat


2020

Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German
Lucy Linder | Michael Jungo | Jean Hennebert | Claudiu Cristian Musat | Andreas Fischer
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling.

Control, Generate, Augment: A Scalable Framework for Multi-Attribute Text Generation
Giuseppe Russo | Nora Hollenstein | Claudiu Cristian Musat | Ce Zhang
Findings of the Association for Computational Linguistics: EMNLP 2020

We introduce CGA, a conditional VAE architecture, to control, generate, and augment text. CGA is able to generate natural English sentences controlling multiple semantic and syntactic attributes by combining adversarial learning with a context-aware loss and a cyclical word dropout routine. We demonstrate the value of the individual model components in an ablation study. The scalability of our approach is ensured through a single discriminator, independently of the number of attributes. We show high quality, diversity and attribute control in the generated sentences through a series of automatic and human assessments. As the main application of our work, we test the potential of this new NLG model in a data augmentation scenario. In a downstream NLP task, the sentences generated by our CGA model show significant improvements over a strong baseline, and a classification performance often comparable to adding the same amount of additional real data.