Michel Généreux


The Gulf of Guinea Creole Corpora
Tjerk Hagemeijer | Michel Généreux | Iris Hendrickx | Amália Mendes | Abigail Tiny | Armando Zamora
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the process of building linguistic corpora of the Portuguese-related Gulf of Guinea creoles, a cluster of four historically related languages: Santome, Angolar, Principense and Fa d’Ambô. We faced the typical difficulties of languages lacking an official status, such as lack of standard spelling, language variation, lack of basic language instruments, and small data sets, which comprise data from the late 19th century to the present. In order to tackle these problems, the compiled written and transcribed spoken data collected during fieldwork trips were adapted to a normalized spelling that was applied to the four languages. For the corpus compilation we followed corpus linguistics standards. We recorded metadata for each file and added morphosyntactic information based on a part-of-speech tag set that was designed to deal with the specificities of these languages. The corpora of three of the four creoles are already available and searchable via an online web interface.

A corpus of European Portuguese child and child-directed speech
Ana Lúcia Santos | Michel Généreux | Aida Cardoso | Celina Agostinho | Silvana Abalada
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) and using the CLAN software (MacWhinney, 2000). The corpus itself represents a valuable resource for the study of lexical, syntactic and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS-tagger trained on the analysis of written language can be exploited for the treatment of spoken material with minimal effort, with only a small number of written rules assisting the statistical model.


Introducing the Reference Corpus of Contemporary Portuguese Online
Michel Généreux | Iris Hendrickx | Amália Mendes
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present our work in processing the Reference Corpus of Contemporary Portuguese and its publication online. After discussing how the corpus was built and our choice of meta-data, we turn to the processes and tools involved for the cleaning, preparation and annotation to make the corpus suitable for linguistic inquiries. The Web platform is described, and we show examples of linguistic resources that can be extracted from the platform for use in linguistic studies or in NLP.

Contrasting Objective and Subjective Portuguese Texts from Heterogeneous Sources
Michel Généreux | William Martinez
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data


CBSEAS, a Summarization System – Integration of Opinion Mining Techniques to Summarize Blogs
Aurélien Bossard | Michel Généreux | Thierry Poibeau
Proceedings of the Demonstrations Session at EACL 2009


Cultural Heritage Digital Resources: From Extraction to Querying
Michel Généreux
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007)


Towards a validated model for affective classification of texts
Michel Généreux | Roger Evans
Proceedings of the Workshop on Sentiment and Subjectivity in Text