Grégoire Winterstein


CantoMap: a Hong Kong Cantonese MapTask Corpus
Grégoire Winterstein | Carmen Tang | Regine Lai
Proceedings of the 12th Language Resources and Evaluation Conference

This work reports on the construction of a corpus of connected spoken Hong Kong Cantonese. The corpus aims at providing an additional resource for the study of modern (Hong Kong) Cantonese and also involves several controlled elicitation tasks which will serve different projects related to the phonology and semantics of Cantonese. The word-segmented corpus offers recordings, phonemic transcription, and Chinese character transcription. The corpus contains a total of 768 minutes of recordings and transcripts of forty speakers. All the audio material has been aligned at utterance level with the transcriptions, using the ELAN transcription and annotation tool. The controlled elicitation task was based on the design of the HCRC MapTask corpus (Anderson et al., 1991), in which participants had to communicate using solely verbal means as eye contact was restricted. In this paper, we outline the design of the maps and their landmarks, the basic segmentation principles of the data, and the various transcription conventions we adopted. We also compare the contents of CantoMap to those of comparable Cantonese corpora.

Cifu: a Frequency Lexicon of Hong Kong Cantonese
Regine Lai | Grégoire Winterstein
Proceedings of the 12th Language Resources and Evaluation Conference

This paper introduces Cifu, a lexical database for Hong Kong Cantonese (HKC) that offers phonological and orthographic information, frequency measures, and lexical neighborhood information for lexical items in HKC. Cifu is of use for NLP applications and the design and analysis of psycholinguistics experiments on HKC. We elaborate on the characteristics and challenges specific to HKC that were relevant in the design of Cifu. These include lexical, orthographic and phonological aspects of HKC, word segmentation issues, the place of HKC in written media, and the availability of data. We discuss the measure of Neighborhood Density (ND), highlighting how the analytic nature of Cantonese and its writing system affect that measure. We justify using six different variations of ND, based on the possibility of inserting or deleting phonemes when searching for neighbors and on the choice of data for retrieving frequencies. Statistics about the four genres (written, adult spoken, children spoken and child-directed) within the dataset are discussed. We find that the lexical diversity of the child-directed speech genre is particularly low, compared to a size-matched written corpus. The correlations of word frequencies of different genres are all high, but generally decrease as word length increases.
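
As an illustration of the Neighborhood Density idea mentioned in the abstract above, the following minimal sketch counts the lexicon entries that differ from a target word by exactly one phoneme, with a switch for whether insertions and deletions count alongside substitutions. It is not the Cifu implementation: the function names, the `allow_indel` parameter, and the toy lexicon are assumptions made for this example, the phoneme symbols are placeholders rather than real HKC entries, and the frequency-based variants are not shown.

```python
def is_neighbor(a, b, allow_indel=True):
    """Return True if phoneme tuples a and b differ by exactly one edit."""
    if a == b:
        return False  # a word is not its own neighbor
    if len(a) == len(b):
        # Same length: neighbors if exactly one phoneme is substituted.
        return sum(x != y for x, y in zip(a, b)) == 1
    if allow_indel and abs(len(a) - len(b)) == 1:
        short, long_ = (a, b) if len(a) < len(b) else (b, a)
        # Lengths differ by one: neighbors if deleting one phoneme
        # from the longer sequence yields the shorter one.
        return any(long_[:i] + long_[i + 1:] == short
                   for i in range(len(long_)))
    return False


def neighborhood_density(word, lexicon, allow_indel=True):
    """Count the entries of lexicon that are neighbors of word."""
    return sum(is_neighbor(word, other, allow_indel) for other in lexicon)


# Hypothetical toy lexicon; each word is a tuple of placeholder phonemes.
toy_lexicon = [
    ("s", "a", "m"),  # one substitution away from the target
    ("t", "a", "n"),  # one substitution away from the target
    ("s", "a"),       # one deletion away from the target
    ("s", "a", "n"),  # the target itself, never counted
    ("m", "a", "i"),  # two substitutions away, not a neighbor
]
target = ("s", "a", "n")

nd_with_indel = neighborhood_density(target, toy_lexicon, allow_indel=True)
nd_substitution_only = neighborhood_density(target, toy_lexicon, allow_indel=False)
```

In this toy setting, switching `allow_indel` off removes the shorter word from the neighbor count, which is the kind of difference between ND variants that the abstract refers to.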


Minoan linguistic resources: The Linear A Digital Corpus
Tommaso Petrolito | Ruggero Petrolito | Francesco Perono Cacciafoco | Grégoire Winterstein
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)


Building and exploiting a French corpus for sentiment analysis (Construction et exploitation d’un corpus français pour l’analyse de sentiment) [in French]
Marc Vincent | Grégoire Winterstein
Proceedings of TALN 2013 (Volume 2: Short Papers)