Predicting Multidimensional Subjective Ratings of Children’s Readings from the Speech Signals for the Automatic Assessment of Fluency
Gérard Bailly | Erika Godde | Anne-Laure Piat-Marchand | Marie-Line Bosse
Proceedings of the 12th Language Resources and Evaluation Conference

The objective of this research is to estimate multidimensional subjective ratings of the reading performance of young readers from signal-based objective measures. We combine linguistic features (number of correct words, repetitions, deletions, and insertions uttered per minute, etc.) with phonetic features. Expressivity is particularly difficult to predict since there is no unique gold standard. We propose a novel framework for performing such an estimation that exploits multiple references performed by adults, and demonstrate its efficiency using recordings of 273 pupils.


Vizart3D : Retour Articulatoire Visuel pour l’Aide à la Prononciation (Vizart3D: Visual Articulatory Feedback for Computer-Assisted Pronunciation Training) [in French]
Thomas Hueber | Atef Ben-Youssef | Pierre Badin | Gérard Bailly | Frédéric Eliséi
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 5: Software Demonstrations


Does a Virtual Talking Face Generate Proper Multimodal Cues to Draw User’s Attention to Points of Interest?
Stephan Raidt | Gérard Bailly | Frederic Elisei
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We present a series of experiments investigating face-to-face interaction between an Embodied Conversational Agent (ECA) and a human interlocutor. The ECA is embodied by a video-realistic talking head with independent head and eye movements. To be useful in face-to-face interaction, the ECA should be able to derive meaning from the communicative gestures of a human interlocutor and likewise to reproduce such gestures. By conveying its capability to interpret human behaviour, the system encourages the interlocutor to show appropriate natural activity. It is therefore important that the ECA knows how to display what would correspond to mental states in humans. This makes it possible to interpret the machine processes of the system in terms of human expressiveness and to assign them a corresponding meaning, so that the system can maintain an interaction based on human patterns. In a first experiment, we investigated the ability of our talking head to direct user attention with facial deictic cues (Raidt, Bailly et al. 2005). Users interacted with the ECA during a simple card game offering different levels of help and guidance through facial deictic cues. We analyzed the users’ performance and their perception of the quality of assistance given by the ECA. The experiment showed that users profit from its presence and its facial deictic cues. In the follow-up series of experiments presented here, we investigated the effect of enhancing the multimodality of the deictic gestures by adding a spoken instruction.

A joint intelligibility evaluation of French text-to-speech synthesis systems: the EvaSy SUS/ACR campaign
Philippe Boula de Mareüil | Christophe d’Alessandro | Alexander Raake | Gérard Bailly | Marie-Neige Garcia | Michel Morel
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The EVALDA/EvaSy project is dedicated to the evaluation of text-to-speech synthesis systems for the French language. It is subdivided into four components: evaluation of the grapheme-to-phoneme conversion module (Boula de Mareüil et al., 2005), evaluation of prosody (Garcia et al., 2006), evaluation of intelligibility, and global evaluation of the quality of the synthesised speech. This paper reports the key results of the intelligibility and global evaluations. It focuses on intelligibility, assessed on the basis of semantically unpredictable sentences, but also provides a comparison with absolute category rating in terms of, e.g., pleasantness and naturalness. Three diphone systems and three selection systems have been evaluated. It turns out that the most intelligible system (diphone-based) is far from being the one that obtains the best mean opinion score.

A joint prosody evaluation of French text-to-speech synthesis systems
Marie-Neige Garcia | Christophe d’Alessandro | Gérard Bailly | Philippe Boula de Mareüil | Michel Morel
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper reports on prosodic evaluation in the framework of the EVALDA/EvaSy project for text-to-speech (TTS) evaluation for the French language. Prosody is evaluated using a prosodic transplantation paradigm. Intonation contours generated by the synthesis systems are transplanted onto a common segmental content. Both diphone-based synthesis and natural speech are used. Five TTS systems are tested along with natural voice. The test is a paired preference test (with 19 subjects), using 7 sentences. The results indicate that natural speech consistently obtains the first rank (with an average preference rate of 80%), followed by a selection-based system (72%) and a diphone-based system (58%). However, rather large variations in judgements are observed among subjects and sentences, and in some cases synthetic speech is preferred to natural speech. These results show the remarkable improvement achieved by the best selection-based synthesis systems in terms of prosody. In this way, a new paradigm for evaluation of the prosodic component of TTS systems has been successfully demonstrated.


Evaluation of a Speech Cuer: From Motion Capture to a Concatenative Text-to-cued Speech System
Guillaume Gibert | Gérard Bailly | Frédéric Eliséi | Denis Beautemps | Rémi Brun
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)


The Cost258 Signal Generation Test Array
Gérard Bailly | Eduardo R. Banga | Alex Monaghan | Erhard Rank
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)