Building an English-Chinese Parallel Corpus Annotated with Sub-sentential Translation Techniques
Yuming Zhai | Lufei Liu | Xinyi Zhong | Gabriel Illouz | Anne Vilnat
Proceedings of the 12th Language Resources and Evaluation Conference
Human translators often resort to non-literal translation techniques in addition to literal translation, such as idiom equivalence, generalization, particularization, and semantic modulation, especially when the source and target languages have different and distant origins. Translation techniques constitute an important subject in translation studies, helping researchers understand and analyse translated texts. However, they have received less attention in the development of Natural Language Processing (NLP) applications. To fill this gap, one of our long-term objectives is to gain better semantic control over the extraction of paraphrases from bilingual parallel corpora. Toward this goal, we propose the following hypothesis: it is possible to automatically recognize different sub-sentential translation techniques. Because no dedicated data set exists for this new task in English-Chinese, we manually annotated a parallel corpus covering eleven genres. Fifty sentence pairs per genre were annotated in order to consolidate our annotation guidelines. Using this data set, we conducted an experiment on classifying translations as literal or non-literal. The preliminary results confirm our hypothesis. The corpus and code are available. We hope that this annotated corpus will be useful for contrastive linguistic studies and for fine-grained evaluation of NLP tasks such as automatic word alignment and machine translation.