Conference Proceedings

Improving evaluation of document-level machine translation quality estimation

Y Graham, Q Ma, T Baldwin, Q Liu, C Parra, C Scarton

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers | Published: 2017


© 2017 Association for Computational Linguistics. Meaningful conclusions about the relative performance of NLP systems are only possible if the gold standard employed in a given evaluation is both valid and reliable. In this paper, we explore the validity of human annotations currently employed in the evaluation of document-level quality estimation for machine translation (MT). We demonstrate the degree to which MT system rankings are dependent on weights employed in the construction of the gold standard, before proposing direct human assessment as a valid alternative. Experiments show direct assessment (DA) scores for documents to be highly reliable, achieving a correlation of above 0.9 in a..
