[Keywords]
[Abstract]
Machine translation is the task of using computers to convert one natural language into another, and it is one of the hot research topics in artificial intelligence. In recent years, with the development of deep learning, neural machine translation models based on the sequence-to-sequence architecture have outperformed statistical machine translation models on translation tasks for many language pairs and have been widely deployed in commercial translation systems. Although the practical application of commercial translation systems intuitively shows that the performance of neural machine translation models has improved greatly, how to systematically evaluate their translation quality remains a challenging task. On the one hand, if translation quality is evaluated against reference translations, obtaining high-quality references is very expensive; on the other hand, compared with statistical machine translation models, neural machine translation models suffer from more pronounced robustness problems, yet there has been no study examining the robustness of neural machine translation models. To address these challenges, this paper proposes a multi-granularity testing framework based on metamorphic testing, which evaluates the translation quality and robustness of neural machine translation systems without reference translations. The framework first replaces the source sentence at the sentence, phrase, and word granularities, then computes the similarity between the translations of the source sentence and of the replaced sentences based on edit distance and constituency parse trees, and finally judges from the similarity whether the translation results satisfy the metamorphic relations. Comparative experiments with different metamorphic testing frameworks are conducted on Chinese-English datasets from five domains (education, microblog, news, spoken language, and subtitles) against six mainstream commercial neural machine translation systems. The experimental results show that, in terms of the Pearson and Spearman correlation coefficients with the reference-based method, the proposed method is 80% and 20% higher, respectively, than methods of the same type, indicating that the proposed reference-free evaluation method correlates more positively with the reference-based evaluation method and that its evaluation accuracy is significantly better than that of other methods of the same type.
[Key word]
[Abstract]
Machine translation is the task of converting one natural language into another, and it is one of the hot research topics in artificial intelligence. In recent years, neural machine translation models based on the sequence-to-sequence architecture have achieved better performance than traditional statistical machine translation models on multiple language pairs, and have been widely adopted by translation service providers. Although the practical application of commercial translation systems shows that neural machine translation models have improved greatly, how to systematically evaluate their translation quality is still a challenging task. On the one hand, if translation quality is evaluated against reference translations, acquiring high-quality references is very expensive. On the other hand, compared with statistical machine translation models, neural machine translation models exhibit more pronounced robustness problems, yet there are no studies on the robustness of neural machine translation models. This study proposes a multi-granularity testing framework, MGMT, based on metamorphic testing, which can evaluate the translation quality and robustness of neural machine translation systems without reference translations. The framework first replaces the source sentence at the sentence, phrase, and word granularities, then compares the translations of the source sentence and the replaced sentences based on edit distance and constituency parse trees, and finally judges whether the result satisfies the metamorphic relation. Experiments are conducted on multi-domain Chinese-English translation datasets covering education, microblog, news, spoken language, and subtitles; six industrial neural machine translation systems are evaluated, and MGMT is compared with metamorphic testing methods of the same type as well as with reference-translation-based methods. The experimental results show that the proposed method MGMT is 80% and 20% higher than similar methods in terms of the Pearson and Spearman correlation coefficients, respectively. This indicates that the reference-free evaluation method proposed in this study has a higher positive correlation with reference-translation-based evaluation, verifying that MGMT's evaluation accuracy is significantly better than that of other methods of the same type.
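To make the checking step concrete, the following is a minimal Python sketch, not the authors' released implementation, of the word-granularity metamorphic check described above. The translate callable is a hypothetical stand-in for any commercial NMT API, the 0.8 threshold is illustrative, and only the edit-distance similarity is shown; the constituency-parse-tree comparison used in the paper is omitted for brevity.

# Minimal sketch of a metamorphic check for NMT output:
# translate a source sentence and a perturbed copy, then require the two
# translations to stay similar under a normalized edit-distance measure.
# `translate` and the 0.8 threshold are illustrative assumptions.

from typing import Callable, List

def edit_distance(a: List[str], b: List[str]) -> int:
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def similarity(t1: str, t2: str) -> float:
    """Normalized edit-distance similarity in [0, 1] over word tokens."""
    w1, w2 = t1.split(), t2.split()
    if not w1 and not w2:
        return 1.0
    return 1.0 - edit_distance(w1, w2) / max(len(w1), len(w2))

def metamorphic_check(source: str,
                      perturbed: str,
                      translate: Callable[[str], str],
                      threshold: float = 0.8) -> bool:
    """Return True if the translations of the source sentence and its
    word/phrase/sentence-granularity perturbation remain similar enough,
    i.e. the metamorphic relation is considered satisfied."""
    return similarity(translate(source), translate(perturbed)) >= threshold

In use, a test driver would generate the perturbed sentence by substituting a word, phrase, or whole sentence in the source, call metamorphic_check against each NMT system under test, and count violations as indicators of robustness problems, all without any reference translation.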
[CLC Number]
TP311
[Fund Project]
National Natural Science Foundation of China (61802167, 61972197, 61802095); Natural Science Foundation of Jiangsu Province (BK20201250)