Neural Machine Translation: WMT14 English-French Baseline Systems



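This post reviews progress on WMT14 English-French translation baseline systems since 2017, covering GNMT's 32K-wordpiece model, the Transformer base and big models, RNMT+, ConvS2S, and Fairseq. The models use different vocabulary processing schemes, such as wordpieces and BPE, and the reported results show Fairseq reaching a high score of 43.2 BLEU on WMT'14.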

Recent (since 2017) WMT14 English-French baseline records

1. GNMT

https://arxiv.org/pdf/1609.08144.pdf

Data processing: a shared source and target vocabulary of 32K wordpieces

For the wordpiece models, we train 3 different models with vocabulary sizes of 8K, 16K, and 32K. Table 4 summarizes our results on the WMT En→Fr dataset. In this table, we also compare against other strong baselines without model ensembling. As can be seen from the table, “WPM-32K”, a wordpiece model with a shared source and target vocabulary of 32K wordpieces, performs well on this dataset and achieves the best quality as well as the fastest inference speed.
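To make the idea of a shared source and target wordpiece vocabulary concrete, here is a minimal sketch in Python. It assumes a greedy longest-match-first segmentation and a tiny hand-written vocabulary. Both are illustrative assumptions: the real GNMT vocabulary is learned from data and has 32K entries, and the names `TOY_SHARED_VOCAB`, `segment_word`, and `segment` are hypothetical. The `_` word-start marker follows the convention shown in the GNMT paper.

```python
# Toy vocabulary shared by English and French (hypothetical entries,
# not taken from the GNMT paper). "_" marks the start of a word.
TOY_SHARED_VOCAB = {
    "_the", "_trans", "lation",   # pieces useful for English
    "_la", "_tra", "duction",     # pieces useful for French
}


def segment_word(word, vocab, unk="<unk>"):
    """Split one word into pieces by greedy longest-match-first lookup.

    Assumption: greedy matching is used here for illustration only;
    the paper does not commit to this exact procedure.
    """
    text = "_" + word  # prepend the word-start marker
    pieces = []
    start = 0
    while start < len(text):
        end = len(text)
        # Shrink the candidate until it is found in the vocabulary.
        while end > start and text[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the word.
            return [unk]
        pieces.append(text[start:end])
        start = end
    return pieces


def segment(sentence, vocab):
    """Segment a whitespace-tokenized sentence with one shared vocabulary."""
    return [piece for word in sentence.split()
            for piece in segment_word(word, vocab)]


# The same vocabulary serves both the source and the target language.
print(segment("the translation", TOY_SHARED_VOCAB))
print(segment("la traduction", TOY_SHARED_VOCAB))
```

Because both languages draw from one vocabulary, strings that appear on both sides (names, numbers, cognate fragments) map to the same pieces, which is one commonly cited motivation for sharing it.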

On WMT En→Fr, the training set contains 36M sentence pairs. In both cases, we use newstest2014 as the test sets to compare against previous work. The combination of newstest2012 and