The quality of machine translation produced by state-of-the-art models is already quite high and often requires only minor corrections from professional human translators. This is especially true for high-resource language pairs such as English-German and English-French. As a result, much of the recent research in machine translation has focused on improving performance for low-resource language pairs, where large monolingual corpora are available in each language but sufficiently large parallel corpora are not.
Facebook AI researchers appear to lead this research area and have introduced several interesting solutions for low-resource machine translation during the last year. These include augmenting the training data with back-translation, learning joint multilingual sentence representations, and extending BERT to a cross-lingual setting.
If you’d like to skip around, here are the papers we featured:
- Phrase-Based & Neural Unsupervised Machine Translation
- Meta-Learning for Low-Resource Neural Machine Translation
- Understanding Back-Translation at Scale
- Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
- Cross-lingual Language Model Pretraining
Important Machine Translation Research Papers
1. Phrase-Based & Neural Unsupervised Machine Translation, by Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, Marc’Aurelio Ranzato
Original Abstract
Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs. This work investigates how to learn to translate when having access to only large monolingual corpora in each language. We propose two model variants, a neural and a phrase-based model. Both versions leverage a careful initialization of the parameters, the denoising effect of language models and automatic generation of parallel data by iterative back-translation. These models are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters. On the widely used WMT’14 English-French and WMT’16 German-English benchmarks, our models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence, outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results than semi-supervised and supervised approaches leveraging the paucity of available bitexts. Our code for NMT and PBSMT is publicly available.
Our Summary
Facebook AI researchers acknowledge the lack of large parallel corpora for training machine translation systems and suggest a better way to leverage monolingual data for machine translation (MT). In particular, they argue that unsupervised MT can be accomplished successfully with suitable initialization of the translation model, language modeling, and iterative back-translation. The researchers suggest two model variants, a neural and a phrase-based one, and both dramatically outperform the previous state of the art.
What’s the core idea of this paper?
- Unsupervised MT can be accomplished with:
- suitable initialization of the translation models (e.g., via byte-pair encodings learned jointly over both languages);
- language models trained in both the source and target languages to improve the quality of translations (e.g., by performing local substitutions and word reordering);
- iterative back-translation for the automatic generation of parallel data (see the sketch after this list).
- There are two model variants, neural and phrase-based:
- Neural machine translation has an additional important property: it shares internal representations across languages.
- Phrase-based machine translation outperforms neural models on low-resource language pairs and is easy to interpret and fast to train.
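To make the third ingredient concrete, here is a minimal sketch of the iterative back-translation loop with toy stand-in models; the class and data below are hypothetical placeholders, not the paper's NMT/PBSMT systems (which are available in the authors' released code).

```python
# Minimal sketch of iterative back-translation for unsupervised MT.
# ToyTranslator is a hypothetical stand-in: a real system would be an NMT or PBSMT model.

class ToyTranslator:
    """Dummy 'translator' that just reverses word order, used to show the loop structure."""
    def translate(self, sentences):
        return [" ".join(reversed(s.split())) for s in sentences]

    def train(self, parallel_pairs):
        # A real model would update its parameters on these (source, target) pairs.
        pass

src_to_tgt = ToyTranslator()   # initialized e.g. from shared BPE embeddings / a seed dictionary
tgt_to_src = ToyTranslator()

src_mono = ["the cat sits on the mat"]            # monolingual source-language data
tgt_mono = ["die katze sitzt auf der matte"]      # monolingual target-language data

for iteration in range(3):
    # 1) back-translate target monolingual data into the source language,
    #    producing synthetic (source, target) pairs to train the forward model
    synthetic_src = tgt_to_src.translate(tgt_mono)
    src_to_tgt.train(list(zip(synthetic_src, tgt_mono)))

    # 2) do the same in the other direction to improve the backward model
    synthetic_tgt = src_to_tgt.translate(src_mono)
    tgt_to_src.train(list(zip(synthetic_tgt, src_mono)))
```

With each iteration, better forward and backward models generate better synthetic parallel data for each other; this self-reinforcing loop, together with denoising language-model training (omitted here), is what the paper relies on.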
What’s the key achievement?
- Neural and phrase-based machine translation models remarkably outperform previous unsupervised baselines, for example:
- for the English-French task, the phrase-based translation model obtains a BLEU score of 28.1 (+11 BLEU points over the previous best result);
- for the German-English task, the combined neural and phrase-based translation models reach a BLEU score of 25.2 (+10 BLEU points over the baseline).
- The unsupervised phrase-based translation model achieves the same performance as its supervised counterpart trained on more than 100,000 parallel sentences.
What does the AI community think?
- The paper was awarded the Best Paper Award at EMNLP 2018, a leading conference in the area of natural language processing.
- “We could go now on a planet where people speak a language that nobody else speaks — and you can actually go and try to have a decent translation of what is said there,” Antoine Bordes, FAIR Paris Lab director, said to VentureBeat.
What are future research areas?
- Searching for more effective instantiations of the suggested principles or other principles altogether.
- Extending to semi-supervised settings.
What are possible business applications?
- Improving the results of machine translation for language pairs where there are not enough parallel corpora to train supervised machine translation systems.
Where can you get implementation code?
- The Facebook team provides the original implementation for this research paper on GitHub.
2. Meta-Learning for Low-Resource Neural Machine Translation, by Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, Victor O.K. Li
Original Abstract
In this paper, we propose to extend the recently introduced model-agnostic meta-learning algorithm (MAML) for low-resource neural machine translation (NMT). We frame low-resource translation as a meta-learning problem, and we learn to adapt to low-resource languages based on multilingual high-resource language tasks. We use the universal lexical representation to overcome the input-output mismatch across different languages. We evaluate the proposed meta-learning strategy using eighteen European languages (Bg, Cs, Da, De, El, Es, Et, Fr, Hu, It, Lt, Nl, Pl, Pt, Sk, Sl, Sv and Ru) as source tasks and five diverse languages (Ro, Lv, Fi, Tr and Ko) as target tasks. We show that the proposed approach significantly outperforms the multilingual, transfer learning based approach and enables us to train a competitive NMT system with only a fraction of training examples. For instance, the proposed approach can achieve as high as 22.04 BLEU on Romanian-English WMT’16 by seeing only 16,000 translated words (~600 parallel sentences).
Our Summary
Researchers from the University of Hong Kong and New York University use a model-agnostic meta-learning algorithm (MAML) to address low-resource machine translation. In particular, they suggest using many high-resource language pairs to find good initial parameters of the model. This initialization then allows a new translation model to be trained on a low-resource language pair with only a few learning steps. For example, a model initialized with 18 high-resource language pairs achieved a BLEU score of 22.04 on a new language pair after seeing only around 600 parallel sentences.
What’s the core idea of this paper?
- The paper introduces a new meta-learning method, MetaNMT, which uses many high-resource language pairs to find good initial model parameters and then trains a new translation model on a low-resource language pair starting from those initial parameters (see the sketch after this list).
- Meta-learning can be applied to low-resource machine translation only if the input and output spaces are shared across all the source and target tasks. However, this is generally not the case, since different languages have different vocabularies. To tackle this issue, the researchers dynamically build a vocabulary specific to each language using a key-value memory network.
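For intuition, here is a minimal, self-contained sketch of a first-order MAML-style meta-update, illustrated on toy one-parameter regression "tasks" rather than on translation models; every detail below (task definition, learning rates, first-order approximation) is an illustrative assumption, not the authors' implementation.

```python
# Toy first-order MAML loop: learn an initialization that adapts quickly to new tasks.
import numpy as np

rng = np.random.default_rng(0)

def make_task():
    """Each 'task' stands in for one high-resource language pair; here it is y = w * x."""
    w = rng.normal()
    x = rng.normal(size=16)
    return x, w * x

def loss_and_grad(theta, x, y):
    """Squared error of the one-parameter model pred = theta * x, and its gradient."""
    pred = theta * x
    return np.mean((pred - y) ** 2), 2 * np.mean((pred - y) * x)

theta = 0.0                      # the meta-initialization that meta-learning produces
inner_lr, meta_lr = 0.1, 0.01

for step in range(1000):
    x, y = make_task()                               # sample a source task
    # inner step: simulate low-resource fine-tuning from the current initialization
    _, g = loss_and_grad(theta, x[:8], y[:8])
    adapted = theta - inner_lr * g
    # outer step: move the initialization so that the adapted model does well on held-out data
    _, meta_g = loss_and_grad(adapted, x[8:], y[8:])
    theta -= meta_lr * meta_g
```

In MetaNMT, the single parameter above is replaced by the full set of translation-model parameters and each task is one of the high-resource language pairs; after meta-training, a new low-resource pair is handled by running only the inner, fine-tuning step.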
What’s the key achievement?
- Suggesting a new approach to Neural Machine Translation for extremely low resource languages, which:
- enables sharing the information between high-resource and extremely low-resource language pairs;
- uses only a few thousand sentences to fine-tune a new translation model on a low-resource language pair;
- fine-tunes for a new language in a couple of minutes.
- The experiments demonstrate that:
- Meta-learning consistently performs better than multilingual transfer learning.
- The choice of language pair for a validation set for meta-learning impacts the performance of the resulting model. For example, Finnish-English benefits more when Romanian-English is used for validation, while Turkish-English prefers validation on Latvian-English.
What does the AI community think?
- The paper was presented at EMNLP 2018, a leading conference in the area of natural language processing.
- The presented approach received a Facebook award for low-resource neural machine translation.
What are future research areas?
- Meta-learning for semi-supervised Neural Machine Translation, or learning to learn from monolingual corpora.
- Multi-modal meta-learning, where multiple meta-models are learned and a new language can freely choose which model to adapt from.
What are possible business applications?
- MetaNMT can be used to improve the results of machine translation for language pairs where the available parallel corpora are extremely small.
3. Understanding Back-Translation at Scale, by Sergey Edunov, Myle Ott, Michael Auli, David Grangier
Original Abstract
An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences. This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences. We find that in all but resource-poor settings back-translations obtained via sampling or noised beam outputs are most effective. Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam or greedy search. We also compare how synthetic data compares to genuine bitext and study various domain effects. Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT’14 English-German test set.
Our Summary
This paper from the Facebook AI Research team investigates back-translation for neural machine translation at a large scale: the parallel training corpus is augmented with hundreds of millions of back-translated sentences. A comprehensive analysis of different methods for generating synthetic source sentences shows that data obtained via sampling or noised beam search provides the strongest training signal. The experiments demonstrate that the big Transformer architecture combined with back-translation achieves state-of-the-art results on the WMT’14 English-French and WMT’14 English-German datasets, with 45.6 BLEU and 35 BLEU respectively.
What’s the core idea of this paper?
- Augmenting the parallel training corpus with back-translations of target language sentences:
- training an intermediate target-to-source system on the parallel data;
- using this system to translate the target monolingual data into the source language;
- obtaining an additional parallel corpus for training source-to-target systems, where the source side is synthetic machine translation output and the target side is real text written by humans.
- Generating synthetic source sentences by sampling from the model distribution or by adding noise to beam search outputs (see the sketch after this list):
- Back-translation typically uses beam or greedy search, where the sentence with the largest estimated probability is chosen as the output. This often results in very regular synthetic sentences that do not properly cover the true data distribution.
- Sampling from the model distribution or noising beam outputs outperforms pure beam search by 1.7 BLEU on average.
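As a rough illustration of the "noised beam" idea, the sketch below applies three kinds of noise mentioned in the paper (word deletion, filler-token replacement, and small local reorderings) to a back-translated sentence; the probabilities, window size, and filler token here are illustrative assumptions rather than the exact configuration used in the experiments.

```python
# Toy noising of a synthetic (back-translated) source sentence.
import random

def noise_sentence(tokens, p_drop=0.1, p_blank=0.1, max_shift=3, filler="<BLANK>"):
    # 1) randomly delete some words (keep at least one token)
    kept = [t for t in tokens if random.random() > p_drop] or tokens[:1]
    # 2) replace some words with a filler token
    blanked = [filler if random.random() < p_blank else t for t in kept]
    # 3) lightly shuffle word order: each word may move at most a few positions
    keys = [i + random.uniform(0, max_shift) for i in range(len(blanked))]
    return [tok for _, tok in sorted(zip(keys, blanked))]

beam_output = "die katze sitzt auf der matte".split()   # pretend this is a beam-search back-translation
synthetic_source = noise_sentence(beam_output)
```

Sampling is the alternative the authors find similarly effective: instead of noising the single best beam hypothesis, the synthetic source is drawn token by token from the backward model's output distribution.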
What’s the key achievement?
- The best setup presented in the paper is the current state of the art for WMT’14 English-French and WMT’14 English-German datasets with 45.6 BLEU and 35 BLEU respectively.
- The experiments also show that synthetic data can achieve up to 83% of the performance attainable with real bitext.
What does the AI community think?
- The paper was presented at EMNLP 2018, a leading conference in the area of natural language processing.
What are future research areas?
- Investigating an end-to-end approach where the back-translation model is optimized to output synthetic sources that are most helpful to the final forward model.
What are possible business applications?
- Translations generated by this neural machine translation system for the English-French and English-German language pairs may already be good enough to be deployed in a business setting.
Where can you get implementation code?
- A reference implementation of the introduced model is available as part of the Fairseq toolkit.
4. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, by Mikel Artetxe and Holger Schwenk
Original Abstract
We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting sentence embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our approach sets a new state-of-the-art on zero-shot cross-lingual natural language inference for all the 14 languages in the XNLI dataset but one. We also achieve very competitive results in cross-lingual document classification (MLDoc dataset). Our sentence embeddings are also strong at parallel corpus mining, establishing a new state-of-the-art in the BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new test set of aligned sentences in 122 languages based on the Tatoeba corpus, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our PyTorch implementation, pre-trained encoder and the multilingual test set will be freely available.
Our Summary
This is yet another research paper from the Facebook AI Research team in the multilingual domain. In this paper, the researchers introduce a new architecture that learns joint multilingual sentence representations. The system is based on a single language-agnostic BiLSTM encoder with a shared vocabulary across 93 languages. The suggested approach establishes a new state of the art for most of the languages on several multilingual tasks, including zero-shot cross-lingual transfer, cross-lingual document classification, and bitext mining.
What’s the core idea of this paper?
- Using a single language-agnostic BiLSTM encoder coupled with an auxiliary decoder and trained on parallel corpora (see the sketch after this list):
- building a joint byte-pair encoding (BPE) vocabulary for all languages, thus keeping the encoder unaware of the input language and encouraging it to learn language-independent representations;
- in contrast, providing the decoder with a language ID embedding that specifies the language to generate.
- Using two target languages, English and Spanish, without requiring every source sentence to be translated into both of them.
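The sketch below is a minimal PyTorch rendering of this setup: a shared BiLSTM encoder whose hidden states are max-pooled into a fixed-size sentence embedding, plus an auxiliary decoder that additionally receives a language-ID embedding. The layer sizes and the exact way the inputs are combined are illustrative assumptions, not the paper's precise configuration.

```python
# Minimal sketch of a shared multilingual encoder with an auxiliary, language-aware decoder.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=320, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # joint BPE vocabulary, no language ID
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        states, _ = self.bilstm(self.embed(token_ids))   # (batch, seq, 2 * hidden_dim)
        return states.max(dim=1).values                  # fixed-size sentence embedding

class AuxiliaryDecoder(nn.Module):
    def __init__(self, vocab_size, num_langs, emb_dim=320, hidden_dim=512, sent_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lang_embed = nn.Embedding(num_langs, 32)    # tells the decoder which language to generate
        self.lstm = nn.LSTM(emb_dim + 32 + sent_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, sent_emb, lang_id):
        seq_len = prev_tokens.size(1)
        lang = self.lang_embed(lang_id).unsqueeze(1).expand(-1, seq_len, -1)
        sent = sent_emb.unsqueeze(1).expand(-1, seq_len, -1)
        inputs = torch.cat([self.embed(prev_tokens), lang, sent], dim=-1)
        out, _ = self.lstm(inputs)
        return self.proj(out)                            # per-token vocabulary logits
```

The key design choice is that only the decoder is told which language to produce; because the encoder never sees a language ID and shares one BPE vocabulary across all 93 languages, it is pushed toward language-independent sentence representations.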
What’s the key achievement?
- Establishing a new state of the art on zero-shot cross-lingual natural language inference for all languages but Spanish, thus outperforming the multilingual BERT model in zero-shot transfer.
- Getting also state-of-the-art results for most languages in:
- cross-lingual document classification (state of the art for 5 of the 7 language transfers);
- bitext mining (best result for 3 out of 4 language pairs).
- Introducing a new test set of aligned sentences in 122 languages.
What does the AI community think?
- “Like the fabled tower of Babel, AI researchers have for years sought a mathematical representation that would encapsulate all natural language. They’re getting closer,” wrote ZDNet in its review of this research paper.
What are future research areas?
- Exploring alternative architectures for the encoder, such as replacing the BiLSTM with a Transformer.
- Exploiting monolingual training data in addition to parallel corpora using pre-trained word embeddings, back-translation or other strategies from unsupervised machine translation.
- Replacing language-specific tokenization and BPE segmentation with a language agnostic approach.
What are possible business applications?
- A single system based on universal sentence encoding can support hundreds of languages while being data- and time-efficient. This makes it a good candidate for deployment in business settings where many different languages are involved.
Where can you get implementation code?
- The LASER library contains the PyTorch implementation of multilingual sentence embeddings.
5. Cross-lingual Language Model Pretraining, by Guillaume Lample and Alexis Conneau
Original Abstract
Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT’16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT’16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.
Our Summary
While the approach suggested by Artetxe and Schwenk (2018) relies on over 200 million parallel sentences, this research paper by another Facebook AI team suggests a method that does not require such large parallel training corpora. In particular, the authors propose to learn cross-lingual language models with (1) an unsupervised method that relies on monolingual data only, and (2) a new supervised learning approach that can be used when parallel data is available. The experiments show that these approaches significantly outperform the previous state of the art in cross-lingual classification (+4.9% accuracy), unsupervised machine translation (+9.1 BLEU on WMT’16 German-English), and supervised machine translation (+4.6 BLEU on WMT’16 Romanian-English).
What’s the core idea of this paper?
- Introducing a Masked Language Modeling (MLM) objective, similar to BERT’s, that relies on monolingual data only.
- Introducing Translation Language Modeling (TLM), an extension of MLM for the case when parallel data is available (see the sketch after this list):
- TLM operates on concatenated parallel sentences instead of monolingual text streams.
- Words are randomly masked in both the source and target sentences, so the model can use the context from one language to predict tokens in the other.
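To make the masking scheme concrete, here is a toy sketch of how a TLM-style training example could be built from a single parallel sentence pair; the special tokens, masking probability, and position handling are simplified, illustrative assumptions rather than XLM's actual preprocessing.

```python
# Toy construction of a TLM-style example.
# Source and target sentences are concatenated, and tokens on BOTH sides are randomly
# masked, so predicting a masked word can use context from either language.
import random

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, mask_token="[MASK]"):
    tokens = src_tokens + ["</s>"] + tgt_tokens
    # position ids restart at zero for the target sentence, which the paper notes helps alignment
    positions = list(range(len(src_tokens) + 1)) + list(range(len(tgt_tokens)))
    inputs, labels = [], []
    for tok in tokens:
        if tok != "</s>" and random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)      # the model is trained to recover this token
        else:
            inputs.append(tok)
            labels.append(None)     # not a prediction target
    return inputs, labels, positions

inputs, labels, positions = make_tlm_example(
    "the cat sits on the mat".split(),
    "die katze sitzt auf der matte".split())
```

Because a masked word on the target side can often be recovered from the aligned source words (and vice versa), the objective explicitly encourages cross-lingual alignment of the learned representations.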
What’s the key achievement?
- Demonstrating that cross-lingual language models provide a very good initialization for sentence encoders used in zero-shot cross-lingual classification, as well as for supervised and unsupervised machine translation.
- Establishing a new state of the art in:
- cross-lingual classification with a 4.9% gain in accuracy;
- unsupervised machine translation by improving the previous best result by 9.1 BLEU on WMT’16 German-English;
- supervised machine translation by outperforming the previous state-of-the-art result by 4.6 BLEU on WMT’16 Romanian-English.
What does the AI community think?
- The paper received positive feedback from the AI community as an extension of BERT to the cross-lingual setting with very impressive results.
What are possible business applications?
- Cross-lingual language modeling improves the quality of translation for languages where parallel training data is available, and, even more importantly, it provides a very promising solution for low-resource language pairs. This makes the suggested approach a good candidate for deployment in business settings where many different languages are involved.
Where can you get implementation code?
- The original PyTorch implementation of this research paper, along with pre-trained cross-lingual language models, is publicly available on GitHub.
Did you enjoy this AI research analysis and summary? You can read our longer summary of 14 top Natural Language Processing (NLP) research papers from 2018.