Machine translation with Hugging Face: how to use the translation pipeline

How to make your own translator system with Hugging Face.

We're Apify, a full-stack web scraping and browser automation platform that serves as a bridge between the web's vast data resources and sophisticated AI tools like Hugging Face.

Following our article on text and token classification in NLP, our Hugging Face series continues here.


They say variety is the spice of life, and human languages are an excellent expression of that diversity. Language goes beyond words: idioms and their evolution, for example, can be a measure of the genius of the community behind them.

With so many languages, one may well wonder at the meaning of words written in a foreign tongue. But even if we restrict ourselves to some major languages like Arabic, Chinese, French, English, Persian, Spanish, and Urdu (these seven cover the majority of the world's population), it's not easy to learn so many languages, or even a subset of them.

As a result, language translation (and translators) has always been in huge demand. Be it translating documents, talking with a foreign delegate, intelligence work, or simply reading foreign literature, we require their services.

Machine translation

Because human translators are scarce and, being human, limited in how much they can handle, people have long wondered whether computers could take on the task. With the advent of Artificial Intelligence in the 20th century came an over-optimistic interest in machine translation, especially from government agencies. Failing to grasp the scale of the challenge, those agencies set expectations far too high, and the resulting disappointment led to hasty funding decisions that contributed to the first AI winter.

With the recent advancement of sequence models and the rise of transformers, machine translation is achieving impressive accuracy. And thanks to the number of freely available models, we can build our own translator system rather than relying entirely on services like Bing Translator or Google Translate.

Hugging Face translation pipeline

For translation, we can import the pipeline and then specify the task as translation_<source language>_to_<destination language>.

For example, from English to French, we can specify it as follows:

!pip install sentencepiece
!pip install transformers datasets

# sentencepiece is required by many translation models (see the note further down)
import sentencepiece
from transformers import pipeline

# Default English-to-French pipeline (backed by t5-base)
frenchTranslator = pipeline("translation_en_to_fr")
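
Once the pipeline is loaded, translating is a single call. Here's a quick sanity check with a sentence of my own (the exact output depends on the model version, so focus on the shape of the result rather than the wording):

result = frenchTranslator("Hugging Face makes machine translation easy.")
print(result)
# A list with one dict per input, e.g. [{'translation_text': '...'}]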

Support for other languages

By default, the translation pipeline uses t5-base, which inevitably doesn't support every language. For example, if I try to build a Turkish translator, it throws an error:

from transformers import pipeline

# There is no default English-to-Turkish model, so this fails:
turkishTranslator = pipeline("translation_en_to_tr")
# ValueError: The task does not provide any default models for options ('en', 'tr')

Fret not. In case it doesn’t support a particular language combination, we can switch to some other model.
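
One way to find a replacement is to search the Hub programmatically. Here's a minimal sketch using huggingface_hub (assuming a reasonably recent version of the library; the search keywords are just a starting point, not a guaranteed query):

from huggingface_hub import HfApi

api = HfApi()
# Print a few candidate English-to-Turkish translation models
for model in api.list_models(search="translation en-tr", limit=5):
    print(model.id)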

That's what I did. I found a relevant model, and now I'm going to try it here:

turkishTranslator = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-tr")
turkishTranslator("Byzantine empire was a jewel of the world.")

# Output: [{'translation_text': 'Bizans İmparatorluğu dünyanın bir mücevheriydi.'}]

Note: Loading a model often throws a sentencepiece error, which is why we installed and imported the package at the very start. If you still run into it, install sentencepiece, restart the Python kernel, and rerun the notebook.

Other models

While the list of available models keeps growing (3,000+ translation models as of writing this in December 2023), I'd like to highlight a couple of them:

  • T5-large
  • FAIR WMT19 Models
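
As a quick illustration of swapping one of these in, here's how a FAIR WMT19 checkpoint plugs into the same pipeline (a sketch assuming the facebook/wmt19-en-de checkpoint; its sibling language pairs work the same way):

germanTranslator = pipeline("translation", model="facebook/wmt19-en-de")
germanTranslator("Machine translation has come a long way.")
# [{'translation_text': '...'}]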

Datasets

I briefly touched upon the importance of fine-tuning and Hugging Face's support for it, but Hugging Face doesn't stop there. It also provides some high-quality datasets, all contributed by the community. For example:

  • arXiv - contains more than 1.7 million arXiv submissions.
  • Opus-100 - contains pairs of source and destination language sentences. It's a huge dataset with over 55M sentence pairs across 100+ languages (hence the 100 in the name). English is the reference language here and appears (as either source or destination) in every pair.
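
To give a feel for how accessible these are, here's a sketch of loading the English-Turkish subset of Opus-100 with the datasets library (the opus100 dataset id and the en-tr config name are assumptions based on the Hub listing):

from datasets import load_dataset

# Each record holds one source/destination sentence pair
opus = load_dataset("opus100", "en-tr", split="train")
print(opus[0])
# {'translation': {'en': '...', 'tr': '...'}}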

Examples

With that introduction to the translation pipeline and respective models, let’s test them out for different languages.

We'll take article 1 from the Universal Declaration of Human Rights and try to translate it into different languages.

We already have French and Turkish translators, so let's use them first.

originalText = "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood."

frenchTranslation = frenchTranslator(originalText)
turkishTranslation = turkishTranslator(originalText)

print(frenchTranslation)
# French Output: [{'translation_text': 'Tous les êtres humains naissent libres et égaux en dignité et en droits; ils sont dotés de la raison et de la conscience et doivent agir les uns envers les autres dans un esprit de fraternité.'}] 

print(turkishTranslation)
# Turkish Output: [{'translation_text': 'Bütün insanlar, haysiyet ve haklar bakımından özgür ve eşit doğarlar. Akıl ve vicdanla donatılırlar ve kardeşlik ruhu içinde birbirlerine karşı hareket etmelidirler.'}]

Out of curiosity, I tested the quality of the translation by comparing it with the Bing translator. And here's how Bing translated it:

French: Tous les êtres humains naissent libres et égaux en dignité et en droits. Ils sont doués de raison et de conscience et doivent agir les uns envers les autres dans un esprit de fraternité.
Turkish: Bütün insanlar özgür, onur ve haklar bakımından eşit doğarlar. Akıl ve vicdanla donatılmışlardır ve birbirlerine karşı kardeşlik ruhuyla hareket etmelidirler.

I'm no polyglot, but I can still compare the respective translations and see that the Hugging Face models have done pretty well there. We don’t even need cosine similarity to check how similar they are.
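
That said, if you did want a quantitative check, multilingual sentence embeddings make it a one-liner. Here's a minimal sketch using the sentence-transformers library (the model name is one common multilingual choice, not the only option):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
bingFrench = "Tous les êtres humains naissent libres et égaux en dignité et en droits. Ils sont doués de raison et de conscience et doivent agir les uns envers les autres dans un esprit de fraternité."
embeddings = model.encode([frenchTranslation[0]['translation_text'], bingFrench])
print(util.cos_sim(embeddings[0], embeddings[1]))  # near 1.0 for near-identical sentences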

Let’s try it for some other languages. It would be wise to delete the used models (if not required anymore):

del frenchTranslator
del turkishTranslator
spanishTranslator = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-es")
spanishTranslator(originalText)

# Output: [{'translation_text': 'Todos los seres humanos nacen libres e iguales en dignidad y derechos; están dotados de razón y conciencia y deben actuar unos con otros en un espíritu de hermandad.'}]

Its respective translation using Bing is:

Todos los seres humanos nacen libres e iguales en dignidad y derechos. Están dotados de razón y conciencia y deben comportarse fraternalmente los unos con los otros.
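
One caveat about the del statements above: deleting the Python reference only makes the memory eligible for collection; it doesn't force it, and on a GPU the CUDA cache may still hold on to it. A hedged sketch of a more thorough cleanup:

import gc
import torch

del spanishTranslator          # drop the last reference to the pipeline
gc.collect()                   # prompt CPython to reclaim the memory now
if torch.cuda.is_available():
    torch.cuda.empty_cache()   # release cached GPU memory back to the driver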

Translation from other languages

Strong performance when translating from English can hardly come as a surprise to anyone. So I would like to try some other source language - say, French.

notreDamedeParisQuote = "C’est que l’amour est comme un arbre, il pousse de lui-même, jette profondément ses racines dans tout notre être, et continue souvent de verdoyer sur un cœur en ruine."

french_en_Translator = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-fr-en")
french_en_Translator(notreDamedeParisQuote)

# Output: [{'translation_text': 'Love is like a tree, it grows on its own, it has deep roots in our whole being, and it often continues to grow green on a heart in ruins.'}]

The translation is remarkable. Bing produces pretty much the same result:

This is because love is like a tree, it grows of its own accord, throws its roots deep into our whole being, and often continues to grow green on a ruined heart.

Now let's take English out of the loop and translate directly into another language:

french_es_Translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-es")
french_es_Translator(notreDamedeParisQuote)

# Output: [{'translation_text': 'Es que el amor es como un árbol, crece de sí mismo, echa profundamente sus raíces en todo nuestro ser, y a menudo continúa verdoleando sobre un corazón en ruinas.'}]

I don't know Spanish, so again I got help from Bing, and it seems the model did pretty well - notice from the model's name that, unlike the English one, this isn't one of the 'big' variants.

Bing’s version is:

Español: Esto se debe a que el amor es como un árbol, crece por sí mismo, echa sus raíces profundamente en todo nuestro ser y, a menudo, continúa reverdeciendo en un corazón arruinado.

Similarly, for German, the Bing version is:

Deutsch: Denn die Liebe ist wie ein Baum, sie wächst von selbst, schlägt ihre Wurzeln tief in unser ganzes Wesen und wächst oft weiter grün auf einem zerstörten Herzen.

The Hugging Face model translates it like this:

french_de_Translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-de")
french_de_Translator(notreDamedeParisQuote)

# Output: [{'translation_text': 'Es ist, dass die Liebe ist wie ein Baum, er wächst von sich selbst, wirft tief seine Wurzeln in unser ganzes Wesen, und oft weiterhin auf einem zerrütteten Herzen grün.'}]

Translation from underrepresented languages

The translations have been pretty fair so far, but let's make things a bit more challenging for our models with another language - say, Persian.

gulistanQuote = " تو کز محنت دیگران بی‌غمی  نشاید که نامت نهند آدمی"

A little searching turned up a good enough model. It's worth realizing here that the majority of NLP research and engineering focuses on English and a few major languages like French or Spanish, while languages like Persian or even Arabic remain underrepresented despite their huge literary heritage.

persian_en_Translator = pipeline("translation", model="persiannlp/mt5-small-parsinlu-opus-translation_fa_en", max_length=50)

persian_en_Translator(gulistanQuote)

# Output: [{'translation_text': 'You are a fool of the ignorant beings who are not your names'}]

Translation is still challenging for other languages

Although it's not very far off from the correct translation (“If you have no sympathy for human pain / The name of human you can not pertain”), the model has clearly garbled it. In a fast-paced world such as ours, such inaccuracies are hardly acceptable.

This little experiment actually highlights a much bigger issue: most languages (even ones as major as Persian) aren't lucky enough to have sufficient resources for NLP tasks. The lack of quality datasets and of researcher interest means we don't have enough material to train (or fine-tune) models for these languages. With initiatives like Aya, hopefully, things will improve in the coming years.

Conclusion

So, let's wrap up. Globalization has made translation undeniably valuable, but transformers by themselves can't do much until sufficient quality datasets are available. The skew of research (including datasets) towards a handful of Indo-European languages means there's still a lot of room to contribute with new models and, especially, new datasets.

In the next article of this Hugging Face series, we'll talk about another exciting NLP task: text generation.

Talha Irfan
I love reading Russian Classics, History, Cricket, Nature, Philosophy, and Science (especially Physics)— a lifelong learner. My goal is to learn and facilitate the learners.
