How web scraping is bridging language barriers

When a research team set out to build a machine translation module for Syrian and Moroccan migrants and refugees, it faced a significant challenge: how do you get enough data for dialects used only in casual conversation? Here's how web scraping became the solution.

The Welcome project

What do web scraping and the integration of third-country nationals have in common? The answer is Hashem Sellat, Ph.D. student and research assistant at Charles University’s Institute of Formal and Applied Linguistics (UFAL). Hashem and UFAL are involved in the Welcome project, which has received funding from the EU Horizon 2020 research and innovation program.

The Welcome project aims to research and develop intelligent technologies that support the reception and integration of migrants and refugees in Europe. Fifteen partners from across Europe are involved: six research institutions, three technology companies, and six user partners working directly on migrant reception and integration. The project will deliver two applications: a mobile client that lets third-country nationals interact with virtual agents, and a desktop application that supports authority workers.

How is the institute participating in this? By building a machine translation system to provide instant translation between Arabic dialects and European languages.


What is neural machine translation and how does it work?

If you have ever come across unknown words in a foreign language, you have probably pasted them into an online translation platform to find their meaning. That is machine translation. Machine translation, or MT, uses software to convert text from a source language into a target language.

Neural machine translation is a modern approach that uses neural networks for the translation task. It usually consists of two main parts: an encoder, which converts the source text into a vector representation, and a decoder, which generates text in the target language from that representation. The amount of data needed to train such a system is substantial but feasible to collect, and if training is done well, the system produces good results.
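To make the encoder-decoder idea concrete, here is a minimal sketch in Python that runs an off-the-shelf Arabic-to-English model from Hugging Face. The model checkpoint is an illustrative choice, not the Welcome project's own system:

```python
# Minimal encoder-decoder translation sketch (illustrative, not the
# Welcome project's system). Assumes: pip install transformers sentencepiece
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-ar-en"  # pretrained Arabic -> English model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = "مرحبا، كيف حالك؟"  # "Hello, how are you?"

# The encoder turns the source tokens into vector representations...
inputs = tokenizer(text, return_tensors="pt")
# ...and the decoder generates target-language tokens from those vectors.
output_ids = model.generate(**inputs)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```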


Neural MT for low-resourced languages

In the field of Natural Language Processing, a language is considered low-resourced if the textual data available for it is scarce. However, some techniques can produce decent results even with little data in hand. Neural MT offers several, e.g., zero-shot translation, transfer learning, and semi-supervised approaches. Other techniques help get the most out of the data that does exist, such as data augmentation and back-translation, sketched below.
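To illustrate one of these techniques, here is a hedged sketch of back-translation: monolingual text on the target side is translated back into the source language with an existing model, producing synthetic parallel pairs that can supplement scarce authentic data. The model checkpoint and example sentences are illustrative assumptions, not the project's actual setup:

```python
# Back-translation sketch: synthesize (source, target) training pairs from
# monolingual target-side text. Assumes: pip install transformers sentencepiece
from transformers import pipeline

# English -> Arabic model, used to generate synthetic Arabic "sources"
back_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")

# Monolingual English sentences (the authentic target side)
monolingual_english = [
    "Where is the nearest registration office?",
    "I need help with my residence permit.",
]

synthetic_pairs = []
for sentence in monolingual_english:
    arabic = back_translator(sentence)[0]["translation_text"]
    # Keep (synthetic Arabic source, authentic English target) as a pair
    synthetic_pairs.append((arabic, sentence))

for src, tgt in synthetic_pairs:
    print(f"{src}  ->  {tgt}")
```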

However, even if it is somehow possible to overcome the problem of low-resourced language translation, it is still extremely hard or even impossible to get reasonable results if we have no data at all to start with.


Building the MT module for the Welcome project

The machine translation system for the Welcome project is based on the neural MT paradigm. The supported languages are modern standard Arabic, the Syrian dialect, and the Moroccan dialect (Darija) on one side, and English, Spanish, German, Greek, and Catalan on the other: three Arabic variants times five European languages makes 15 language pairs in total.

The challenge: where to find data?

To get the module off the ground, UFAL needed data, and lots of it, but finding data in Syrian Arabic and Darija was a big challenge. Why? Because Arabic speakers use these dialects only in casual conversation, not in formal writing. That means publicly available resources are scarce.

The solution: web scraping social media

Where could dialectal data be found to feed the MT system? In posts and comments on Twitter and Facebook. On those platforms, Arabic speakers tend to write in their spoken dialects rather than standard Arabic. By scraping text from these two social media platforms, enough dialectal data could be collected to train the MT system.

Web scraping social media comments in different dialects can provide data for machine translation

Why Apify?

Hashem was able to collect tweets from specific geolocations by using the Twitter streaming API. Extracting Facebook comments was much more difficult: Facebook is hard to crawl at scale because it is very sensitive to bot activity, regardless of the purpose. Hashem therefore turned to Apify for its simple yet rich interface, its customizability, its extensive API that enables workflow automation, and its wide range of social media scrapers, most notably Apify's Facebook Latest Comments Scraper and Facebook Post Comments Scraper.
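For illustration, here is a minimal sketch of fetching Facebook comments with Apify's Python client. The actor ID and input fields below are assumptions made for this example; the exact ID and input schema should be checked on the actor's page in Apify Store:

```python
# Hedged sketch: run an Apify Facebook comments actor and read its results.
# Assumes: pip install apify-client; actor ID and input are illustrative.
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # personal API token from Apify Console

# Start the actor and wait for the run to finish.
run = client.actor("apify/facebook-comments-scraper").call(
    run_input={
        "startUrls": [{"url": "https://www.facebook.com/some-public-post"}],
    }
)

# Each run stores its output in a default dataset; iterate over the comments.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("text"))
```

Each run's dataset can also be exported from the Apify platform as JSON or CSV, ready for downstream cleaning and training.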

The results

Thanks to Apify’s web scraping tools, Hashem managed to collect 380,000 Facebook comments in the Syrian dialect and 120,000 Facebook comments in the Moroccan dialect (as of April 1, 2022).

He finally had the data necessary to proceed with training the MT system.

“Why do I love Apify? Because of its simple yet rich interface and customizability (we programmers love that!). Apify has solutions for a wide range of social media platforms and a rich API that allows me to automate my workflow.”

Hashem Sellat, Ph.D. student and research assistant at the Institute of Formal and Applied Linguistics, Charles University.

Get web scraping solutions

Thanks to Apify’s social media scrapers, UFAL was able to collect the necessary data for its machine translation training model to bridge language barriers and support the reception and integration of migrants and refugees in Europe.


If you want to know how Apify’s web scraping tools can help you in your projects, look at our success stories to find out how web scraping has helped other institutes and companies in their work. You can also check our use cases page to learn how Apify’s web scraping and automation tools can help in a range of areas, from marketing and real estate to academic research and machine learning.
