While open-source initiatives are still underway, Google and Microsoft have both added the Odia language of India to their respective machine translation engines this year — Google Translate in February and Microsoft more recently on August 13.
Odia is the official language of the Indian state of Odisha and the second official language in the state of Jharkhand. It's spoken by about 35 million native speakers and as a second language by about 4 million people. It is also classified by the Indian government as one of the country's classical languages based on a set of requirements that includes a literary tradition of more than 1,500 years.
However, Odia's digital presence is limited. For instance, the Odia Wikipedia, which is one of the largest repositories of Odia textual content, currently has only 15,858 articles after being revived in 2011 following a nine-year-long hiatus. In contrast, Malayalam, which has almost the same number of speakers as Odia, has about 70,000 articles on Wikipedia. For a very long time, content in Odia was available online in the form of images and PDFs, and some publications, including the Odisha state government-run magazine Utkal Prasanga, still publish in a combination of image and PDF. Late adoption of Unicode has made content less searchable.
Machine translation significantly contributes to increasing the digital presence of languages by making content more searchable and more accessible to non-speakers.
Microsoft-operated cloud services, including the Microsoft Translator app, Office, Translator for Bing, and the Azure Cognitive Services Translator, will now all support translations from Odia. Both Microsoft Translator and Google Translate (available both on the web and as an app) allow translation of text copied directly into the input field.
These platforms also support the translation of text documents, websites, and live chats. The Google Translate mobile app has additional features, including offline translation, handwriting recognition, scanning, translating and reading text from images, and using voice commands to speak to a foreign-language speaker. A feature called “tap to translate” allows a user to directly translate typed text inside any app. One can also hear how text in a supported language is pronounced using Google’s speech synthesis.
The addition of Odia was well received by the Odisha state government. The Office of the Chief Minister of Odisha tweeted:
#OdiaTranslation has now been added by @Microsoft to its @mstranslator, becoming the 12th commonly used Indian language to be added. This will facilitate access of global information in #Odia and promote inter-language interactions. https://t.co/O4dZgZhbrs
— CMO Odisha (@CMO_Odisha) August 17, 2020
The Electronics and Information Technology Department of the Government of Odisha also reacted:
Used by millions across the world, @Google Translate has now added #Odia to its list of supported languages. A major step towards promoting digital literacy in our native language & to help millions of non-speakers embrace it. #OdiaOnGoogle @CMO_Odisha https://t.co/lfSskvxSjR
— E&IT Department Odisha (@EIT_Odisha) February 28, 2020
Machine Translation
Machine translation converts text or speech from a source language into a target language. Google's translation relies on neural machine translation, a computational approach that uses artificial neural networks trained on large datasets of phrases paired with their translations (from source to target language).
With the inclusion of Odia, Google Translate and Microsoft Translator now support 11 Indian languages each. In total, Google supports 109 world languages while Microsoft supports 73.
Meanwhile, open-source initiatives are yet to create successful machine translation projects in Odia.
At least one open-source community-led project is in development — MTEnglish2Odia is training a machine translation engine by collecting translation pairs from existing sources such as Odia-language Wikipedia and crowdsourcing from user contributions on Twitter.
In addition, some research and resources exist that other organizations can use to build machine translation engines.
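To make the idea of collecting translation pairs more concrete, here is a minimal Python sketch of how crowdsourced English-Odia pairs might be cleaned before being used for training. It is an illustration only, not MTEnglish2Odia's actual pipeline: the function names and the sample data are hypothetical, and the only external fact it relies on is that the Odia (Oriya) script occupies the Unicode range U+0B00 to U+0B7F.

```python
import unicodedata

# Unicode block for the Odia (Oriya) script: U+0B00 to U+0B7F.
ODIA_START, ODIA_END = 0x0B00, 0x0B7F


def contains_odia(text: str) -> bool:
    """Return True if any character falls inside the Odia Unicode block."""
    return any(ODIA_START <= ord(ch) <= ODIA_END for ch in text)


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(raw_pairs):
    """Keep unique (English, Odia) pairs whose Odia side is in Odia script.

    Hypothetical helper for illustration; a real project would likely
    apply many more checks (length ratios, language ID, human review).
    """
    seen = set()
    cleaned = []
    for english, odia in raw_pairs:
        english, odia = normalize(english), normalize(odia)
        if not english or not odia:
            continue  # drop pairs with an empty side
        if contains_odia(english) or not contains_odia(odia):
            continue  # drop pairs whose sides look swapped or mislabeled
        key = (english.lower(), odia)
        if key in seen:
            continue  # drop duplicates (English compared case-insensitively)
        seen.add(key)
        cleaned.append((english, odia))
    return cleaned


# Made-up sample contributions, for illustration only.
samples = [
    ("Odisha", "ଓଡ଼ିଶା"),
    ("  odisha ", "ଓଡ଼ିଶା"),    # duplicate once whitespace and case are ignored
    ("language", "language"),  # Odia side is not written in Odia script
    ("language", "ଭାଷା"),
]

# Print the surviving pairs as tab-separated lines, a common corpus layout.
for english, odia in clean_pairs(samples):
    print(f"{english}\t{odia}")
```

The script check only works because the text is encoded in Unicode; Odia content published as images or PDFs scans would first need to be converted to text before it could be filtered or searched this way.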
The politics of machine translation
The technology used by Google Translate and Microsoft Translator raises complex social, legal, ethical, and rights-related questions.
A machine translation platform can be of great use for many people, including journalists who need to quickly access news in multiple languages and students who want to learn from multilingual resources.
Similarly, voice synthesis support helps people with disabilities, especially those who are blind, to access and share information more easily.
Education, media, and the entertainment industry also benefit from the potential of Google Translate to translate vast amounts of content in a short period of time.
On the other hand, machine translation can contribute to spreading misinformation, while voice synthesis facilitates fraudsters who look to prey on people by communicating with them in their language.
There are over 6,000 documented languages around the world, and only a minority of them have established writing systems. Those are the languages that get to be included in machine translation projects such as Google Translate and Microsoft Translator.
The availability of online content and the number of internet users who speak a given language are major factors that for-profit corporations consider when deciding which languages to include in their systems. The more languages a corporation supports, the more targeted content it can deliver to users, and the more revenue it generates from ads.
In addition, there are ethical questions of attribution and remuneration in projects such as Google Translate, which has a contributor community structure to review existing translations (which helps engineers regularly improve the tool).
While Google is a for-profit company with many paid products, including a cloud translation service, neither the individual volunteers nor the numerous public sources from which the machine learns are attributed or remunerated.
The use of private communications for improving machine learning and artificial intelligence is also controversial from a privacy standpoint, though Google has been working towards anonymizing such data.
In a country like India, where the creation of multilingual content faces bottlenecks due to costs, products such as Google Translate and Microsoft Translator can revolutionize the Indian content economy. They can make a difference for projects such as Wikipedia, which currently exists in 23 Indian languages, or StoryWeaver, a multilingual online children’s literature platform that heavily relies on volunteer work.
With many Indian languages disappearing rapidly, and with the added challenge of illiteracy and digital accessibility, the communications pathway needs more innovation in voice and visual technology. Machine translation can be a viable tool to stop language extinction — but in India, it still has a long way to go.
Disclaimer: The author has been involved with Odia Wikipedia as a volunteer since 2011 and with MTEnglish2Odia since its early stages.