A New Audio Uploading Tool for Crowdsourced Wiktionary Project in Odia Language

Categories: South Asia, India, Digital Activism, Language, Literature, Technology

A home recording setup for the Kathabhidhana project for Wiktionary. Image via Subhashish Panigrahi from Wikimedia Commons. CC BY-SA 4.0

Wiktionary, Wikipedia's multilingual sister project, promises a great deal. At present, however, there are not many openly licensed audio recordings of words that you can hear or download, especially if your mother tongue is not one of the major languages [2]. Wiktionary is already available in many languages, and in addition to definitions, many entries carry phonetic notation, at least in the International Phonetic Alphabet (IPA). Now, an Odia-language community project is helping to simplify volunteer contributions to the Odia Wiktionary [3] project.

Kathabhidhana, a community project led by Global Voices contributor and Odia Wikipedian Subhashish Panigrahi [4], is an open-source tool for recording large batches of words. It then uploads the recordings under open licenses so that they can be used in projects like Wiktionary.

Odia [5], one of the state languages in India, is an Indo-Aryan language that is spoken mostly in eastern India by around 40 million native speakers. With over 5,000 years [6] of literary heritage, it has been recognized as one of the oldest South Asian languages, and has been given the status of a classical language [7] by the Indian government.

But because of the widespread use of non-Unicode typing systems, the language's online presence still lags behind. To address this, several character encoding converters [8], which convert text typed in various non-Unicode encoding systems to Unicode, have been incorporated into Odia Wikipedia [9]; the encyclopedia now has more than 12,000 entries. The Odia Wiktionary, on the other hand, as a free, online and completely crowdsourced dictionary in the Odia language, is trying to bridge the gap.

The project draws its inspiration largely from other open-source software [10] created by Shrinivasan T [11], who used the Python programming language to automate and simplify the process. He posted a tutorial on YouTube.

Panigrahi was inspired to start the Kathabhidhana project because the existing method [12] was cumbersome: you have to pronounce and record a word, export it in Ogg Vorbis format, and then upload it through your account to Wikimedia Commons, the central repository of media files for all Wikimedia projects. Once uploaded, the recording is added to the Wiktionary project. Apart from manually recording pronunciation, there is also an open-source text-to-speech project called Dhvani [13] that works for most Indian languages.
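To make the idea of batching concrete, here is a minimal Python sketch of the bookkeeping that the manual workflow described above requires for every word. It is not Kathabhidhana's actual code. The `UploadPlan` class, the `plan_uploads` function, the `Or-<word>.ogg` file naming pattern and the description text are all assumptions made for illustration, and the sketch deliberately stops short of recording audio or contacting Wikimedia Commons.

```python
from dataclasses import dataclass


@dataclass
class UploadPlan:
    """Metadata for one pronunciation file (hypothetical structure)."""
    word: str
    filename: str     # assumed naming pattern: "<Lang>-<word>.ogg"
    description: str  # assumed text for the file description page


def plan_uploads(words, lang_code="or", license_tag="cc-by-sa-4.0"):
    """Build one plan per unique, non-empty word, preserving input order."""
    plans = []
    seen = set()
    for raw in words:
        word = raw.strip()
        # Skip blank lines and words that were already planned.
        if not word or word in seen:
            continue
        seen.add(word)
        description = (
            f"Pronunciation of the word '{word}' ({lang_code}).\n"
            + "{{" + license_tag + "}}"
        )
        plans.append(
            UploadPlan(
                word=word,
                filename=f"{lang_code.capitalize()}-{word}.ogg",
                description=description,
            )
        )
    return plans


if __name__ == "__main__":
    # A short word list with a duplicate and a blank entry.
    for plan in plan_uploads(["ଘର", "ପାଣି", "ଘର", "  "]):
        print(plan.filename)
```

Doing this by hand for each word is the repetitive part of the old method. A tool that prepares the file names and descriptions for a whole word list leaves the volunteer with only the recording itself.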

Having audio recordings of words in Wiktionary helps non-native speakers, as well as people with visual disabilities, listen to the pronunciation of different words. The word library can also be used for several Natural Language Processing [14] projects, like building text-to-speech [15] and speech-to-speech [16] engines.

You can download a copy [17] of Kathabhidhana and find all the audio recordings [18] made using this software.