A New Audio Uploading Tool for Crowdsourced Wiktionary Project in Odia Language

A home recording setup for the Kathabhidhana project for Wiktionary. Image via Subhashish Panigrahi from Wikimedia Commons. CC BY-SA 4.0

Wiktionary, Wikipedia's multilingual sister project, promises a great deal. At present, there are not many open-licensed audio recordings that you can hear or download — especially if your mother tongue is not one of the major languages. Wiktionary is already available in multiple languages and in addition to the definitions of the words, many phonetic notations — at least in terms of the International Phonetic Alphabet (IPA) — are available. Now, an Odia-language community project is helping to simplify the process of volunteer contributions to the Odia Wiktionary project.

Kathabhidhana, a community project led by Global Voices contributor and Odia Wikipedian Subhashish Panigrahi, is an open-source solution for recording large batches of words. It then uploads the recordings under open licenses so that they can be useful for projects like Wiktionary.

Odia, one of the state languages in India, is an Indo-Aryan language that is spoken mostly in eastern India by around 40 million native speakers. With over 5,000 years of literary heritage, it has been recognized as one of the oldest South Asian languages, and has been given the status of a classical language by the Indian government.

But because non-Unicode typing systems have long been used for Odia, the language's online presence still lags behind. To address this, Odia Wikipedia incorporates several character encoding converters that turn text typed in various non-Unicode encoding systems into Unicode; it now has more than 12,000 entries. The Odia Wiktionary, for its part, is a free, online and completely crowdsourced dictionary in the Odia language, and it is trying to bridge the gap.
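To give a rough idea of what such a converter does, here is a minimal Python sketch. It is not the code used on Odia Wikipedia. The legacy key assignments in `LEGACY_TO_UNICODE` are invented for illustration, and the function name `legacy_to_unicode` is hypothetical. The sketch assumes a legacy font that stores the vowel sign େ before its consonant, in visual order, while Unicode expects it after the consonant.

```python
# Hypothetical legacy-font table: these ASCII-to-Odia assignments are
# invented for illustration and do not describe any real font.
LEGACY_TO_UNICODE = {
    "K": "\u0B15",  # ORIYA LETTER KA
    "m": "\u0B2E",  # ORIYA LETTER MA
    "a": "\u0B3E",  # ORIYA VOWEL SIGN AA
    "e": "\u0B47",  # ORIYA VOWEL SIGN E
}

# Signs assumed to be typed before the consonant in the legacy encoding.
PREBASE_SIGNS = {"\u0B47"}


def legacy_to_unicode(text):
    """Map legacy codes to Unicode and move pre-base signs after the consonant."""
    mapped = [LEGACY_TO_UNICODE.get(ch, ch) for ch in text]
    out = []
    i = 0
    while i < len(mapped):
        ch = mapped[i]
        if ch in PREBASE_SIGNS and i + 1 < len(mapped):
            # Swap the pair: consonant first, then the vowel sign.
            out.append(mapped[i + 1])
            out.append(ch)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

A real converter would need a complete table for each legacy font and would also have to handle conjuncts and other reordering cases, which this sketch leaves out.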

The project draws its inspiration largely from other open-source software created by Shrinivasan T, who used the Python programming language to automate and simplify the process. He posted a tutorial on YouTube.

Panigrahi was inspired to start the Kathabhidhana project because the existing method was cumbersome: you have to pronounce and record a word, export it in Ogg Vorbis format, and upload it to your account on Wikimedia Commons, the central repository of media files for all Wikimedia projects. Once uploaded, the recording is then added to the Wiktionary project. Apart from manually recording pronunciation, there is also an open-source text-to-speech project called Dhvani that works for most Indian languages.
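The bookkeeping behind a batch version of that workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kathabhidhana's actual code: the function name `plan_recordings` and the `Or-<word>.ogg` file naming pattern are hypothetical, and the sketch only decides which words still need a recording. Capturing audio and uploading it are left out.

```python
from pathlib import Path


def plan_recordings(words, out_dir, lang_code="or"):
    """Return (word, target_path) pairs for words that still need recording.

    The "<Lang>-<word>.ogg" naming pattern is an assumption made for this
    sketch; "or" is the ISO 639-1 code for Odia.
    """
    out_dir = Path(out_dir)
    plan = []
    seen = set()
    for raw in words:
        word = raw.strip()
        # Skip blank lines and words already queued in this run.
        if not word or word in seen:
            continue
        seen.add(word)
        target = out_dir / f"{lang_code.capitalize()}-{word}.ogg"
        # Skip words whose recording already exists on disk.
        if not target.exists():
            plan.append((word, target))
    return plan
```

Feeding the function a word list, for example one word per line read from a text file, gives the volunteer a queue of words to pronounce one after another, with the output file name for each word already decided.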

Having audio recordings of words in Wiktionary helps non-native speakers, as well as people with visual disabilities, listen to the pronunciation of different words. The word library can also be used for several Natural Language Processing projects, like building text-to-speech and speech-to-speech engines.

You can download a copy of Kathabhidhana and find all the audio recordings made using this software.