Data to inclusion: Building datasets in African languages
Linguistic diversity increases the use of, and access to, information in today's society. Yet very few languages are currently used in cyberspace, and the dominance of only a few languages online reinforces the digital divide. It is projected that by 2050, Africa will be the continent with the world’s largest population and the most young people. Even so, African languages are the least represented online, and the transition to a digital knowledge society might reduce economic equality if more languages are not included in cyberspace. Ms Dorothy Gordon (Chair of the Information for All Programme, The United Nations Educational, Scientific and Cultural Organization (UNESCO)) expressed her concern for Africa, given that many parents do not speak their native languages with their children, primarily to improve the children's access to the labour market. As a result, these children are exposed to a limited vocabulary in English or another European language, which has proved to be very damaging to their cognitive development. Multilingualism is essential to tackle this.
Africa is a priority for UNESCO. Gordon believes that it is important to support innovative solutions that strengthen access to information through AI systems. Data scientists must ensure that their methodologies are open and available, and the use of multiple languages in cyberspace must increase. The values of inclusion and data openness must guide the digital information society. UNESCO has been working to create multistakeholder networks that involve more people in strengthening multilingualism. These networks might bring together high-tech companies, youth, civil society, linguists, and scientists.
Mr Christian Resch (Junior Advisor Artificial Intelligence, Fair Forward - Artificial Intelligence for All) mentioned that the ‘Open for Good’ alliance will be launched on 25 November. The initiative will bring together partners, including UNESCO and Mozilla, to make data available to the public in different languages. First, the alliance will strengthen the focus on, and commitment to, localising data. In Uganda, for instance, it is necessary to train people who speak local languages to work with data. Second, ‘Open for Good’ will work to increase data openness. Resources must be available to train people to work with data in different languages.
Ms Joyce Nabende (Lecturer, Makerere University, Uganda) addressed the inclusion of African languages in cyberspace. Nabende highlighted that Uganda is a multilingual country with more than forty indigenous languages. However, there is a clear gap caused by the lack of data sets for African languages on the Internet. Demand on the African continent is growing, and projects such as Common Voice, Masakhane, and AI4D are focusing on indigenous languages in Uganda, South Africa, and Ghana. These projects still need natural language processing to make progress in Africa. Using African languages in cyberspace not only helps include millions of people in the digital information society, but will also help address issues such as crop pest and disease surveillance. Luganda is the most widely spoken language in Uganda, but it is still under-represented online. The main limitation for building digital tools in Luganda is the lack of data. Efforts have been made to generate data in Luganda, but they are still in their initial stages. Nabende’s organisation has surveyed students to generate data, resulting in around 5,390 voice utterances for agriculture keywords. It has also transcribed radio recordings, but this data still needs licensing. Last September, Nabende’s organisation, in partnership with Mozilla, launched the initiative ‘Luganda on the Common Voice Platform’, which invites individuals to help teach their devices to speak Luganda.
Ms Kathleen Siminyu (Regional Network Coordinator, AI4D Network Africa) illustrated the imbalance of language diversity in online content. For example, even though Kiswahili is one of the most widely spoken languages in her region, only fifty-six thousand articles written in this language are available on Wikipedia, while the platform has more than six million articles in English. Problems related to language diversity, data collection, and training arise at several points: creators produce monolingual content, mainly in French and English, instead of using one of their native languages; the curators who put data sets together often do not speak local African languages; and neither do the evaluators of those data sets. Unless African languages and native speakers are included in research, data will never be available in local languages.
Mr Roy Boney Jr. (Language Program Manager, Cherokee Nation) provided examples of how his organisation has revitalised the Cherokee language in the United States. The organisation collaborates with companies such as Google and Apple to develop technologies in Cherokee. It also develops online digital content in Cherokee for adults and children.
The session focused on digital inclusion. It emphasised how little support the Internet has provided to marginalised and endangered languages. Moreover, it demonstrated that cyberspace oppresses most of the languages spoken across the world. These language groups lack a digital presence because they are underserved and often suppressed in the digital sphere.