WS #254 The Human Rights Impact of Underrepresented Languages in AI
Session at a Glance
Summary
This panel discussion focused on the impact of underrepresented languages in AI, particularly in large language models. The speakers highlighted how the dominance of English and Western languages in AI training data leads to bias and exclusion of other languages and cultures. They discussed how this affects human rights, socioeconomic opportunities, and cultural preservation for speakers of underrepresented languages.
Key issues raised included the poor performance of AI models in non-dominant languages, the risk of further marginalizing minority languages, and the ethical and legal implications of using AI trained on limited language data for critical applications like asylum processing. The speakers emphasized the need for more diverse, high-quality datasets and greater transparency in AI development.
Legal and policy solutions were explored, including copyright law adaptations, personality rights considerations, and international cooperation for knowledge sharing and capacity building. The panelists noted the challenges of creating universal data platforms due to commercial interests but highlighted some promising initiatives for language inclusion in AI.
The discussion also touched on the role of governments in supporting local language AI development and the complex interplay between education systems, economic incentives, and language preservation efforts. Overall, the panelists stressed the importance of inclusivity in AI as a human rights issue and called for more holistic approaches to address language representation in AI technologies.
Keypoints
Major discussion points:
– The impact of underrepresentation of languages in AI from human rights and socioeconomic perspectives
– The role of legal and ethical frameworks in enhancing AI inclusivity and language-based inclusion
– Copyright and intellectual property issues related to AI datasets and language inclusion
– Challenges and potential solutions for creating more diverse and inclusive AI on an international level
– Government support and incentives for developing AI in local languages
Overall purpose:
The discussion aimed to explore the challenges of language underrepresentation in AI systems and datasets, and to consider potential solutions for creating more linguistically diverse and inclusive AI technologies on both national and international levels.
Tone:
The tone of the discussion was largely analytical and informative, with speakers providing in-depth explanations of complex issues. There was also an undercurrent of concern about the societal impacts of language exclusion in AI. The tone remained consistent throughout, maintaining a balance between highlighting problems and proposing potential solutions.
Speakers
– Moderator: Luis Dehnert, fellow with International Digital Policy at the German Ministry for Digital and Transport
– Nidhi Singh: Project manager at the Center for Communication Governance in the National Law University Delhi, India. Works in information technology law and policy, AI governance and ethics.
– Gustavo Fonseca Ribeiro: Lawyer from Brazil, holds a Master’s of Public Policy from Sciences Po. Specialist consultant for AI and digital transformation at UNESCO. Youth ambassador for the Internet Society in 2024.
– Kathleen Scoggin: Online moderator
Additional speakers:
– Audience member: Asked questions about government support for local language initiatives and language requirements in education
Full session report
Language Underrepresentation in AI: Impacts, Challenges, and Potential Solutions
This panel discussion, moderated by Luis Dehnert from the German Ministry for Digital and Transport, explored the critical issue of language underrepresentation in artificial intelligence (AI), particularly in large language models. The speakers, Nidhi Singh from the Center for Communication Governance in India and Gustavo Fonseca Ribeiro, a specialist consultant for AI and digital transformation at UNESCO, provided in-depth insights into the multifaceted challenges and potential solutions surrounding this topic.
Impact of Language Underrepresentation in AI
The panelists agreed that the dominance of English and Western languages in AI training data leads to significant bias and exclusion of other languages and cultures. This underrepresentation has far-reaching consequences:
1. Human Rights and Cultural Preservation: Both speakers emphasized that the exclusion of non-dominant dialects and languages affects cultural rights and threatens cultural identity preservation.
2. Socioeconomic Implications: The panelists concurred that language exclusion in AI exacerbates the digital divide, potentially limiting economic opportunities for speakers of underrepresented languages.
3. Nuanced Exclusion: Nidhi Singh highlighted that even within English, only the most common internet dialect is represented, effectively excluding many English speakers as well.
4. Educational Discrimination: Singh provided a concrete example of how language bias in AI can lead to discrimination in education, noting that non-native English speakers are more likely to be flagged for plagiarism by AI-powered detection systems.
5. Legal Implications: Ribeiro mentioned the use of AI in Afghan asylum cases, highlighting the potential for bias in critical legal decisions.
Legal and Ethical Frameworks for AI Inclusivity
The discussion emphasized the crucial role of legal and ethical frameworks in enhancing AI inclusivity:
1. Detailed Implementation: Singh stressed the need for detailed implementation guidelines beyond broad inclusivity frameworks.
2. Copyright Law Adaptations: Ribeiro discussed the potential for copyright exceptions for data mining to facilitate AI development while protecting intellectual property rights.
3. Transparency and Accountability: Both speakers agreed on the importance of transparency and accountability mechanisms in AI development.
4. Traditional Knowledge Protection: Ribeiro introduced the concept of traditional knowledge protection under international intellectual property law as a potential framework for addressing data rights of underrepresented communities.
5. Personality Rights: Ribeiro highlighted the importance of considering personality rights in AI development and data usage.
International Efforts and Challenges
The panelists explored various international efforts and challenges in building more diverse and inclusive AI:
1. State-Driven Initiatives: Singh highlighted the importance of state-driven initiatives for language inclusion in AI, such as the Karya initiative in India.
2. Knowledge Sharing: Ribeiro emphasized the role of international organizations in facilitating knowledge sharing and capacity building.
3. Accessibility Framing: Singh suggested framing language inclusion as an accessibility issue to leverage existing legal frameworks.
4. Power Balancing: Both speakers stressed the need for balancing power in international policy conversations, particularly between the global north and global majority.
5. EU Data Governance Act: Ribeiro mentioned the Data Governance Act in the European Union as an attempt to create data commons and address challenges in data sharing.
Challenges in Creating Universal Data Platforms
The discussion revealed unexpected consensus on the challenges of creating universal data platforms for AI:
1. Economic Disincentives: Singh pointed out the strong economic incentives against data sharing for private companies.
2. Financial Sustainability: Ribeiro highlighted the tradeoffs between openness and financial sustainability in data sharing initiatives.
3. Data Protection Concerns: Both speakers acknowledged personal data protection as a significant obstacle to universal data platforms.
Role of International Organizations
Ribeiro elaborated on the role of international organizations, particularly UNESCO, in addressing language underrepresentation:
1. Capacity Building: Supporting member states in developing AI strategies and policies.
2. Standard Setting: Developing ethical guidelines and recommendations for AI development.
3. Facilitating Dialogue: Creating platforms for knowledge sharing and discussion among diverse stakeholders.
Government Support and Challenges
In response to an audience question, the panelists discussed government support for local language initiatives:
1. African Initiatives: Ribeiro provided examples of government support for AI development in African countries.
2. Inclusive Decision-Making: Singh emphasized the need for more inclusive decision-making processes in government initiatives.
3. Market Forces: Both speakers acknowledged the significant role of market forces in driving AI development.
4. Educational Challenges: Singh highlighted the complex interplay between education systems, economic incentives, and language preservation efforts, noting the difficulty in balancing English proficiency with local language preservation.
Conclusion
The panel discussion underscored the critical importance of addressing language underrepresentation in AI as both a human rights issue and a key factor in equitable technological development. The speakers emphasized the need for holistic approaches that combine legal and ethical frameworks, international cooperation, government support, and innovative technical solutions. While significant challenges remain, particularly in balancing economic incentives with inclusivity goals, the discussion highlighted promising initiatives and potential pathways for creating more linguistically diverse and inclusive AI technologies on both national and international levels.
Session Transcript
Moderator: Hello, can you guys hear me? We will start with the session now, so please go to channel 1, and if you can give me a thumbs up if it’s working, that would be great. Cool. Okay. Thank you for joining our panel on the impact of underrepresented languages in AI. My name is Luis Dehnert, I’ll be moderating today; sorry for the slight delay, we’re trying to make up for that now. I myself am a fellow with International Digital Policy at the German Ministry for Digital and Transport. So obviously we heard about the AI divide already early on today, so the issue of representation and diversity is a critical subject for inclusivity in the digital age. AI is a technology that is increasingly attempting to model our reality based on training data, but data often fails to capture that reality. This is especially the case for languages. For example, we have the commonly used so-called Common Crawl, a dataset that is made of nearly everything on the internet and which is often used to train large language models, yet nearly half of the data in it is in English, and it leaves out more than 8,000 languages documented by UNESCO worldwide. So we are going to discuss this topic today. We are joined, as far as I know, by two speakers, whom I will now let introduce themselves, starting with Nidhi Singh. So I give the floor to you to introduce yourself, please.
Nidhi Singh: Thank you so much for inviting me here today. My name is Nidhi Singh. I am a project manager at the Center for Communication Governance in the National Law University Delhi in India. I work primarily in information technology, law and policy and for about the last five years I’ve been working in AI governance and AI ethics, focusing on a global majority approach to how AI is being developed, how norms are being formed, and how it’s being regulated and governed at the international stage.
Moderator: Thank you so much. Now I would like to give the floor to Gustavo Ribeiro. I think he has joined us online. Could you please introduce yourself?
Gustavo Fonseca Ribeiro: Hello, good afternoon to everyone in Riyadh. I apologize because I think my camera is malfunctioning. We were trying to fix it before we started, but I was not able. I will introduce myself and then I’ll try to leave and rejoin so I can see if the camera is working. But thank you all for joining, really appreciate your presence here. So my name is Gustavo Fonseca Ribeiro. I’m a lawyer from Brazil. I hold a Master’s of Public Policy from Sciences Po, a university based in France. And I’m also a specialist consultant for AI and digital transformation at UNESCO. Here at the Global IGF, I’m speaking in my capacity as one of the youth ambassadors for the Internet Society in the year 2024. So I’m very happy to join this meeting. Yes. Thank you, Luis.
Moderator: Thank you so much, Gustavo, for joining. Yes, you can try to rejoin with video. We’d love that. But Nidhi, maybe I start with you with the first question. So can you tell us a bit… What are the impacts of under-representation of languages in AI from a human rights and also socioeconomic perspective?
Nidhi Singh: Yeah, thank you so much for the question. I think we broadly said this in the introduction as well, but there are a lot of concerns around bias and inclusivity that come in. Before we talk about that, I just want to talk about how, when we look at what languages AI models are being trained on, it’s not even really just that the resource-heavy languages like English are being adopted. Only specific dialects of English are being adopted. So it’s not even like the English that we all speak, especially for non-native speakers, is the language that’s going into the model. We’re not part of the majority in either case. Even if you do speak English, it’s not your dialect of English that goes in. Even within that, it’s only the version of English that’s most commonly present on the internet that’s being trained on. So in a sense, everybody’s sort of being excluded. When you look at what happens from this, there are a couple of use cases I wanted to bring up before we get into deeper discussion. There are very real-world consequences of this. As generative AI has come up, universities have started using models to check if generative AI is being used, if students are cheating, if they’re using generative AI to turn in their homework or to write their papers. As a non-native speaker of English, even if you speak English with a high degree of proficiency, you are far more likely to be flagged for plagiarism, because the way these tools are developed, they’re actually developed only for native speakers of a certain dialect of English. So that’s one thing. The other is AI-driven translation software, which is increasingly being used in welfare delivery by the state as well. It is also not working well for low-resource languages. So that’s another concern.
So what happens is that as part of the internet which does not speak the majority language or the majority dialect, you are already not part of the majority. And as this digital divide increases, language becomes a further barrier. If generative AI also just generates in the predominant dialect, there’s a very decent chance that in a few years the internet will only be filled with this one dialect of English and a few resource-heavy languages, and all of the other languages will increasingly be removed from the internet. Another thing you can see is that more and more of the content now coming up on the internet is generative AI content. So as that gets collected for more training, it’ll just be one language that’s sort of getting repeated. And your native dialects, the way that you speak, and your cultural identity on the internet will slowly be lost. So I think there are a lot of implications of what happens when generative AI models, typically ones that are actually focusing on prompt-based answers, have such a big problem with how they’ve been trained in terms of language.
Moderator: OK, thank you very much for the answer. I would like to pose the same question to Gustavo. Could you please share your view on that?
Gustavo Fonseca Ribeiro: Luis, can you repeat the question, please? Because I was resetting my camera while you asked it.
Moderator: I was asking, what are the impacts of under-representation of languages in AI from a human rights and also a socioeconomic perspective?
Gustavo Fonseca Ribeiro: Of course. That is quite interesting. So when you think of languages in artificial intelligence, the first thought that comes to mind in terms of human rights is cultural rights. If we look at the international human rights covenants, particularly the International Covenant on Economic, Social and Cultural Rights, there are, broadly speaking, three cultural rights protected under international human rights law. The first one is the right of access to culture. The second one is the right of a society or people to guide, to steer, their own scientific progress. And the third one relates to intellectual property. So in terms of human rights, to understand the impact of underrepresentation of languages, we have to understand that these technologies, artificial intelligence and the datasets that are fueling it, are being, as you mentioned, primarily developed in a Western setting, let’s say that way, for instance in the United States, or with European languages, with the exception perhaps of China in Asia. So when these tools are translated into other contexts, contexts that speak different languages, they’re not going to perform as well. And this does affect those communities and the rights that I have just mentioned: it’s going to affect how the scientific community explores this new technology, and it’s also going to affect how everyday users of artificial intelligence relate culturally to the outputs of the technology. And in terms of socioeconomic benefits, I would say that you can think of this in two ways, through supply and demand. In terms of demand for AI technology, I think this usually happens because it can bring a lot of productivity. But again, if there’s underrepresentation of a language, the people that speak that language are not going to reap the same benefits.
If you look at the major language models out there, such as ChatGPT, they perform very well in English, but they perform very poorly, for example, in African languages. So this is one socioeconomic benefit that is not going to be reaped. On the side of supply, though, we can think of opportunities, because if there is demand from local communities, there is also an opportunity for local companies to come up. And we do have some examples of this in Africa. For example, in Ghana, you have Ghana NLP, which is a startup producing language models in Ghanaian languages; by the way, there are over 50 languages in Ghana alone. In South Africa, there is Lelapa AI. And another example is the Masakhane Foundation, which is a pan-African organization also working on advancing language inclusion. So I would say those are the main impacts of it. Thank you.
Moderator: Thank you, Gustavo. So I want to turn now to another aspect of this. Nidhi, what is the role of legal or ethical frameworks in enhancing AI inclusivity? How can this further language-based inclusion, in your opinion? Thank you.
Nidhi Singh: So I think when it comes to AI governance, there are of course now some legal instruments that are coming up. Even without the legal instruments, countries have their own constitutional protections and human rights protections that will still apply. But primarily a lot of AI deployment is still being governed through the use of AI ethics. These are more broad-based frameworks around which you can have AI deployment, and inclusivity is a key principle in almost all of them. So the UNESCO AI ethics recommendation, even the OECD ones, all of them have something on inclusivity. Now, how that’s to be implemented is actually a really interesting question. Because when you look at something like large language models and training, inclusivity just dictates that you should make the model available in all of the languages and you should have training in all of the languages. But that’s not very helpful, because you actually need very high-quality datasets in order to train the models, and that means it would require a significant amount of time and investment to get those models. Just to give you an example, if you try to use ChatGPT in any of the Indian languages, maybe not the bigger ones, but some of the smaller ones like Assamese or Kannada, which is something that we tried to do for fun, at some point it will start repeating Bollywood dialogues to you, because they ran out of things to train it on, and the easiest thing they could find was Bollywood movies, so they started training large language models on that. So you’ve met the quota, basically; you’ve checkmarked that it is inclusive, but that doesn’t make any sense, because the model doesn’t actually work. So AI ethics makes a good framework for inclusivity, but to implement that framework, you need to have a lot more detail attached there.
Another case study that I want to talk about, something that has very real legal concerns, is that in the US, four in ten Afghan asylum cases were affected by an AI-driven translation algorithm. A generative AI-based translation algorithm was used, and they didn’t put a lot of effort into how it translated Afghan languages into English, and because of that, asylum applications were being derailed. So this is a legal consideration: if you are going to be using generative AI in such specific but also important and critical aspects of your public welfare, like if you’re using it in healthcare, for asylum, for security, for any sort of public welfare benefits, ideally you want to make sure that it works really well across all languages. This is especially true in countries in the global majority, which generally have a large diversity of languages. Like Gustavo said, these models are typically made and trained in the global north, where they don’t have so much trouble with such drastically different languages and so many dialects. So what will happen in these countries is that the people who know English, and also the dominant dialect of English, will probably be able to access these tools, and the people who don’t will not be able to access them. That further creates a larger digital divide. Even in English-speaking countries, the models really only recognize the dominant dialect, and all of the other dialects, typically used by immigrant communities in these countries, don’t get recognized as well. Which means that when you look specifically at asking questions of ChatGPT, there is a whole field now of prompt engineering, where if you phrase the questions just right, you’ll get a much better answer.
That is, again, something that’s very dependent on you knowing the dominant language and being able to understand how to phrase things very specifically. So a large part of LLMs are designed to give you answers in the dominant language only if you ask them properly in the dominant language. So the setting does kind of work against you. And it requires, at the very fundamental level, a lot of change and investment and effort. That might sound like it’s a private concern, but it’s really not, considering that you’re using generative AI to check for cheating and, based on that, potentially ending somebody’s education career or branding them as somebody who’s plagiarized. These are things that should have an actual framework with a much higher threshold. This isn’t about using ChatGPT or Google for jokes, or to check the weather, or something like that. These are things that have very real-world consequences. So ideally, when you’re deploying AI in these contexts, especially generative AI, which is so dependent on language and culture, it needs to have additional safeguards in place. As of right now, I think people are really only looking at technical solutions to these problems. I don’t think social and legal problems can really be solved only by technical solutions. So there needs to be a more holistic approach, but there does need to be some framework in place.
Moderator: If I can get back on this, you said, of course, legal and governance solutions should be at the forefront. But can I ask you, maybe from a technical perspective, because with many languages you also have the problem of data availability, do you think technical solutions such as synthetic data generation could be a potential solution to address this?
Nidhi Singh: So I think synthetic data generation is something that’s gotten a lot of traction over the last couple of years. And that’s gone one step further, where now you have super-synthetic data, data that is synthesized from synthetic data. I think all of those technical solutions work well within certain areas where they’ve been researched. At the end of the day, though, if your base dataset isn’t of high quality, and this is why we talk about resource-intensive or resource-rich languages, if you don’t have the base dataset built up and you try to generate synthetic data out of it, maybe the technology will catch up, but right now it just sort of exacerbates the problems. Another problem is that languages don’t typically translate directly, because the way you write the languages and the way concepts are explained in Asian languages versus European languages does tend to differ. In a lot of these LLMs, the model has some basic idea of what the words mean, and so it’ll just try to directly translate. That doesn’t really work. And also, if you’re using synthetic data and your base data wasn’t very good quality, you’ll just end up with a lot of synthetic data which is replicating the same problems, and that could cause further problems down the line. This is something for which you’d need a lot of impact assessment. You’d need a lot of transparency in the model to figure this out, which is, of course, something that we’re struggling with right now. Because the problem with an LLM is that you need a large amount of data; if you sat down and tried to individually check every single piece of data, we’d never be able to build an LLM. So yes, you’d need technical solutions, and synthetic data could be one of them, but there’s no way to be sure that it won’t just exacerbate the problem right now.
So unless there’s some way of figuring out, OK, unless we’re sure of how this works, let’s at least not use it in our justice system, in our welfare delivery system, in our asylum system. Until you have that at least pause put in there, I think that’s just going to keep worsening the problem.
Moderator: Thank you for the well-thought response. Gustavo, I’d like to get back to you. So from a legal perspective, what do you think could be other levers, so to speak, to enable more diverse datasets in AI, maybe also perhaps thinking about copyright?
Gustavo Fonseca Ribeiro: So yeah, thank you for the question. One of the things I mentioned earlier was that one of the three international human rights affected is intellectual property. I thought this was going to become a longer conversation, and that’s why I didn’t mention it in the first answer. So yes, I think one key area of law that calibrates access to data, not only language but all types of data, is copyright law. But before going there, what is copyright, so everybody’s on the same page? Copyright is basically how we assign intellectual property rights to authors. And this applies here as well. When you build a dataset, that is something you can protect with copyright; then you have an exclusive use of it, and only you can license it. That also applies to the source material that is used to create the data, the creative works. So whenever we’re using, for instance, newspapers, or, like Nidhi said, Bollywood movies, those are protected by copyright. So the way we handle the copyright governance of these two types of resources, the datasets and the raw materials for the datasets, is going to affect access to data. Right now we do have an opportunity for expansion, for innovation, but our copyright laws are not necessarily yet adapted to it. We’ve seen a handful of lawsuits in the United States, for example between OpenAI and the New York Times, because OpenAI used the New York Times without authorization. And sorry, there’s some noise around. Yeah. So this is the context, this is the problem. But there are some solutions already out there in terms of copyright. For example, if you look at the European Union’s AI Act, they do have an exception, what they call the data mining exception.
So originally it would not be permissible for an AI company to mine data from these raw materials, from source materials like Bollywood movies or from the internet, without authorization from the copyright owner. But in the AI Act, there’s an exception to that if you’re doing it for non-commercial purposes. So that’s one way to allow for the progress of science, but still limit the shared benefit of resources, right? Second, a development that we might see soon is in the US. They have an exception to copyright called the fair use doctrine, which is not so binary; it is not so certain how it applies, as it is decided on a case-by-case basis, based on a set of criteria, such as whether the copyrighted material is being used for educational purposes, or non-commercial purposes, or for research, for example. And from some of these lawsuits that we see in the US, we might see something like that coming out, but we have to wait and see. Another area, and in my opinion this is underexplored, I would love to see a larger discussion on it, is the concept under international intellectual property law of traditional knowledge. It protects traditional knowledge in the sense of knowledge that has been passed from generation to generation, for example in traditional and indigenous communities. In the late 90s and early 2000s, we saw a lot of debate on that when it came to healthcare, to medicines and traditional medicines, because you’d see a lot of big companies using these traditional medicines, which were invented by communities, and these communities were not benefiting from it. And we have yet to see this debate in the context of artificial intelligence and data. Data has become, per se, the new oil.
But we haven’t really seen any stakeholders talking very strongly about how this idea of traditional knowledge relates to this new, very valuable resource. And just to conclude, another challenge that we have, which is not copyright law but is associated with it, is personality rights. A personality right is, for example, the right to your own likeness, your voice, your image, your face; other people can only use these if you have given consent. And this is not an economic right, it’s a moral right, attached to personality. It belongs to you because you’re human, not because you own any property, which is, generally speaking, what happens with copyright. What that means is that it cannot be easily given away or sold in a market, for example. Right now, we’ve actually seen some decisions by courts; for example, in India, the High Court of Bombay has found that when an AI replicates the voice of someone who exists, that is a violation of personality rights. But what we don’t know is whether taking the voice of someone and using it to train an AI is in itself a violation, if the AI is trained with it but doesn’t necessarily replicate it, right? That is an area that we don’t know. And in deciding on that, the law would also require some adaptation, to either allow for the expansion of data or narrow it. Maybe with that I conclude my remarks. Thank you.
Moderator: Very insightful, thank you very much. I would like to pose one last question to both of you. In my opinion, we can observe that this problem of limited language inclusion in AI leads to efforts on a national or regional level, where countries and regions build their local datasets and models. Taking more of an international perspective, what steps can we take together, within international organizations or beyond, to build more diverse AI? Nidhi, maybe you want to start?
Nidhi Singh: Thank you. And you're right, actually; a lot of the efforts for something like this come domestically. Partly because that's a better place to start, but also because states and governments have a far better incentive. You don't make AI accessible because it's financially lucrative; it may not be, depending on the community. And if it's not financially lucrative, then why would somebody like OpenAI be doing this? So it's usually up to states to make this happen. Just to give an example, the large language models that have been run in regional languages in India, like Jugalbandi, were also done in partnership with the state. So I think it's important to stress that this isn't a favor that's being done. Inclusivity is an essential part of access: access to the internet is, in fact, now considered an international human right. So if you are launching something like an LLM, which is fast replacing large parts of the internet, you are required to make it accessible to everybody. And like Gustavo said, if you are using public data, there is an implied expectation that you will use it for the public good. It's not "oh, you can also do public good with it"; no, you are, in fact, training it on everybody's data, so you are required to do good with it. So I think it would be good, in international organizations, to bring this under the heading of accessibility and inclusivity, and in that sense apply the same protections to it that we apply when we're trying to spread the internet to everybody and bring everybody online. And to some extent, even for private players, depending on how an initiative is structured, there might need to be requirements to include at least some percentage of underrepresented languages in their training datasets.
But all of this will really only be possible once you have transparency and accountability mechanisms in place, because we don't actually know what data is being collected or exactly how these models are trained. So unless you have very solid transparency and accountability mechanisms to see what they're using to train all of these models, I think it will be hard to really push anything through. It's a very interconnected problem: you want inclusivity, and in order to have that, you need accountability and transparency. So yeah, I think once you get from ethical AI principles to actually implementing them, I imagine a lot of these things will get sorted.
Moderator: Gustavo, would you like to add to that?
Gustavo Fonseca Ribeiro: Yeah, of course. So what can we do at the international level, and in international organizations in particular? Three things come to mind. The first is the sharing of knowledge and strategies to enhance language inclusion. Countries can learn from one another, right? For example, in India there's a great initiative called Karya, a nonprofit with a platform that allows data workers and data annotators to work on providing information and curating datasets. And once a dataset is sold by Karya, the revenue associated with it is distributed to the workers. So here we have a case where, first, we get increased inclusion of Indian languages, and second, we manage to direct the profits to a socially beneficial purpose: helping data workers, who often don't have the best working conditions in the value chain of artificial intelligence. And this platform, for example, is open source. So international organizations do have the ability to bring knowledge that exists in India, for example, to other countries in similar situations, for example in Africa, or with indigenous languages in Latin America. The second, I would say, is capacity building. Whatever we do with artificial intelligence, we need data, technology, and models. We often need governance as well, like laws that enable its development. But we always need human talent. Even in the example I gave, the Karya platform doesn't exist by itself, right? It needs trained people around it. So I think international organizations can work on capacity building; it's actually one of the priorities outlined in the Global Digital Compact when it comes to artificial intelligence. And the third, I would say, is bringing a better balance of power to conversations on international policy.
When we're speaking of digital divides in particular, there is a clear imbalance of power, and of the ability to shape this conversation, for example between the Global North and the global majority. Wealthier countries have more resources to participate in this conversation compared to low-income countries. So international organizations also have the ability to provide a forum in which these different actors can talk face-to-face. I would say these are three potential roles. Thank you.
Moderator: Thank you. With that, I would like to open up this session to questions, either from the in-person audience or from online. I think we have an online moderator, so please let us know if there are any questions online. And please also state to whom you want to direct your question. We have a mic right here, so if you have a question in person, please step forward. Any questions from the online audience?
Kathleen Scoggin: We don’t have any as of now, but if there aren’t any, I have one to throw out to the group. There’s been lots of talk about the way that we use platforms to gather this data. Either from a technical sense or from more of just a user interface sense, are there things that you all think would be beneficial to creating a universal platform for data collection? There have been lots of different ones. How would you see that going? Thanks.
Nidhi Singh: Should I start? A universal platform for data collection is a very interesting idea. I've also heard a lot of conversations about AI commons and data commons, and the principle behind it is quite sound: basically, you put all the data in one place so that everybody can benefit from it. In principle, that sounds good. I do think, however, and this might be a bit pessimistic of me, that realistically data is very much the new oil. It's very unlikely that anybody who has a lot of data would want to share it in a way where other people can profit from it as well. Well-labeled data, good clean data that you can use for training these systems, is currently one of the biggest resources we have. There are now economies built around this kind of data, and data brokers. So I do think that actually putting that into place would be difficult. It's also very interesting, and this is something I thought of while Gustavo was speaking: we've talked a lot about copyright, but it's quite impressive how much acceptance things like large language models have in our society right now. When something trawls the internet to collect information to train a large language model, there is a very good chance that it will end up catching a lot of personal data as well. And personal data protection used to be very stringent about what you can train models on. Now we're seeing an increasing trend where, because of the lucrative promise of LLMs, countries are just saying that if you posted it on the internet and a large language model caught it, then I guess that's just fine, as long as the output isn't directly harming your privacy. Unlike the Bombay High Court case, where someone's voice was directly being used, a lot of these protections are diluted when it comes to training the LLM itself.
In this era of people prioritizing the economic progress, or the economic incentives, that you can potentially get from generative AI, I think it's very unlikely that people would agree to pool all of their data in a uniform platform. I hope that would be the case, but I do think it's unlikely that people would agree to it.
Gustavo Fonseca Ribeiro: Yeah, if I may jump in on the question as well. Moderator: Yes, sure, please. Gustavo Fonseca Ribeiro: Thanks. I find it a very interesting proposal to think about data commons. I will first speak of a challenge, and then of someone who is trying to do this. The biggest challenge I see is how bringing data to the open-source world, opening data up in a technology market that is so highly competitive, affects comparative advantages, that is, market advantages. It is true that openness, intuitively speaking, leads to more access to data, and you often see this advocacy, also a lot in the development context: open the data so everybody can have access to it. But openness can also come with a trade-off in financial sustainability. If you open the data without any restrictions whatsoever, it's not that easy to derive revenue from it. And in developing contexts, the socioeconomic security of companies and the people working in them is a very big priority. But more importantly, even if you manage to get people on board with this idea, you have to get either everyone or no one into the data commons. From the moment a certain number of companies join the commons and share their data for free, the companies that have stayed private and not opened their data, and that are already big, like a big tech company, will have a financial advantage over whoever opened theirs. And because their model will be more profitable, it will also grow more, which could actually crowd out the common data pool. That is the challenge I see. It works a lot like a negative externality: you either get everyone on board, or it's hard to implement.
But the second point: there is an attempt to implement that kind of thinking in the European Union. I wish I were the type of lawyer who is an expert in this; I am not. But there is a regulation at the European Union level, the Data Governance Act, coupled with the Data Act, in which they try to create pan-European pools of datasets in certain fields that are profitable, like agriculture, healthcare, and mobility data. And the way they're trying to build that is by creating something like a compulsory licensing arrangement: the public sector can buy datasets that are of public value from private entities, and the private entities are mandated to sell them for a reasonable price. That's literally what the law says, "a reasonable price." So it's somewhat of an attempt to do that while still incorporating the cost structure of developing datasets. Whether it works or not, well, we'll see. Thank you.
Moderator: Thank you. So I think we have two questions from the audience. Maybe we start with the lady in the front. Would you come to the microphone, please? And then we'll quickly check whether it's on.
Audience: Can you hear me? Okay. Thank you so much for the great session and presentations. I guess my question goes to both of you. In your recommendations, you spoke about the incentive coming domestically from government, and Gustavo also mentioned some of the large language model initiatives in Africa; I think you mentioned Lelapa AI in South Africa and Masakhane. So my first question is: do you see a lot of appetite, especially from governments, to actually support these initiatives to develop our own local languages and have them included in these models? And the second is something I've noticed from a very practical perspective. I'm originally from Zimbabwe, and for a learner to graduate and go to university, you must have passed mathematics and English; not even our local language is required. So, from an incentive perspective: why would I, as a software developer, invest in local languages when they are not useful to the students and learners who are supposed to be using these technologies? You need English to go to university, your products are expected to be in English, and you are using AI products in English. So I don't know how you see it. Maybe you have practical examples from your regions where there is direct financial support from government for these initiatives. And with the school system, are there going to be changes where we see more and more local languages actually being integrated, so that to move to the next level you must have passed at least a local language, not necessarily just English? Those are my two questions, thank you.
Nidhi Singh: …generative AI, and then eventually get to building translation software. As for the second question, that's a very complicated question which maybe doesn't have a legal answer; it's more of a sociological one. I know that some states in my country have a three-language model, where you must study three languages in school, just because of how it works. I think this is a general problem that many countries with multiple languages have: the internet infrastructure, and now increasingly AI, is all really built on one common thing, which happened to be a specific dialect of English. Until you have a lot more inclusivity in the rooms where these decisions are being made, that is unlikely to change. But that is something we're trying to do through conversations like this: to say that this actually doesn't make any sense. If you have a chatbot that only recognizes English, and only a specific way of accessing a service, then that's not an accessible service; you need more human intervention, and all of these things. But yeah, I think that's a more sociological problem: the world generally takes a majority perspective on things. It will really only get fixed when you have more voices in the room talking about what the experience is like.
Moderator: Okay, with the time in mind, we have one minute roughly left. Gustavo, would you want to add to that quickly?
Gustavo Fonseca Ribeiro: I think Nidhi's answer was very good, so very quickly: on government support, yes, you can see examples. In Rwanda, the government has been quite supportive of developing datasets in Kinyarwanda, in partnership with academia and startups. And in Nigeria, you can also see the government supporting the development of a large language model in Nigerian languages. As for the second question, on education, I would say Nidhi's answer was spot on. I would add that the market is a powerful tool for driving the development of AI, and in many countries that speak many, many languages, this market is in English; that is true in Kenya, for example, and in Uganda as well. But you also have other purposes, right? Public services, for example, citizens' access to welfare, and research, which doesn't necessarily have to have a commercial purpose. In that context, I would say local languages are quite relevant. I hope I'm touching on the question. There was also a question in the chat about the opportunities and risks of localization, which goes beyond language: contextualizing the model to the local culture as well. Very quickly, on the opportunities: first, it's useful; people have a demand for solutions in their local language, and those tools work better for them. And second, cultural preservation, I think, is an opportunity. On the risks, I would point to the risks associated with AI at large: even if you're building in a local language, with the local culture embedded, you still have privacy risks, you still have bias risks. Thank you. I will give back the floor.
Moderator: I saw that there was at least one in-person question left. If Nidhi stays here, maybe you can ask it after the session. I would like to thank our speakers for the very interesting insights, and the audience for the good questions. I wish you more good sessions today, and enjoy the IGF. Thank you.
Nidhi Singh
Speech speed
180 words per minute
Speech length
3083 words
Speech time
1022 seconds
Exclusion of non-dominant dialects and languages
Explanation
AI models are primarily trained on specific dialects of English, excluding other languages and dialects. This leads to a lack of representation for non-native speakers and minority languages in AI systems.
Evidence
Example of universities using AI models to check for plagiarism, which are more likely to flag non-native English speakers.
Major Discussion Point
Impact of underrepresentation of languages in AI
Agreed with
Gustavo Fonseca Ribeiro
Agreed on
Underrepresentation of languages in AI has significant impacts
Exacerbates digital divide and loss of cultural identity
Explanation
The focus on dominant languages in AI systems widens the digital divide. It may lead to the loss of cultural identity as minority languages are increasingly removed from the internet and AI-generated content.
Evidence
Prediction that the internet may be filled with only one dialect of English in the future, crowding out other languages and dialects.
Major Discussion Point
Impact of underrepresentation of languages in AI
Need for detailed implementation of inclusivity frameworks
Explanation
While inclusivity is a key framework in AI ethics, its implementation requires more detailed guidelines. Simply making AI available in all languages is not sufficient without high-quality datasets for training.
Evidence
Example of ChatGPT repeating Bollywood dialogues when used in smaller Indian languages due to lack of proper training data.
Major Discussion Point
Legal and ethical frameworks for AI inclusivity
Agreed with
Gustavo Fonseca Ribeiro
Agreed on
Need for legal and ethical frameworks to enhance AI inclusivity
Importance of transparency and accountability mechanisms
Explanation
Implementing inclusivity in AI requires strong transparency and accountability mechanisms. Without knowing how data is collected and models are trained, it’s difficult to push for meaningful inclusivity.
Major Discussion Point
Legal and ethical frameworks for AI inclusivity
State-driven initiatives for language inclusion
Explanation
Efforts for language inclusion in AI often come from domestic or state-driven initiatives. This is because states have better incentives to make AI accessible in local languages, even when it’s not financially lucrative.
Evidence
Example of Jugalbandi, a large language model for regional languages in India, developed in partnership with the state.
Major Discussion Point
International efforts to build more diverse AI
Agreed with
Gustavo Fonseca Ribeiro
Agreed on
Government support is crucial for local language AI development
Framing language inclusion as an accessibility issue
Explanation
Language inclusion in AI should be framed as an accessibility issue, similar to internet access. This approach would apply the same protections and requirements to AI as those used to spread internet access globally.
Major Discussion Point
International efforts to build more diverse AI
Economic incentives against data sharing
Explanation
The idea of a universal platform for data collection faces challenges due to economic incentives. Data is valuable, and those who possess it are unlikely to share it freely for others to profit from.
Evidence
Comparison of data to ‘the new oil’ and mention of economies built around data and data brokers.
Major Discussion Point
Challenges in creating universal data platforms
Differed with
Gustavo Fonseca Ribeiro
Differed on
Approach to data sharing and universal platforms
Personal data protection concerns
Explanation
The development of large language models raises concerns about personal data protection. There’s an increasing trend of accepting the use of personal data for training AI models, potentially diluting privacy protections.
Major Discussion Point
Challenges in creating universal data platforms
Need for more inclusive decision-making
Explanation
The lack of inclusivity in AI and internet infrastructure stems from a lack of diversity in decision-making processes. More diverse voices are needed to address the challenges of language inclusion in AI.
Major Discussion Point
Government support and incentives for local language AI
Sociological challenges in prioritizing local languages
Explanation
The prioritization of English in education and technology is a complex sociological issue. Changing this requires addressing broader societal norms and practices beyond just technological solutions.
Evidence
Mention of the three-language model in some Indian states as an attempt to address language diversity in education.
Major Discussion Point
Government support and incentives for local language AI
Gustavo Fonseca Ribeiro
Speech speed
142 words per minute
Speech length
2756 words
Speech time
1159 seconds
Affects cultural rights and socioeconomic benefits
Explanation
Underrepresentation of languages in AI impacts cultural rights protected under international law. It also affects the socioeconomic benefits that communities can derive from AI technologies.
Evidence
Reference to the International Covenant on Economic, Social, and Cultural Rights, mentioning three protected cultural rights.
Major Discussion Point
Impact of underrepresentation of languages in AI
Agreed with
Nidhi Singh
Agreed on
Underrepresentation of languages in AI has significant impacts
Creates opportunities for local AI companies
Explanation
The lack of language representation in AI creates opportunities for local companies to develop solutions. This can lead to the emergence of startups focusing on underrepresented languages.
Evidence
Examples of Ghana NLP, Lelapa AI, and Masakhane working on language inclusion in Africa.
Major Discussion Point
Impact of underrepresentation of languages in AI
Copyright law and exceptions for data mining
Explanation
Copyright law plays a crucial role in regulating access to data for AI training. Exceptions to copyright for data mining, such as those in the EU AI Act, can facilitate the development of more inclusive AI systems.
Evidence
Mention of the data mining exception in the EU AI Act for non-commercial purposes.
Major Discussion Point
Legal and ethical frameworks for AI inclusivity
Agreed with
Nidhi Singh
Agreed on
Need for legal and ethical frameworks to enhance AI inclusivity
Potential role of traditional knowledge protection
Explanation
The concept of traditional knowledge protection in international intellectual property law could be applied to AI and data. This could help address issues of data ownership and benefit-sharing for communities.
Evidence
Reference to debates on traditional medicines and healthcare in the late 90s and early 2000s.
Major Discussion Point
Legal and ethical frameworks for AI inclusivity
Knowledge sharing and capacity building by international organizations
Explanation
International organizations can play a role in sharing knowledge and strategies for language inclusion in AI. They can also focus on capacity building to develop the necessary human talent for AI development.
Evidence
Example of the Karya platform in India, which could be shared with other countries facing similar challenges.
Major Discussion Point
International efforts to build more diverse AI
Balancing power in international policy conversations
Explanation
International organizations can help balance power dynamics in discussions about AI and language inclusion. They can provide forums for different actors to engage in face-to-face conversations, particularly between the global north and the global majority.
Major Discussion Point
International efforts to build more diverse AI
Tradeoffs between openness and financial sustainability
Explanation
Creating open data commons for AI faces challenges related to financial sustainability. Companies may be reluctant to open their data due to competitive advantages and the need for revenue streams.
Evidence
Discussion of the European Union’s Data Governance Act and Data Act as attempts to create pan-European pools of datasets while considering cost structures.
Major Discussion Point
Challenges in creating universal data platforms
Differed with
Nidhi Singh
Differed on
Approach to data sharing and universal platforms
Examples of government support in African countries
Explanation
Some African governments are actively supporting the development of AI models in local languages. This demonstrates growing recognition of the importance of language inclusion in AI.
Evidence
Examples of government support for AI language models in Rwanda and Nigeria.
Major Discussion Point
Government support and incentives for local language AI
Agreed with
Nidhi Singh
Agreed on
Government support is crucial for local language AI development
Market forces driving AI development
Explanation
Market forces play a significant role in driving AI development, often favoring dominant languages like English. However, there are other purposes for language inclusion, such as public services and research, which may not have commercial motivations.
Evidence
Examples of English dominance in markets in countries like Kenya and Uganda.
Major Discussion Point
Government support and incentives for local language AI
Agreements
Agreement Points
Underrepresentation of languages in AI has significant impacts
Nidhi Singh
Gustavo Fonseca Ribeiro
Exclusion of non-dominant dialects and languages
Affects cultural rights and socioeconomic benefits
Both speakers agree that the underrepresentation of languages in AI has substantial negative impacts on cultural rights, socioeconomic benefits, and digital inclusion.
Need for legal and ethical frameworks to enhance AI inclusivity
Nidhi Singh
Gustavo Fonseca Ribeiro
Need for detailed implementation of inclusivity frameworks
Copyright law and exceptions for data mining
Both speakers emphasize the importance of developing and implementing legal and ethical frameworks to promote language inclusion in AI, including copyright exceptions and detailed inclusivity guidelines.
Government support is crucial for local language AI development
Nidhi Singh
Gustavo Fonseca Ribeiro
State-driven initiatives for language inclusion
Examples of government support in African countries
Both speakers highlight the importance of government support in developing AI models for local languages, citing examples from India and African countries.
Similar Viewpoints
Both speakers recognize that while the underrepresentation of languages in AI can exacerbate the digital divide, it also creates opportunities for local companies to develop solutions for underrepresented languages.
Nidhi Singh
Gustavo Fonseca Ribeiro
Exacerbates digital divide and loss of cultural identity
Creates opportunities for local AI companies
Both speakers emphasize the need for transparency, accountability, and balanced representation in AI development and policy discussions, particularly between the global north and global majority.
Nidhi Singh
Gustavo Fonseca Ribeiro
Importance of transparency and accountability mechanisms
Balancing power in international policy conversations
Unexpected Consensus
Challenges in creating universal data platforms
Nidhi Singh
Gustavo Fonseca Ribeiro
Economic incentives against data sharing
Tradeoffs between openness and financial sustainability
Both speakers unexpectedly agree on the challenges of creating universal data platforms for AI, citing economic incentives and financial sustainability as major obstacles. This consensus is significant as it highlights the complexity of balancing open data initiatives with commercial interests in AI development.
Overall Assessment
Summary
The speakers show strong agreement on the impacts of language underrepresentation in AI, the need for legal and ethical frameworks, the importance of government support, and the challenges in creating universal data platforms. They also share similar viewpoints on the digital divide, opportunities for local AI companies, and the need for transparency and balanced representation in AI development.
Consensus level
The level of consensus between the speakers is high, with agreement on most major points discussed. This high level of consensus implies a shared understanding of the challenges and potential solutions in addressing language inclusion in AI. It suggests that there is a common ground for developing policies and initiatives to promote more inclusive AI systems across different regions and languages.
Differences
Different Viewpoints
Approach to data sharing and universal platforms
Nidhi Singh
Gustavo Fonseca Ribeiro
Economic incentives against data sharing
Tradeoffs between openness and financial sustainability
While both speakers acknowledge challenges in data sharing, Nidhi Singh emphasizes the economic disincentives for companies to share valuable data, whereas Gustavo Fonseca Ribeiro focuses more on the balance between openness and financial sustainability, citing attempts like the EU’s Data Governance Act.
Unexpected Differences
Overall Assessment
Summary
The main areas of disagreement were subtle and primarily focused on the approach to data sharing and the specifics of legal and regulatory frameworks for AI inclusivity.
Difference level
The level of disagreement between the speakers was relatively low. Both speakers generally agreed on the importance of language inclusion in AI and the need for legal and regulatory frameworks to support it. Their differences were mainly in the emphasis and specific approaches they suggested, rather than fundamental disagreements. This low level of disagreement suggests a general consensus on the importance of the issue and the need for action, which could be beneficial for advancing policies and initiatives in this area.
Partial Agreements
Both speakers agree on the need for legal and regulatory frameworks to enhance AI inclusivity, but they focus on different aspects. Nidhi Singh emphasizes the need for detailed implementation guidelines beyond broad inclusivity frameworks, while Gustavo Fonseca Ribeiro discusses specific legal mechanisms like copyright exceptions for data mining.
Nidhi Singh
Gustavo Fonseca Ribeiro
Need for detailed implementation of inclusivity frameworks
Copyright law and exceptions for data mining
Takeaways
Key Takeaways
Underrepresentation of languages in AI has significant impacts on cultural rights, socioeconomic benefits, and digital divide
Legal and ethical frameworks for AI inclusivity need more detailed implementation guidelines and transparency mechanisms
International efforts are needed to build more diverse AI, including knowledge sharing, capacity building, and balancing power in policy discussions
Creating universal data platforms faces challenges from economic disincentives and data protection concerns
Government support and incentives are crucial for developing AI in local languages, but sociological challenges persist
Resolutions and Action Items
International organizations should facilitate sharing of knowledge and strategies to enhance language inclusion across countries
Capacity building initiatives should be implemented to develop human talent for AI in diverse languages
Efforts should be made to bring better balance of power to conversations on international AI policy
Unresolved Issues
How to effectively implement inclusivity frameworks in AI development
Balancing openness of data with financial sustainability and market competitiveness
Addressing the lack of incentives for developers to invest in local language AI when education systems prioritize dominant languages
How to create universal data platforms that overcome economic and privacy challenges
Suggested Compromises
Using synthetic data generation to address data availability issues for underrepresented languages, while acknowledging potential limitations
Implementing copyright exceptions for non-commercial data mining to allow AI development while protecting intellectual property rights
Creating compulsory licensing arrangements allowing the public sector to buy valuable datasets from private entities at reasonable prices
Thought Provoking Comments
Even if you do speak English, it’s not your dialect of English that goes in. So even within that, it’s only the version of English most commonly present on the internet that’s being trained on. So in a sense, everybody’s sort of being excluded.
Speaker: Nidhi Singh
Reason: This comment highlights the nuanced issue of language bias in AI beyond just non-English languages, pointing out that even English speakers may be excluded if they don’t use the dominant dialect.
Impact: It broadened the discussion from just underrepresented languages to issues of dialect and cultural expression within languages, leading to deeper exploration of inclusivity challenges.
As generative AI has gone up, universities have started using models to check if generative AI is being used, if students are cheating, if they’re using generative AI to turn in their homework or to write their papers. As a non-native speaker of English, even if you speak English with a high degree of proficiency, you are far more likely to be flagged for plagiarism.
Speaker: Nidhi Singh
Reason: This comment provides a concrete, real-world example of how language bias in AI can have serious consequences, especially in education.
Impact: It shifted the conversation from abstract concepts to tangible impacts, prompting discussion on the ethical implications and potential discriminatory effects of AI in various sectors.
There exists this concept under international intellectual property law of traditional knowledge. It protects traditional knowledge in the sense of knowledge that has been passed from generation to generation, for example, in traditional and indigenous communities.
Speaker: Gustavo Fonseca Ribeiro
Reason: This comment introduces a legal concept that could potentially be applied to AI and data rights, particularly for underrepresented communities.
Impact: It opened up a new avenue of discussion on how existing legal frameworks might be adapted or applied to address issues of data ownership and cultural preservation in AI development.
Universal platform for data collection is a very interesting idea. I think that I’ve also heard a lot of conversations about AI Commons and data Commons. And yeah, the principle behind it is quite sound. Because basically what you’re saying is that you put all the data in one place so that everybody can benefit from it.
Speaker: Nidhi Singh
Reason: This comment addresses a potential solution to the problem of language bias in AI, while also acknowledging its challenges.
Impact: It prompted a deeper discussion on the practical and economic challenges of creating inclusive AI systems, balancing idealism with realism.
Overall Assessment
These key comments shaped the discussion by expanding the scope of the conversation from simply underrepresented languages to issues of dialect, cultural expression, and real-world impacts. They introduced legal and ethical considerations, highlighted practical challenges, and prompted exploration of potential solutions. The discussion evolved from identifying problems to considering complex, multifaceted approaches to addressing language bias and inclusivity in AI development.
Follow-up Questions
How can synthetic data generation be used to address the problem of limited data availability for underrepresented languages?
Speaker: Moderator
Explanation: This is important to explore potential technical solutions for increasing language diversity in AI training data.
How can copyright laws be adapted to balance access to data for AI development with protection of intellectual property?
Speaker: Gustavo Fonseca Ribeiro
Explanation: This is crucial for determining how data can be legally used to train AI models while respecting copyright.
How does the concept of traditional knowledge under international intellectual property law apply to AI and data?
Speaker: Gustavo Fonseca Ribeiro
Explanation: This is an underexplored area that could have implications for how indigenous and traditional knowledge is protected and used in AI development.
How can personality rights be applied to AI training data, particularly when an AI is trained on someone’s voice or likeness but doesn’t directly replicate it?
Speaker: Gustavo Fonseca Ribeiro
Explanation: This is an unresolved legal question that has implications for data collection and AI training practices.
What are the benefits and challenges of creating a universal platform for data collection?
Speaker: Kathleen Scoggin (online moderator)
Explanation: This explores potential solutions for improving data diversity and accessibility for AI development.
How much appetite is there from governments to support initiatives developing local language AI models?
Speaker: Audience member
Explanation: This is important for understanding the potential for government-backed efforts to increase language diversity in AI.
How can education systems be changed to incentivize the development and use of AI in local languages?
Speaker: Audience member
Explanation: This addresses the systemic factors that influence language representation in AI development and use.
What are the opportunities and risks of localizing AI models to specific cultures and languages?
Speaker: Audience member (via chat)
Explanation: This explores the broader implications of developing AI models tailored to specific linguistic and cultural contexts.
Disclaimer: This is not an official record of the session. The DiploAI system automatically generates these resources from the audiovisual recording. Resources are presented in their original format, as provided by the AI (e.g. including any spelling mistakes). The accuracy of these resources cannot be guaranteed.
Related event
Internet Governance Forum 2024
15 Dec 2024 06:30h - 19 Dec 2024 13:30h
Riyadh, Saudi Arabia and online