Large Language Models on the Web: Anticipating the challenge | IGF 2023 WS #217

12 Oct 2023 01:30h - 03:00h UTC

Disclaimer: It should be noted that the reporting, analysis and chatbot answers are generated automatically by DiploGPT from the official UN transcripts and, in case of just-in-time reporting, the audiovisual recordings on UN Web TV. The accuracy and completeness of the resources and results can therefore not be guaranteed.

Full session report

Emily Bender

The analysis discussed various aspects of large language models (LLMs) and artificial intelligence (AI). One key point raised was the limitations of scraping web data to train LLMs. Speakers highlighted that current data collection for LLMs is often haphazard and carried out without consent, and argued that such indiscriminate scraping can violate privacy, copyright, and consent. Sacha Costanza-Chock’s concept of consentful technology, which emphasises meaningful opt-in data collection, was presented as a better alternative.

The speakers also stressed that LLMs are not always reliable sources of information. They pointed out that LLMs reflect biases of the Global North due to data imbalance. This uneven representation can lead to skewed outputs and perpetuate existing inequalities. Therefore, there were concerns about incorporating LLMs into search engines, as it could amplify these biases and hinder the dissemination of objective and diverse information.

Another topic of discussion was the risks associated with synthetic media spills. Speakers highlighted that synthetic media can easily spread to other internet sites, raising concerns about disinformation and misinformation. They recommended that synthetic text should be properly marked and tracked in order to enable detection and ensure accountability.

On the positive side, the analysis explored approaches to detect AI-generated content. Speakers acknowledged that once synthetic text is disseminated, it becomes difficult to detect. However, they expressed optimism that watermarking could serve as a potential solution to track AI-generated content and differentiate it from human-generated content.
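To make the watermarking idea concrete, the sketch below illustrates one family of approaches raised in the session: a statistical “green list” watermark, in which the generator is nudged toward a pseudo-random subset of the vocabulary at each step and a detector with the same secret rule tests whether that subset is over-represented. This is a minimal, assumption-laden illustration (toy vocabulary, 50/50 split, simple z-score), not the actual scheme any vendor uses.

```python
import hashlib
import random

# Toy sketch of a "green list" statistical watermark for generated text.
# The vocabulary, the 50/50 split, and the hashing rule are illustrative
# assumptions, not a production scheme.

VOCAB = [f"tok{i}" for i in range(1000)]  # stand-in vocabulary
GREEN_FRACTION = 0.5

def green_list(prev_token: str) -> set:
    """Derive a pseudo-random 'green' subset of the vocabulary from the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(len(VOCAB) * GREEN_FRACTION)))

def detection_z_score(tokens: list[str]) -> float:
    """How far the observed count of 'green' tokens exceeds what chance would predict."""
    n = len(tokens) - 1
    hits = sum(1 for prev, cur in zip(tokens, tokens[1:]) if cur in green_list(prev))
    expected = n * GREEN_FRACTION
    variance = n * GREEN_FRACTION * (1 - GREEN_FRACTION)
    return (hits - expected) / (variance ** 0.5)

# A watermarked generator would bias sampling toward green_list(prev_token) at each
# step; text produced that way yields a large z-score, while ordinary text does not.
```

The detector needs only the hashing rule, not the model itself, which is why marking synthetic text at the source is considered feasible while reliable after-the-fact detection is not.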

In terms of reframing discussions, there was a call to shift the focus from AI to automation. By doing so, a clearer understanding of the societal impact can be achieved, ensuring that potential risks are thoroughly assessed.

Regarding language-related AI models, speakers emphasized the importance of not conflating them and carefully considering their usage in different tasks. This highlights the need for a nuanced approach that takes into account the specific capabilities and limitations of different AI models for various language processing tasks.

The analysis also emphasized the importance of communities having control over their data for cultural preservation. Speakers stressed that languages belong to their respective communities, which should have the power to determine how their data are used. The ‘no-language-left-behind’ approach, which aims to extend language technology to all languages, was criticized as a colonialist project that fails to address power imbalances and the profits accruing to multinational corporations. It was argued that if profit is to be made from language technology in the Global South, it should be reinvested in those communities.

In summary, the analysis delved into the complexities and challenges surrounding LLMs and AI. It highlighted the limitations of web scraping for data collection and the associated concerns of privacy, copyright, and consent. The biases in LLMs and the potential risks of incorporating them into search engines were thoroughly discussed. The analysis also examined the risks and detection of synthetic media spills, as well as the need for reframing discussions about AI in terms of automation. The importance of considering language-related AI models in different tasks and the control of data by communities were underscored. Criticisms were made of the ‘no-language-left-behind’ model and the profiting of multinational corporations in the Global North from language technology in the Global South.

Diogo Cortiz da Silva

The use of the web as a data source for training large language models (LLMs) has sparked concerns surrounding user consent, copyright infringement, and privacy. These concerns raise ethical and legal questions about the sources of the data and the permissions granted by users. Furthermore, there are concerns about potential copyright violations when LLMs generate content that closely resembles copyrighted works. Privacy is also a major concern as the web contains vast amounts of personal and sensitive information, and using this data without proper consent raises privacy implications.

In response to these concerns, tech companies such as OpenAI and Google are actively working on developing solutions to provide users with greater control over their content. These companies recognise the need for transparency and user consent and are exploring ways to incorporate user preferences and permissions into their LLM training processes. By giving users more control, these companies aim to address the ethical and legal challenges associated with web data usage.
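In practice, the controls referred to here have so far taken the form of crawler opt-outs expressed through robots.txt. A minimal sketch is shown below; the GPTBot and Google-Extended user-agent tokens are the ones the two companies documented publicly in 2023, while the specific directives are illustrative only.

```
# Illustrative robots.txt: remain indexable for ordinary search,
# but opt out of crawls used for AI/LLM training.

User-agent: GPTBot            # OpenAI's training crawler token
Disallow: /

User-agent: Google-Extended   # Google's opt-out token for AI training
Disallow: /

User-agent: *                 # everything else, including search indexing
Allow: /
```

This remains an opt-out mechanism that crawlers honour voluntarily, which is why later contributions in this report argue it falls short of the explicit, opt-in consent content creators are asking for.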

The incorporation of LLMs into search engines has the potential to significantly impact web traffic and the digital economy. This integration raises policy questions regarding the potential risks and regulatory complexities of using LLMs as chatbot interfaces. As LLMs become more sophisticated, integrating them into search engines could revolutionise the way users interact with online platforms and consume information. However, there are concerns about the accuracy and reliability of LLM-driven search results, as well as the potential for biased or manipulative outcomes.

In addition to these concerns, the association of generative AI with web content presents challenges related to the detection, management, and accountability of sensitive content. Generative AI technologies can autonomously produce and post web content, raising questions about how to monitor and regulate it effectively. Detecting and managing sensitive or harmful content is crucial to ensuring the responsible use of generative AI and addressing the risks associated with false information, hate speech, and illegal materials. Similarly, holding responsible parties accountable for the content generated by AI systems remains a complex issue.

To address these challenges, technical and governance approaches are being discussed. These approaches aim to strike a balance between innovation and responsible use of AI technologies. By implementing robust systems for content detection and moderation, as well as establishing clear accountability frameworks, stakeholders can work towards effectively managing generative AI-driven web content.

In conclusion, the use of the web as a training data source for LLMs has raised concerns regarding user consent, copyright infringement, and privacy. Tech companies are actively working on providing users with more control over their content to address these concerns. The integration of LLMs into search engines has the potential to impact web traffic and the digital economy, leading to policy questions about potential risks and regulatory complexities. The association of generative AI with web content raises queries about detecting sensitive content and ensuring accountability. Technical and governance approaches are being explored to navigate these challenges and foster responsible and ethical practices in the use of LLMs and generative AI technologies.

Audience

The discussion revolved around various topics related to the effects of generative AI and large language model (LLM) development. Julius Endert from Deutsche Welle Academy is currently researching the impact of generative AI on freedom of speech. This research sheds light on the potential consequences of AI for individuals’ ability to express themselves.

The regulation of LLM development was also discussed during the session. The representative from META suggested that regulation should focus on the outcome of LLM development, rather than the process itself. This raises the question of how to strike the right balance between regulating the technology and ensuring positive outcomes.

The control of platforms and social media was another aspect of the discussion. It was noted that a few businesses have significant control over these platforms and the development of LLMs. This concentration of power raises concerns about competition and potential limitations on innovation.

The role of the state and openness in regulating LLMs was a topic of inquiry. The participants examined the role that the state should play in regulating LLM development and how to promote openness in this process. However, there was no clear consensus on this issue, highlighting the complexity of governing emerging technologies.

The discussion also explored the neutrality of technology, recognizing that different people bring different values and contexts of use to it. It was acknowledged that technology is not inherently neutral, and that its contexts of creation and use vary with the individuals and values involved.

Transparency in content creation by large language models was another area of concern. Unlike web page content and search engines, large language models lack clear mechanisms for finding and controlling content. This lack of transparency raises questions about the responsibility for the content created by these models and how stakeholders should be considered.

The discussion emphasized the need for the alignment of values in language models, with participation from different languages and communities. This inclusive approach recognizes the importance of diverse perspectives and ensures that the values embedded in language models reflect the needs and voices of various groups.

The notion of the internet as a ‘public knowledge infrastructure’ was also brought up, advocating for shaping the governance aspects of the internet to align with this goal. This highlights the need to democratize access to information and knowledge.

Furthermore, the economic aspects of content creation and the internet were given attention. It was noted that these aspects are often overlooked in discussions on internet governance. Participants argued for engaging in discussions about taxing and financing the internet and multimedia, particularly when creating new economic revenue streams for quality content.

These discussions provide valuable insights into the complexities and potential consequences of generative AI and LLM development. They underscore the importance of careful regulation, transparency, inclusivity, and economic considerations to ensure that these technologies are leveraged for the benefit of society. The discussions also highlight the significance of promoting openness and preserving freedom of speech in the digital era.

Dominique Hazaël-Massieux

The analysis examines several aspects related to LLMs and web data scraping, content creation, AI technology, search engines, and accountability. It asserts that LLMs and search engines have different impacts when it comes to web data scraping. While web data scraping has been practiced since the early days of the internet, LLMs, which are largely black boxes, make it difficult to determine the sources used for training and for building answers. This lack of transparency and accountability poses challenges.

Furthermore, the analysis argues for explicit consent from content creators for the use of their content in LLM training. The current robots exclusion protocol is considered insufficient in ensuring content creators’ explicit consent. This stance aligns with SDG 9 – Industry, Innovation, and Infrastructure, suggesting the need to establish a mechanism for obtaining explicit consent to maintain content creators’ control over their materials.

In addition, the analysis proposes that the content used for LLM training should evolve based on regulations and individual rights. This aligns with the principles of SDG 16 – Peace, Justice, and Strong Institutions. It highlights the need for a dynamic approach to permissible content, guided by evolving regulations and the protection of individual rights.

The integration of chatbots into search engines is seen as a UI challenge. Users perceive search engines as reliable sources of information with verifiable provenance. However, the incorporation of chatbots, which may not always provide trustworthy information, raises concerns about the reliability and trustworthiness of the information presented. Striking a balance between reliable search results and chatbot integration is a challenging task.

Making AI-generated content detectable presents significant challenges. The process of watermarking text in a meaningful and resistant manner poses difficulties. Detecting and verifying AI-generated content is complex and has implications for authenticity and trust.

The main issues revolve around accountability and transparency regarding the source of content. The prevalence of fake information and spam existed before LLMs and AI, but these technologies amplify the problem. Addressing accountability and transparency is crucial in combatting the spread of misinformation and promoting reliable information dissemination.

The analysis emphasizes the benefits and drawbacks of open sourcing LLM models. Open sourcing improves transparency, accountability, and research through wider access to models, but the valuable training data that contributes to their effectiveness is not open sourced. Careful consideration is required to balance the advantages and drawbacks of open sourcing LLMs.

Lastly, more transparency is needed in the selection and curation of training data for LLMs. The value of training data is underscored, and discussions on transparency in data sources and curation processes are necessary to ensure the integrity and reliability of LLMs.

In conclusion, the analysis thoroughly examines various dimensions surrounding LLMs and their implications. It explores web data scraping, content creation, AI-generated content, chatbot integration, and accountability/transparency. The arguments presented call for thoughtful measures to ensure ethical and responsible use of LLMs in a constantly evolving digital landscape.

Rafael Evangelista

The analysis provides a comprehensive examination of the current landscape of online content creation and compensation structures. One of the primary concerns highlighted is the financial model that rewards content creators based on the number of views or clicks their content generates. This system often leads to the production of sensationalist and misleading content. The detrimental effects of this model were evident during the 2018 elections in Brazil, where far-right factions used instant messaging platforms to spread and amplify misleading content for profit. This example exemplifies the potential harm caused by the production of low-quality content driven by the pursuit of financial gain.

Another significant aspect discussed is the need to reconsider compensation structures for content creation. The analysis points out that many online platforms profit from journalistic content without adequately compensating the individuals who produce it. This raises concerns about the sustainability and quality of journalism, as content creators may struggle to earn a fair income for their work. The discussion calls for a reevaluation of the compensation models to ensure that content creators, particularly journalists, are appropriately remunerated for their contributions.

On a more positive note, there is an emphasis on acknowledging the collective essence of knowledge production and investing in public digital infrastructures. The analysis argues that resources should be directed towards the development of these infrastructures to support the creation and dissemination of knowledge. The knowledge that underpins large language models (LLMs) is portrayed as a collective commons, and it is suggested that efforts should be made to recognize and support this collective nature.

However, there is also criticism towards the improvement of existing copyright frameworks. The distinction between fact, opinion, and entertainment is increasingly blurred, making it challenging to establish universally accepted compensation standards. Instead of bolstering copyright frameworks, the analysis recommends encouraging the creation of high-quality content that benefits the collective.

The analysis also highlights the potential negative impact of automated online media (AOMs), even in free and democratic societies. AOMs can incentivize the production of low-quality content, thereby hindering the quality and accuracy of information available online. To address this issue, the suggestion is made to tax AOM-related companies and utilize the funds to create public incentives for producing high-quality content.

In terms of governance, the analysis suggests that states should invest in developing publicly accessible AI technology. This investment would enable states to train models and maintain servers, therefore ensuring wider access to AI technology and its benefits. Additionally, there is an argument for prioritising state governance over web content functionality, as the web is regarded as something that states should take responsibility for.

The role of economic incentives in shaping the internet and web technology is highlighted, emphasising the influence of capitalist society and the need to please shareholders on internet companies. The analysis suggests viewing the internet and web through the lens of economic incentives to better understand their development and operation.

Finally, the importance of institutions in guiding content production is emphasised. The analysis posits that there is a need to regain belief in institutions that can hold social discussions and establish guidelines for content creation. The Internet Governance Forum (IGF) is specifically mentioned as a platform that can contribute to building new institutions or re-institutionalising the creation of culture and knowledge.

In conclusion, the analysis provides a thorough examination of the current state of online content creation and compensation structures. It highlights concerns about the financial model that incentivises low-quality content, calls for a reevaluation of compensation structures, advocates recognising the collective essence of knowledge production, and criticises existing copyright frameworks. It also explores the potential harms of AOMs and proposes taxing AOM-related companies to fund public incentives, stresses the need for state investment in AI technology and for governance over web content functionality, emphasises the role of economic incentives in shaping the internet, and highlights the importance of institutions in content creation. These insights provide valuable perspectives on the challenges and opportunities present in the online content landscape.

Vagner Santana

The analysis explored the concept of responsible technology and the potential challenges associated with it. It delved into various aspects of technology and its impact, shedding light on key points.

One major concern raised was the development of Web 3 and its potential to exacerbate issues related to data bias in technology. The analysis highlighted that large language models (LLMs) trained on biased data can perpetuate these biases, posing challenges for responsible AI use. Additionally, the lack of transparency in black-box models, which conceal the data they contain, was identified as a concern.

The importance of language and context in technology creation was also emphasized. The analysis pointed out that discussions often focus on the context of creation rather than the diverse usage of AI and LLMs, particularly in relation to their potential to replace human professions. It highlighted how language and context significantly influence the worldwide usage and benefits of technology, with local conditions and currency playing a crucial role in determining access and usage of technological platforms.

The analysis advocated for moral responsibility and accountability in AI creation. It expressed concern that LLMs, with their ability to generate vast amounts of content, might be used irresponsibly in the absence of moral responsibility. It argued that technological creators should have a vested interest in their creations to promote accountability for AI-generated content.

There was an emphasis on the need to study technology usage to understand its real impact. The analysis acknowledged that people often repurpose technologies and use them in unexpected ways. It noted that the prevalent culture of “building fast and breaking things” in the technology industry leads to an imbalanced perspective. Thus, comprehensive studies are necessary to assess and comprehend the true consequences of technology.

The analysis highlighted the delicate balance between freedom to innovate and responsible innovation principles. While innovation requires the freedom to experiment, adhering to responsible innovation principles is essential to mitigate potential harm. It pointed out that regulations often emerge as a response to changes and issues stemming from technology.

The analysis acknowledged the non-neutrality of technology, recognizing that different perspectives arise from the lens through which we perceive and discuss it. It emphasized that individuals bring diverse values to the creation and use of technology, underscoring the subjective nature of its impact.

Furthermore, transparency issues were identified regarding web content and LLMs. The analysis noted that creative commons offer control mechanisms for web content, but there is a lack of transparency in large language models. This raised concerns about control mechanisms and participation in aligning these models, suggesting a need for greater transparency in this area.

In conclusion, the analysis emphasized the significance of developing and using technology responsibly to prevent harm and optimize benefits. It examined concerns such as data bias, language bias, transparency issues, and the importance of moral responsibility. The analysis also recognized the varied values individuals bring to technology and the importance of studying its usage. Overall, responsible technology development and usage were advocated as crucial for societal progress.

Yuki Arase

In the discussion, several concerns were raised regarding web data, large language models, chat-based search engines, and information trustworthiness. One major point made was that web data does not accurately represent real people because its content creators are highly skewed. Social networking service (SNS) texts from specific groups, such as young people, make up a significant portion of web data. This unbalanced distribution of content creators leads to biased representations and an overemphasis on particular perspectives. Furthermore, it was noted that biases and hate speech may be more prevalent in web data than in the real world, underscoring the issue of inaccurate representation.

Another concern addressed was the inherent biases and limitations of large language models trained on skewed web data. These models, which are increasingly used in various applications, rely on the information provided during training. As a result, the biases present in the training data are perpetuated by the models, resulting in potentially biased outputs. It was argued that balancing web data to accurately represent people from all around the world is practically impossible, further amplifying biases in language models.

The discussion also touched upon the impact of chat-based search engines on information trustworthiness. It was suggested that these search engines may accelerate the tendency to accept responses as accurate without verifying information from different sources. This raises concerns about the dissemination of inaccurate or unreliable information, as people may place unwarranted trust in the responses generated by these systems.

However, a positive point was made regarding the use of provenance information to enhance information trustworthiness. Provenance information refers to documenting the origin and history of generated text. By linking the generated text to data sources, individuals can verify the reliability of information provided by chatbots or similar systems. This approach can help increase trust in the information and mitigate the tendency to accept responses without verification.
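A minimal sketch of what such provenance linking could look like in a chat-style system is given below. It assumes a hypothetical retrieval step that supplies candidate source passages with URLs, and uses a crude word-overlap heuristic in place of the alignment models the speaker’s group works on; both are illustrative assumptions.

```python
from dataclasses import dataclass

# Minimal sketch of provenance-linked generation: each sentence of a model's
# answer is paired with the retrieved source passage it most resembles, so a
# reader can check the claim against its origin. The retrieval step and the
# overlap heuristic below are illustrative placeholders.

@dataclass
class SourcePassage:
    url: str
    text: str

def token_overlap(a: str, b: str) -> float:
    """Crude similarity: fraction of the answer sentence's words found in the source."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa), 1)

def attach_provenance(answer_sentences: list[str],
                      sources: list[SourcePassage],
                      threshold: float = 0.5) -> list[dict]:
    """Pair each generated sentence with its best-matching source, or flag it as unsupported."""
    records = []
    for sent in answer_sentences:
        best = max(sources, key=lambda s: token_overlap(sent, s.text), default=None)
        supported = best is not None and token_overlap(sent, best.text) >= threshold
        records.append({
            "sentence": sent,
            "source_url": best.url if supported else None,
            "supported": supported,
        })
    return records
```

An interface could then display the matched URL next to each sentence and flag unsupported ones, giving readers the prompt to verify rather than simply accept the response.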

The discussion also highlighted the impact of current large language models primarily catering to major languages, which could exacerbate the digital divide across the world. It was pointed out that training language models requires a substantial amount of text, which is predominantly available in major languages. Consequently, languages with smaller user bases may not have the same level of representation in language models, further marginalising those communities.

Lastly, the discussion mentioned the potential of technical solutions like watermarking to track the source of generated texts, a step towards ensuring accountability for AI-generated content. However, it was noted that the effectiveness of these technical solutions also depends on appropriate policies and governance frameworks that align with their implementation. Without these measures, the full potential of such solutions may not be realised.

In conclusion, the speakers highlighted several concerns related to web data, large language models, chat-based search engines, and information trustworthiness. The skewed nature of web data and biases in language models present challenges in accurately representing real people and avoiding biased outputs. The tendency to accept responses from chat-based search engines as accurate without verification raises concerns about the dissemination of inaccurate information. However, the use of provenance information and technical solutions like watermarking offer potential strategies to enhance information trustworthiness and ensure accountability. Additionally, the digital divide may worsen as current language models primarily cater to major languages, further marginalising communities using less represented languages. Overall, a comprehensive approach involving both technical solutions and policy frameworks is necessary to address these concerns and ensure a more accurate and trustworthy digital landscape.

Ryan Budish

Generative AI technology has the potential to bring about significant positive impacts in various sectors, including businesses, healthcare, public services, and the advancement of the United Nations’ Sustainable Development Goals (SDGs). One notable application of generative AI is its ability to provide high-quality translations for nearly 200 languages, making digital content accessible to billions of people globally. Moreover, generative AI has been used in innovative applications like generative protein design and improving online content moderation. These examples demonstrate the versatility and potential of generative AI in solving complex problems and contributing to scientific breakthroughs.

In terms of regulation, Meta supports a principled, risk-based, technology-neutral approach. Instead of focusing on specific technologies, regulations should prioritize outcomes. This ensures a future-proof regulatory framework that balances innovation and risk mitigation. By adopting an outcome-oriented approach, regulations can adapt to the evolving landscape of AI technologies while safeguarding against potential harms.

Building generative AI tools in a safe and responsible manner is crucial. Rigorous internal privacy reviews are conducted to address privacy concerns and protect personal data. Generative AI models are also trained to minimize the possibility of private information appearing in responses to others. This responsible development approach helps mitigate potential negative consequences.
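As a rough illustration of the kind of pre-training filtering described here, the sketch below drops documents from a hypothetical blocklist of personal-data-heavy domains and redacts obvious PII patterns. The blocklist, the regexes, and the overall pipeline are assumptions for illustration; they do not describe Meta’s actual process.

```python
import re
from urllib.parse import urlparse

# Illustrative sketch of pre-training data filtering: drop documents hosted on
# domains known to aggregate personal data and redact obvious PII patterns.
# The blocklist and regexes are hypothetical examples only.

DOMAIN_BLOCKLIST = {"example-people-search.com", "example-leaks.net"}  # hypothetical

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),      # phone-number-like strings
]

def keep_document(url: str) -> bool:
    """Drop documents hosted on blocklisted domains."""
    return urlparse(url).netloc.lower() not in DOMAIN_BLOCKLIST

def redact_pii(text: str) -> str:
    """Replace obvious PII matches with a placeholder token."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def filter_corpus(docs: list[tuple[str, str]]) -> list[str]:
    """docs is a list of (url, text); return redacted texts from allowed domains."""
    return [redact_pii(text) for url, text in docs if keep_document(url)]
```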

An open innovation approach can further enhance the safety and effectiveness of AI technologies. Open sourcing AI models allows for the identification and mitigation of potential risks more effectively. It also encourages collaboration between researchers, developers, and businesses, leading to improved model quality and innovative applications. Open source AI models benefit research and development efforts for companies and the wider global community.

Ryan Budish, an advocate for open source and open innovation, believes in the benefits of open sourcing large language models. He argues that public access to these models encourages research, innovation, and prevents a concentration of power within the tech industry. By making models publicly accessible, flaws and issues can be identified and fixed by a diverse range of researchers, improving overall model quality. This collaborative approach fosters an environment of innovation, inclusivity, and prevents monopolies by a few tech companies.

In conclusion, generative AI technology has the potential for positive impacts in multiple industries. It enhances communication, contributes to scientific advancements, and improves online safety. A principled, risk-based, technology-neutral approach to regulation is vital for balancing innovation and risk mitigation. Responsible development and use of generative AI tools, along with open innovation practices, further enhance the safety, quality, and inclusivity of AI technologies.

Session transcript

Diogo Cortiz da Silva:
Good afternoon and good evening to everybody. Thank you for joining us in this session about large language models and the impact on the web. And we plan to anticipate some questions. We are now in the last day of IGF, and we had a lot of sessions regarding generative AI. And this session is a little bit different because we will try to focus on some technical aspects and how generative AI, in a general sense, could impact the web ecosystem. So when we planned this activity, we designed a structure in three main topics. One about data mining from web content. And we have a policy question for this. I read this. And so we have three main dimensions and three key policy questions that will guide our discussion here. But of course, we can go further on some aspects. And the first dimension is the web as data source for LLMs. And we have the policy questions. What are the limits of scraping web data to train LLMs? And what measures should be implemented within a governance framework to ensure privacy, prevent copyright infringement, and effectively manage content creator consent? And we prepared these policy questions, I think, about four months ago. And since then, we see some work on this. For example, OpenAI and also Google, they created a way to block data mining. So it’s an approach to give users more control of their content. We have a second dimension, that’s what happens if we incorporate generative AI, chatbots, on search engines. And for this dimension, we have the following policy questions. What are the potential risks and governance complexities associated with incorporating large language models into search engines as chatbot interfaces? And how should different regions, for example the Global South, respond to the impact on web traffic and consequently on the digital economy, if you have search engines replying directly to the query and not giving access or links to the original content? But we have a lot of other technical and ethical questions about this that can go further. And the third dimension is the web as the platform to post content generated by AI. And for this, we have the following policy questions. What are the technical and governance approaches to detect AI-generated content posted on the web, restrain the dissemination of sensitive content, and provide means of accountability? And for this workshop, we have an excellent team of speakers from different backgrounds, from different stakeholder groups, and from different regions. We will have Professor Emily Bender from the University of Washington, who will join us online. We will have Vagner Santana from IBM Research. We will have Yuki Arase from Osaka University, who is here in person. We will have Ryan Budish from Meta, who will join us online. We will have Dominique Hazaël-Massieux from W3C, the World Wide Web Consortium, who will join us online. And we’ll have Rafael Evangelista from the Brazilian Internet Steering Committee and Professor at the University of Campinas, who is also here. So I will start. Actually, every speaker will have 10 minutes for initial considerations. And we will start with Professor Emily Bender. Professor Emily Bender, thank you for joining us and accepting our invitation. The floor is yours.

Emily Bender:
Thank you so much. Ohayou gozaimasu. I’m joining you from Seattle, where it is the evening. And I have prepared just a few remarks. And I’m hoping I can share my screen for some visual aids partway through. But I’ll try that when I get there. To the first question about the limits of scraping web data to train LLMs, I think it is really unfortunate that we have come around as a global society to a situation where the default seems to be if somebody can grab the data, it’s theirs. That doesn’t have to be the policy standpoint. But we have to take action if we want to change it. And what I would like to see it change to is what Sacha Costanza-Chock calls consentful technology, where the data is collected in a meaningful opt-in way, only with consent of the people contributing the data. And the benefit that will come with that is that such data collection has to be intentional. Right now, the data underlying LLMs is largely collected very haphazardly. The push has been to get the largest possible data set because that leads to more fluent output. That leads to output that can seem to speak to more topics. And so it’s just been, let’s grab everything we can. That hasn’t left room for documenting it so that we know what’s there. And it also hasn’t left resources or room for really building something that is representative of the world we would like to build. It’s also, incidentally, not representative of the world as it is because the internet, as we’ll see with my examples in a moment, doesn’t reflect a neutral viewpoint on the world. Moving on to the second question, what are the potential risks and governance complexities associated with incorporating LLMs into search engines? These are enormous. And it’s really important to understand that a large language model is not an information source. The information that is stored in a large language model is literally just information about the distribution of word forms in text. It’s not information about the world. It’s not information about people’s opinions about the world. It does include reflections of opinions in the form of biases that are expressed via the distribution of word forms in text. Thinking about the implications for the Global South in particular, and starting first with that idea of bias, here’s where I want to try to share my screen. Let’s see if this works. I teach with Zoom all the time, so it should work. Just going to be brave and share the desktop. All right. Do you see a tweet? Hopefully. This is an author advertising a preprint paper. And what they did in this paper was they looked at the ways in which mentions of people and places basically cluster together in very large-scale collections of text. They’re looking at Llama 2. And this was presented as though it were a world model, rather than just correlations that entities in the US tend to be mentioned in the same kinds of textual circumstances. What is particularly striking, actually, about this graphic is just how sparse the data is in the Global South. And so we are getting lack of representation and then misrepresentation because we are relying on these data sets that heavily weight the gaze of the Global North. And that’s a big problem. The other thing that I wanted to show you has to do with pollution of the information ecosystem. So as we let these synthetic media machines just spill their synthetic text into the web, it doesn’t stay contained as the output of ChatGPT, but it moves from location to location. I tested this today. It is unfortunately still true.
If you put in the Google search query, no country in Africa starts with K, which isn’t even a question, but it’s a search query, out comes this false snippet. While there are 54 recognized countries in Africa, none of them begin with the letter K. And then it nonsensically continues, the closest is Kenya, which starts with a K. And where did this come from? So this is Google search. I’m not even using BARD here, but this is Google search taking a snippet from the first hit for this query, which is this page called Emergent Mind, where some developer has chosen to post the output of ChatGPT. I don’t know who this person is. I don’t know why they chose to post this thing. But somebody decided to give ChatGPT the input, did you know that there is no country in Africa that starts with the letter K? ChatGPT is designed to provide outputs that human raters say this is good. In other words, it’s designed to output a sequence of text that reads as what you want to hear. And so ChatGPT replies, yes, that’s correct, and then continues with that same string that we saw Google pulling up as its snippet for the search result. So there’s two big problems here. One is we have the output of the synthetic media machine that looks like very fluent English, and so it sort of slides in with other kinds of information. And the other is that our information ecosystem, just like a natural ecosystem, really is this interdependent collection of sites. And the synthetic text doesn’t stay quarantined to where it was output. I’ll stop the share there so that I can see my notes when you can’t. I want to move on to point C here. The question is, what are the technical and governance approaches to detect AI-generated content posted on the web, restrain the dissemination of sensitive content, and provide means of accountability? So technically speaking, with the synthetic text that we have now, this cannot be detected after the fact. It has to be marked at the source, and that means watermarking. That is not impossible. There’s really interesting work, for example, published at ICML this year for very clever ideas about how to put in watermarks in synthetic text that would be hard to detect and remove. But honestly, even something that is relatively easy to remove would be an improvement. Because if we have watermarks, then the default use case would contain the watermarks, and we could filter the synthetic text. And just like oil spills in the natural ecosystem, synthetic media spills in the information ecosystem are a situation where less pollution is better. Even if we can’t get rid of all of it, it’s worth designing policies to minimize it. So I really think we need policy action here, and we can’t just pin our hopes on some technological solution that would allow us to detect this stuff after the fact. So I think that is everything I plan to say. I want to make sure there’s time for everyone to speak. I look forward to learning from you all. Thank you.

Diogo Cortiz da Silva:
Thank you, Professor Emily Bender, for your considerations. And now I invite Vagner Santana from IBM Research. Vagner, the floor is yours.

Vagner Santana:
Thank you. I’ll try to share my screen, just a second. I’ll need to quit and try again. Sorry.

Diogo Cortiz da Silva:
So we’re waiting for Vagner to join us online. So we move to Professor Yuki.

Yuki Arase:
So thank you for inviting me to this exciting panel. So my points are quite largely overlapped with what Emily just said. But for the first point, first question, the limitations of scraping the web data to train large language models is that we should be aware that web data never represents people in the real world. It is highly skewed in many ways due to unbalanced distributions of content creators. For example, now, SNS texts occupy a large portion of web data, which come from mostly a specific group, particularly young groups of people using SNS. And also, like, social biases or even hate speeches can be more significant in the web data than what we really see in the real world. And there are a large amount of automatically-generated content, including noisy or even toxic ones in the web data. So web data can never be balanced to equally represent people in the world. And large language models trained on such data inevitably inherit the same feature or same trend or same characteristics of such web data. So it won’t be perfect, like the correct or trustworthy model as it is. So we should be aware of that. And for the second point, what are the potential risks and governance complications associated with incorporating large language models into search engines? So I think one of the serious concerns is that chat-based search can be too handy for people to use, which may accelerate the tendency to accept a response as correct or trustworthy without looking up different sources of information. So as I just said, the web data does not represent real people. And the web data, sometimes, there is a lot of wrong information. So large language models trained on such data has the same trend. So there’s no doubt. So the search is now our lifeline. And its advancement is really appreciated. But we must ensure the way to access various sources of information so that we can check the information is trustworthy or it’s something we should believe. So for this, I think it’s a good way to address this problem is to have a way to link the generative text to some kind of data sources, which allows us to understand what information these texts are based on. So as our group, we have been working on this kind of problem. So natural language processing can somehow help to identify alignments between generated and text in the real world. Such kind of provenance information gives us a chance to step back and think, wait, is this chatbot response really trustworthy or not? So another concern is that the current large language models cover mostly major languages because they are data-hungry and require a large amount of text for training. So text data of such scale is available only for major languages. And besides the evaluation and benchmark data sets that we are heavily rely on to developing such large language models, also concentrate on major languages. So yeah, so this trend may hinder the expansion of the technology to regional or local languages. which may worsen the digital divide across the world. So we should explore the way to train large language models in a data-efficient way and cover various languages and cultures and so on. So for the third question, what are the technical and governance approaches to detect AI-generated content? I think, yeah, I was about to refer to the same paper just Emily mentioned, like watermarking for the generated text. This is a technical way so that we can track down who generated, for which model generated such text. 
But as Emily said, this is just a technical solution and we need a policy or governance to really work with such kind of technology, really work in the world. So that’s all from my side, thank you. Thank you.

Diogo Cortiz da Silva:
Thank you, Professor Yuki. And now I invite Ryan Budish from Meta. Thank you, Ryan, for joining us. Thank you very much. Before I start, did you want to go back? I see Vagner is back on. I know we skipped him, so I just wanted to make sure that. Ah, okay, Vagner is back, right? No, so, yeah. I’m here. So can you try to, you are going to share your screen, Vagner? Yes.

Vagner Santana:
Okay, so let’s try. Can you see my screen? Okay. So now it’s okay, so. Thanks, I’m sorry for the previous situation. Well, I prepared a few slides just to try to delve into the questions presented by Diogo, but under the lens of this idea of thinking about the context of creations of technology and the context of use of technology. Well, as Diogo mentioned, we’re thinking about scraping web data, privacy, copyright, and also the use of LLMs to search engines, different regions, and how the digital economy may be impacted, and also the whole idea of detecting AI-generated content, dissemination accountability. For the first point, I like to think about how we came here, right? So first we had the web one, then web two, the social web, and then now with blockchain, with the promise of providing more trust. But if we pause here and think about LLMs, now we’re having this data used to train models, and then we have this black box without transparency about the data that is inside, and how is it going to be this web three plus with data? And the concerning thing about is how is it going to be when it started to be retrained on the data that it’s using, right? What are the biases? And this has already been mentioned by other panelists that we have bias, we know that, and how is this going to be amplified by this, the way we have? And we have approaches like the robots TXT file to block, but that shouldn’t be the default, right? Capture anything that you have to block for someone to not use your contents, right? So, and we also can start thinking about machine learning attacks. People can start creating pages just to poison LLMs that are going to be trained on those datasets. So these are some of the aspects that I wanted to bring first. And moving to the second question, well, I often see this discussion about humans and humans substituted, replaced by humans plus AI, humans plus LLMs, right? And again, back in web two, we had just content creators creating content, e-commerce platforms, and then the consumption by people, social, and then conversions, and then this coming back to the platform and part to the creators, right? And nowadays with adding LLMs to this equation, we have this idea of having LLMs with the content creators creating maybe more content. And then there’s this whole promise of increasing productivity. Again, we have platforms, but now the consumption is not only by people. We have also robots consuming that to their own interests, right? Conversion, and then this will come back in some form and distributed. But back to this idea of replacement, I’d like to bring a discussion around. Usually we have this examples of certain categories and the one that I like to explore is about attorneys, for instance. So there’s this whole discussion that attorneys are going to be replaced by attorneys that use LLMs. But if you think about the language that these LLMs are being trained and the laws, and usually, so those platforms are paying US dollars, right? I’m located in New York, TJ Watson Lab, but I’m originally from Brazil and the currency impacts a lot on how you use those platforms, right? So this is one aspect I wanted to bring. And the thing is these discussions usually about replacement are closest to the context of creation of technology, right? People that speak the same or the most used language on those data sets used for training, right? And moving towards the third question, and we already saw some of these aspects. 
I think that one aspect that I wanted to emphasize here is the idea of the accountability and generated content and the understanding of how technology works, right? It predicts the next word. And if we get this into scale, we have really large content being created, right? And I like to see that and discuss about that being as a one way. And if we try to get, for instance, a reverse prompt in which we give a content and ask for a prompt, you will not get this. It is not trained for that. And it has no means of getting the input back. And there’s this whole idea of understanding the limitations and in Responsible AI, there’s this question around moral entanglement in which we should have technology creators of being morally entangled with the data and the technology they create, right? And I would also expand that to the content because nowadays we’re seeing people using some LLMs in a not better way, not right way. I brought some examples that I saw in a prompt engineering course of ways that people present on how to use large language models and here some quotes on creating blogs and answering social media about things that people don’t know, right? So I think that we should have also this idea of moral entanglement for content that also people create. And the thing is that we have this huge technology that is consistent and predictable and it’s hard to cover all possible outcomes of context. So this idea of techno-solutionism and techno-centrism brought us here, right? I brought really just an outline of the Responsible Inclusive Framework that we proposed in our team. The R&I Framework, it brings some discussions around context of creation versus context of use and this distance, how this distance can be concerning and a notion of stakeholders that goes to self-business up to society. And here, the idea of presenting this picture is that… Okay, no. And well, in context of use, we have creation of technology, prototyping, development, training and deployment. What we have in context of use, sorry, we have users, tasks, equipment and social and physical environments and all the possible variations of those, right? So we have really complex situations. So it’s nearly impossible to predict all possible contexts before and after, even after deployment. Let’s think here, some examples, people riding bikes while using mobile phones, right? And for developers, it’s hard for us to think about tasks and ways of people using mobile phones while riding bikes. And imagine this last one with six mobile phones and riding a bike. And here we can see that it’s the same app, Pokemon Go, but imagine that we are in a context that people are using six different LLMs interacting while riding bikes. It’s impossible to predict all of these possibilities. So why does distance matter? Because we have, the higher the distance, the more impersonal technology is. And that’s what we see nowadays. Technologies created in one region, used in all around the globe, a lived experience for people creating technologies are different from the ones impacted by these technologies, right? And this culture of build fast, break and fix, which is often popular, it influences in this impersonality for technology. And there’s also imbalance in terms of perspectives, considers. And unfortunately, the ones with power to compete, understand and promote changes are very few. So to conclude, without studying how technology is used, we are hindered for the real impacts. 
And our premises when creating technology are limited in terms of coverage of possible contexts. And we need more ways of covering all these possibilities, diverse teams and all the things that we already know. But there’s one interesting aspect is that people repurpose those technologies. We have been repurposing technology since web one, right? And some people use that in a really good way. So we need to empower those users, but also to prevent harmful aspects. And there’s this whole idea that innovation may need freedom to experiment, right? But also responsible innovation teaches us that we need avoid harms, do good, and implement governance to make sure that these two things are happening at the same time, right? And we see that usually we have regulations reporting to changes. And I think that this is one interesting way of starting and starting a change and responding to the things that we are seeing out there. Thank you.

Diogo Cortiz da Silva:
Thank you, Vagner, for bringing your considerations from the industry perspective. And now I invite Ryan from Meta to also share inputs from the industry. Thank you, Ryan. The floor is yours. Great, thank you so much. I’m thrilled to be here.

Ryan Budish:
I’m coming from Boston, Massachusetts, where it is quite late at night. So I’m going to try not to speak too loudly because my kids are sleeping in the room next to me, but let me know if you can’t hear me. So I wanna start by taking a step back. And I think that it’s still, even though it doesn’t feel like it some days, it’s still very much early days for generative AI technologies. And I think what these technologies might look like as they unfold is still a bit fuzzy, but it isn’t hard to imagine some of the huge positive impacts that they could have for businesses, large and small, for healthcare, for the delivery of public services, for advancing the UN Sustainable Development Goals and much more. And I think a lot of people, maybe they think about AI chat bots or some of the really fun generative AI tools, like some of those that Meta announced just a few weeks ago. But before getting into these questions, I just wanted to mention a couple of uses of large language models that we’ve developed that I think highlight some of the tremendous opportunity here. One area is translation, and we’ve published groundbreaking research and shared models for translation, such as our No Language Left Behind model and our Universal Speech Translator models. No Language Left Behind, NLLB, is a first of its kind AI research project that open sources models capable of delivering high quality translations directly between nearly 200 languages. Because high quality language translation tools don’t exist for hundreds of languages, billions of people can’t access digital content or participate fully in online communications and communities on the web in their preferred or native languages. And tools like NLLB can help address some of that. And when comparing the quality of translations to previous AI research, the NLLB 200 models scored an average of 44% higher, and it was even significantly more higher than that for some African and Indian-based languages. And we’re also developing this Universal Speech Translator, where the innovation there is that it can translate from speech in one language to another in real time, which is something that can work even where there is no standard writing system. And that’s really important because when you think about how a lot of language translation models work, particularly speech-based ones, they start with speech, translate it to text, translate the text from one language to another, and then translate that back in, transform that back into speech. And that breaks down if you don’t have a standard writing system in the middle there. And so something like Universal Speech Translator can help address that. And eliminating language barriers could be a profound benefit, making it possible for billions of people to access information online, across the web in their native or preferred languages. And we’ve also made other large language models available to researchers and have seen really tremendous research and innovation there, including like our OPT175B model, which has been used for all kinds of interesting applications, like generative protein design to improving content moderation tools online. And so I think that there is really a potential for immense benefits of these large language models on the web. But at the same time, there’s also undoubtedly risks and problems. And like any technology, an LLM itself is not inherently good or bad, but the critical question is what is it used for? 
And I think AI technologies and LLMs can drive progress on some of the most pressing challenges that we’re facing today. So when we think about governance, we have to strike a balance between mitigating these potential risks, particularly from high risk applications, while ensuring that we can continue to benefit from innovation and economic growth. And as we’ve heard already a couple of times today, in order to build these large language models and to have these benefits that they’re able to potentially bring, the volume of material required to train them is almost incomprehensible in scale. We’re talking hundreds of millions and sometimes billions of pieces of information is required to train a large language model. And in order to build these groundbreaking tools and have the training data necessary, many companies have to use data from a wide variety of sources, including data publicly available from across the internet. And the sheer scale of these systems is partly why these issues that Diego has teed up, rightly so, is why they’re so important and so complex. So on the first question, the piece that I wanted to dive into, to at least to start with, is about… is about privacy. And I want to talk about some of the ways that we’re trying to develop these technologies in a safe and responsible way with respect to privacy. I think we know we have a responsibility to protect people’s privacy. And we have teams dedicated to this work for everything we build, including our generative AI tools. A few weeks ago, for instance, we announced a bunch of exciting new generative AI products. And privacy was really important for how we develop those features with a variety of important privacy safeguards to protect people’s information and to help them understand how these features work. Our generative AI features go through rigorous internal privacy review process, for example, which helps us ensure that we’re using people’s data responsibly while building better experiences for connection and to help people express themselves online. For publicly available information, for example, we filtered the data set to exclude certain websites that commonly share personal information. And importantly, we didn’t train these models on people’s private posts. And for publicly shared posts on things like Instagram and Facebook, they were a part of the data used to train generative AI tools. And we train our generative AI models to limit the possibility of private information that one person may share while using a generative AI feature from appearing in responses to other people. Now, on the second question, this is something that we think a lot about how we can build these tools so that they can benefit everyone, including people in the global South. And one important way that we’re trying to do this is by making AI technologies more accessible to more people. We’ve been very public about our views on open source, most recently releasing Llama 2 and Code Llama models. And we do this because we believe that the benefits of AI should be for the whole of society, not just for a handful of companies. And we believe that this approach can actually make AI better for everyone. With thousands of open source contributors working to make an AI system better, we can more quickly find and mitigate potential risks in systems and improve the tuning to prevent erroneous outputs. 
And the more AI-related risks are identified by a broad range of stakeholders, including researchers, academics, policymakers, developers, and other companies, the more solutions the AI community, including tech companies, will be able to find for implementing guardrails to make these technologies safer. And an open innovation approach also has economic and competition benefits. LLMs are extremely expensive to develop and train, and that’s why, increasingly, AI development and major discoveries happen in private companies. But with open source AI, anyone can benefit from the research and development, both within companies and across the entire global community of developers and researchers. And this is something we’ve experienced firsthand in other contexts. Our engineers, for example, developed open source frameworks that are now industry standards, like React, which is a leading framework for making web and mobile applications, as well as PyTorch, which is now the leading framework for AI. And so now, on to the third question. Meta has learned from a range of experiences, both positive and negative, over the last decade. And we’re using these lessons to build safeguards into our AI products from the beginning, so that people can have safer and ultimately more enjoyable experiences. When we talk about watermarking, particularly for something like text, I think it’s important to note that our view is that generative AI doesn’t help bad actors spread content once it’s created. Bad actors can really only spread problematic content, whether AI-generated or not, through known tactics, like fake accounts or scripted behavior. And this means that we can actually continue to detect malicious attempts to spread or amplify AI-generated content using many of the same behavioral signals that we already rely on. And we know that generative AI can help bad actors create problematic content. So we have teams that are constantly working to get better at identifying and stopping the spread of harmful content. And we’re actually optimistic about using generative AI tools themselves to help us enforce our policies. And this issue is not unique to Meta. It’s a concern across industry. And that’s why Meta and many of our industry peers voluntarily joined the White House commitments, which include a commitment about watermarking AI content that would otherwise be indistinguishable from reality. But make no mistake, this is a deep and significant technical challenge. And currently, there really aren’t any common standards for identifying and labeling AI-generated content across industry. And we think there should be. And so we’re working with other companies through forums like the Partnership on AI in the hope of developing them. And so what should governance of this technology look like? We support principled, risk-based, technology-neutral approaches to the regulation of AI. We think that measures should not be focused on specific technologies, such as generative AI. Instead, our view is that regulation should be focused on the what, the outcomes that regulation wants to achieve or prevent, rather than the how. We believe that this approach is more future-proof and helps strike a better balance between enabling innovation and continuing to minimize the risks. So with that, I’ll stop there. Thank you.

Diogo Cortiz da Silva:
Thank you, Ryan. Now we move to the technical community, and I invite Dominique from W3C, the World Wide Web Consortium, to join us.

Dominique Hazaël Massieux:
Hi, everyone. Thank you, Diogo, for the invitation. Just a quick few words about what W3C is and maybe why I’m here. W3C, the World Wide Web Consortium, is one of the leading standards organizations for web technologies. And in particular, at W3C I’ve been in charge of developing our work on bringing machine learning technologies to web browsers, which has led me to look at the broader impact of AI on web content. So, to the three questions that were raised for this panel. The first one is around the limits of scraping web data. I think it’s interesting when you look at that question and you look at what exists today: scraping web data is something that probably started in the very early days of the web and has been a critical component of one of the tools we all rely on, which are, of course, search engines. And so one of the questions I wanted to raise is how LLMs and search engines differ in terms of scraping web data, and why they should be handled differently. And I think one of the clear answers to that has already been alluded to. Search engines today fulfill a role of intermediation between content creators and content consumers, where content creators can expect something back in the form of a link back to the original content. If you look at an LLM, in most cases, and maybe this will change, as others have said, this is a very fast evolving space, but today an LLM is mostly a black box. You get an answer, but you don’t know the sources from the training of that LLM, and you don’t know exactly which sources were used to build such an answer. And some of it is structural to the technology itself; it’s not just a limitation. Part of what an LLM does is compress all the information gathered across the whole corpus of data that was collected. So given the fact that copyright itself was, at least from my understanding, always built as a trade-off between incentivizing content creation and making sure the content would get widely published and distributed, I think the fact that LLMs today have, to say the least, an unclear story about how they consider the copyright of the content they integrate in their training raises a really fundamental question: understanding, indeed, whether it’s permissible for LLMs to use any kind of available text and data for that training, or whether, as Professor Bender said, this needs a lot more explicit consent from the content creators. And my perspective is that, indeed, the current robots exclusion protocol, which is really about excluding crawlers and says nothing about what the crawled data should be reused for, is not a sufficient mechanism to ensure the explicit consent of content creators. We need something a lot more robust and a lot more opt-in rather than opt-out, from my perspective. I think the question about privacy is also interesting. Again, if you think about the search engine comparison, something that has emerged over the past few years is the so-called right to be forgotten, where, at least in some regions, search engines have been mandated to remove content that is private in nature. And of course, there is also some controversy about the feasibility of this request and the overall impact on the information space. But if you think about that particular question and LLMs, again, is it even feasible today to untrain a specific part of an LLM that could have been learned from data that would have otherwise been removed from the public information space? To me, that illustrates some of the really tricky questions.
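To make the point about the robots exclusion protocol concrete, here is a minimal sketch using Python's standard urllib.robotparser against a hypothetical robots.txt. The GPTBot user-agent token is used only as an example of the AI-crawler tokens some operators publish, and example.org is a placeholder.

```python
from urllib import robotparser

# A hypothetical robots.txt for a site at example.org. The protocol itself long
# predates LLMs; "GPTBot" here is just an example AI-crawler user-agent token.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks, per user-agent, whether it may fetch a URL at all.
print(rp.can_fetch("GPTBot", "https://example.org/article"))         # False: crawling excluded
print(rp.can_fetch("SomeSearchBot", "https://example.org/article"))  # True: crawling allowed

# Note what the protocol cannot express: nothing here says whether fetched content
# may be indexed for search, used to train an LLM, or both. It is an opt-out from
# crawling, not a consent mechanism for specific downstream uses.
```

Even where such directives are honoured, they only govern whether a page may be fetched; they carry no statement of consent about downstream uses, which is the gap the speaker is pointing to.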
No matter how careful the training might have been and how carefully the data might have been curated, it assumes that this is a static set of permissible data when, in fact, what is permissible has to evolve over time based on the evolution of regulations, the evolution of individual rights, and so on. So I guess, to me, the answer to what the limits are is that they are pretty extensive. I think there needs to be a significant rethink of how training should be done. And of course, there is a lot of value in having a lot of text to create some of the really impressive output that LLMs have been able to bring. But that cannot be at the expense of making sure that, in particular, content creators keep the incentive to continue to create and publish that content. Because otherwise, at the end of the day, there won’t be anything left for LLMs to build on if content creators stop publishing their data. In terms of the questions around the complexities of incorporating chatbots into search engines, some of the main points, I think, have already been made. To me, one of the critical points, again made by Professor Bender, is that mixing something that users have approached as a source of reliable information with checkable provenance with something that is not meant as a tool of necessarily trustable or checkable information is a really challenging UI question. It is typically probably not a good idea, although there could be protections around it. Sorry, it’s 3am here, so the brain is still a bit waking up. And the fact that these interfaces are really sleek, in a way, makes the problem even more damning. But in terms of the complexity of the governance question, I think we are dealing with questions we’ve seen emerge again and again: what are the limits that can be put on things that are primarily products and user interface, or even user experience, considerations? I think we all agree that there has been a lot of value in allowing a lot of innovation and a lot of competition in that space, and so there are limits to what external governance can impose in that space. We are seeing some evolution of these limits with some of the regulations emerging, for instance with the Digital Services Act in the EU. But to me, there is something structural here in terms of governance, which is: who should have a say about what gets exposed in a search engine interface? And even if some of this may or may not be a good idea, who is going to be at the table to participate in these conversations? I don’t think it’s a simple question. Again, there’s a trade-off between enabling new ideas, new interfaces, new interactions, and making sure we don’t weaken some of these tools that have become structural, systemic in their importance, and I think that is something we are going to be facing for years to come. But one impact that I think we need to keep repeating, in terms of the importance of the web ecosystem, is that the fact that today LLMs don’t generate backlinks, they generate digested, compressed content, further goes against the role of search engines, not only the social role but also the economic role of search engines, which have typically operated with the notion that they serve as this intermediary between content creators and content consumers. Finally, on the third question, around the approaches to making AI-generated content detectable, there is definitely a challenging technical question.
How do you watermark text in a way that is meaningfully detectable and resistant to changes? And the latter, I think, points to maybe the more structural issue in that space, which is that some content that gets released and published is purely AI-generated, and LLMs provide scale and possibly, unfortunately, some level of apparent trustworthiness, in the sense that they produce very sleek output. But increasingly, my guess at least is that LLMs will be used not just as pure generators but as authoring tools, something that helps people create content, not just content that gets released as is. And when you get into that mode, it’s no longer binary: yes, this was created by AI versus this was created by a human. I expect a lot of the content that we will see in the years to come will be hybrid content, with AI having either provided a first version, provided corrections to existing content, or been part of an even more iterative process between human and AI-generated content. And how you mark such content, even without thinking of watermarking, what kind of metadata could be used to reflect this, is, to say the least, challenging. Of course, the need to mark at least purely AI-generated content remains important and worth addressing in itself. And I would say it’s probably even worth addressing for LLM trainers themselves: if you’re training your LLM on generated content, you’re likely going to create a lot of drift in the quality of the training over time, so there is value in being able to either exclude or at least treat such content differently. But at the end of the day, I think the real question that this particular trend of AI-generated content is bringing even more strongly to the surface is one of accountability and transparency about the source of content. Fake information and fake news haven’t waited for LLMs to emerge, and content farms for spamming haven’t waited for LLMs either. LLMs are very likely going to bring a different scale to the issue, and focusing on that alone doesn’t address the problem. To me, it’s really important that we address the broad issue, the issue of how we, as a society, manage these different levels of quality of content and the notion of who is responsible for content that gets published, and that we take into account the impact that LLMs bring to the scale of that issue; but I doubt that focusing specifically on LLM- or AI-generated content is the right framework for the discussion. I think the real critical gap I’m seeing in terms of governance here is one that I think this very panel is trying to address. We need a lot more structured conversations between technologists, researchers, and regulatory bodies to structure this space. So far, there are way too many siloed conversations within our own small communities. Having places, having opportunities, more than a panel, really day-long conversations about how we, with our various stakeholders and our various perspectives on the problem space, come to a set of, if not solutions, at least directions, at least places for experimentation that cross these barriers between technology and regulation, I think is really the critical piece. Because as long as these silos remain, the gaps between these conversations are the places where the things we don’t want to appear are going to thrive. Thank you.
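To make the watermarking challenge raised above more concrete, the following is a minimal toy sketch of one family of statistical text-watermarking schemes discussed in the research literature, in which generation is biased towards a pseudo-randomly chosen "green" subset of the vocabulary and detection measures the excess of green tokens. The vocabulary, parameters, and function names are illustrative assumptions only; they do not represent any speaker's or vendor's actual method, and a real scheme would operate over an LLM's full token distribution rather than a toy word list.

```python
import hashlib
import random

# Toy vocabulary and watermark strength; purely illustrative values.
VOCAB = ["the", "a", "model", "data", "web", "content", "open", "text",
         "language", "system", "quality", "public", "tool", "search", "user", "risk"]
GREEN_FRACTION = 0.5  # fraction of the vocabulary marked "green" at each step

def green_list(prev_token: str) -> set:
    """Pseudo-randomly partition the vocabulary, seeded by the previous token,
    so that generator and detector reconstruct the same partition."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    shuffled = VOCAB[:]
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(VOCAB) * GREEN_FRACTION)])

def generate_watermarked(length: int, seed: int = 0) -> list:
    """Toy 'generator': instead of a real LLM, sample uniformly but strongly
    prefer green-list tokens; that preference is the detectable bias."""
    rng = random.Random(seed)
    tokens = ["the"]
    for _ in range(length):
        pool = sorted(green_list(tokens[-1])) if rng.random() < 0.9 else VOCAB
        tokens.append(rng.choice(pool))
    return tokens

def green_rate(tokens: list) -> float:
    """Detector: recompute the green list at each position and measure how often
    the observed token falls in it. Unbiased text hovers near GREEN_FRACTION;
    watermarked text sits well above it."""
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:]) if tok in green_list(prev))
    return hits / max(len(tokens) - 1, 1)

if __name__ == "__main__":
    watermarked = generate_watermarked(300)
    rng = random.Random(1)
    unmarked = [rng.choice(VOCAB) for _ in range(300)]
    print(f"green-token rate, watermarked text: {green_rate(watermarked):.2f}")  # well above 0.5
    print(f"green-token rate, unmarked text:    {green_rate(unmarked):.2f}")     # close to 0.5
```

As the sketch suggests, the signal is purely statistical: paraphrasing, translating, or heavily editing the text replaces tokens and dilutes the green-token excess, which is exactly the robustness problem the speaker raises, and the hybrid human-AI authoring he describes weakens the signal further.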

Diogo Cortiz da Silva:
Okay. Thank you, Dominique, for your contribution. And now I move to Rafael from the Brazilian Internet Steering Committee, who is also a professor at the University of Campinas in São Paulo, Brazil. Thank you, Rafael.

Rafael Evangelista:
Thank you, Diogo. Firstly, I would like to thank you for the invitation and congratulate the organizers on the quality of the questions presented in this panel. However, I must say I won’t be able to address the complexity of all the issues mentioned in the activity description. One pressing concern I would like to address is the proliferation of low-quality content on the Internet, and the root of this issue, in my opinion, is the financial model that underpins much of the web’s content creation. The digital advertising ecosystem, which rewards content creators based on the number of views or clicks, has inadvertently incentivized the production of sensationalist or even misleading content. This is particularly evident in Brazil, where such content has not only misled the public but also posed significant threats to the democratic process. A case in point is the 2018 elections, during which certain far-right factions adeptly utilized instant messaging groups to disseminate and amplify online content. This content was then monetized either directly through the platforms or indirectly via digital advertising. And something similar happened in the context of the 2016 US elections, where the actions of Macedonian groups seeking economic gains are well documented. From the perspective of the developed nations, or the so-called global north, these practices might seem distant or even improbable. However, the reality in the global south, characterized by stark economic disparities and significant currency fluctuations, paints a different picture. There, many individuals, including young professionals, find themselves resorting to producing subpar or misleading content as a viable means of income. This trend isn’t limited to mainstream platforms. Even alternative media outlets, which traditionally championed unbiased and independent reporting, are succumbing to the allure of increased clicks and the subsequent revenue. The overall quality of content produced in Portuguese, speaking of the case of Brazil, has dropped considerably due to the perverse economic incentives for web publishing. The advent of large language models further complicates this landscape. There is a growing concern that LLMs might exacerbate and spread low-quality information. To counteract this, we must re-evaluate and overhaul the existing compensation structures governing web content production. The current business models, especially those of the major big tech platforms, have inadvertently skewed the balance, often to the detriment of genuine, high-quality cultural and informational content. In my capacity as a board member of CGI.br, we have dedicated time and effort to discussing potential legislative actions to curb this scenario. Our primary aim is to find ways to reallocate the enormous wealth accumulated by major technology corporations to fund better quality content. We believe that these resources can be instrumental in promoting and sustaining high-quality, diverse and inclusive journalism, which is crucial for a well-informed society. Our team is not just looking for short-term solutions. Instead, we are determined to craft a strategy that can overcome the prevailing market incentives, which, more often than not, tend to favor quantity over quality. A substantial part of our discussions focuses on how journalists and content curators can be fairly compensated for their work. Many suggestions on the table are rooted in copyright claims.
The core argument here is that many online platforms are reaping significant profits from journalistic content without providing just compensation to those who produce it, which is similar to what is happening with the LLMs. Interestingly, this debate parallels the discussions about the training of artificial intelligence systems, especially when it comes to the use of vast amounts of data without proper acknowledgment or compensation. While I personally find these arguments compelling and worth considering, the field of journalism introduces its own set of complexities. One of the most pressing issues is defining the boundaries of what truly qualifies as journalistic content and what does not. The blurred lines between opinion, fact, and entertainment content make it a daunting task to set universally accepted compensation standards. I believe that the solution isn’t merely to bolster existing copyright frameworks. Instead, we should focus on cultivating an environment that encourages the creation of high-quality content that benefits the collective. In the realm of journalism, this could manifest as public funds, sourced from tech giants but managed transparently and democratically, dedicated to promoting quality journalism. Implementing such mechanisms won’t be without its challenges, especially when it comes to defining quality journalism and safeguarding it from undue external influences. The challenges posed by LLMs are analogous. Take, for example, SciELO, a digital library that offers open access to scientific journals. Initially a Brazilian initiative, it now boasts participation from 16 countries, predominantly Portuguese and Spanish speaking. With over 1,200 open access journals, it’s a treasure trove of information readily available to LLMs for training purposes. This represents a significant public investment from the global south, which is now being harnessed to train technologies predominantly controlled by a select few corporations. In my view, the answer is not to restrict access to such invaluable resources, nor is it feasible to compensate every individual author of these scientific papers directly. Many of these authors are already compensated by their academic institutions to produce publicly accessible knowledge. It’s essential to recognize that while LLMs might be the brainchild of major corporations, the knowledge that fuels them is derived from a collective commons. Thus, our governance solutions should pivot away from individualistic compensation models. Instead, we should champion initiatives that acknowledge the collective essence of knowledge production and channel resources towards bolstering public digital infrastructures. In this sense, LLMs would be used as public digital infrastructures. Along with these public digital infrastructures, we need to establish governance and financing mechanisms that ensure the fulfillment of public and democratic interests. It seems clear that the technological and financial difference between companies from the global north and the global south creates a situation where only states have a realistic capacity to compete. The web, with its open and collaborative nature, was an infrastructure that excited everyone at the beginning of the 21st century due to the possibilities of producing free and accessible cultural commons. However, social media platforms soon emerged with their walled gardens, blocking content interoperability and privately appropriating collective production. LLMs represent a new chapter in this challenge.
They appropriate not only the expressed content, but also the ways we express ourselves, the very forms we use to express ourselves. And while LLMs undoubtedly bring benefits and have many uses, which has led to their rapid adoption, when used in the context of weakly regulated advertising and surveillance markets shaped by distorted economic incentives, they become tools for the further production of low-quality content. Thank you.

Diogo Cortiz da Silva:
Thank you, Rafael. Thank you to all the speakers for the initial remarks. We have different inputs from different stakeholders, and now I open the discussion. So, I invite the audience, both in person and online, to ask questions. And I also invite the speakers to comment on the content discussed here. So, we have two questions, three questions here, four questions on site. So, I think that we can run the mics. Yeah, it’s better than going there, I think. No, I think that we can run this mic here.

Audience:
Yeah, my name is Julius Endert from Deutsche Welle Academy, the German public broadcaster. So, I would like to connect to what you said. We are also trying to find out what the effects of generative AI in particular are on freedom of expression. Will it be a tool that allows more people to express themselves freely, or will it, on the other hand, be the opposite, and we see new limitations, especially in unfree media systems and surroundings and authoritarian regimes, and what is the effect on public discourse? That is my question. So, what is the effect on freedom of speech and on public discourse, to make it shorter?

Diogo Cortiz da Silva:
Okay, so I think that you can reply now. Yeah.

Rafael Evangelista:
As I was trying to say, authoritarian countries of course represent a different challenge, more like a specific context. But what I was trying to express is that LLMs will not only be used in that context; even in democratic contexts, in free countries, we have this bunch of incentives for the production of low-quality content, and I think LLMs will be used for that. And the thing that I think could be useful to combat that, to try to avoid those things, is to understand that we have to tax the companies and use those funds to create public incentives to produce content that is of quality and regulated, or governed, by public institutions that can be democratic. I think that’s it.

Diogo Cortiz da Silva:
Thank you. So we move to the second question. It was over there. You can go to the mic there, I think.

Audience:
Hello, my name is Teo from the University of Brasilia. I’m starting my question from the point that the representative from Meta just brought up, which is the idea that you don’t regulate the form or the process, but you regulate the product, the outcome. And I’m wondering, we’re talking about this in the context of very few businesses, just as platforms and social media are controlled by the same few businesses that control the development of LLMs, and not even states can compete with the pace of LLM development. What, then, would be realistic roles for the state and for openness in this scenario, considering that openness is also co-opted by the same platforms to develop their models? I wonder what your views are on the state and on openness models.

Diogo Cortiz da Silva:
Okay, so I think for these questions I invite… Ryan, to reply, and then I open to all the speakers to comment, okay? Ryan, are you there? Yeah, yeah.

Ryan Budish:
Yeah, so, thank you for this question. I think that, in some ways, I would push back a little on the framing of the question, because when you look at the companies that are developing these large language models, they are actually quite different and have rather different business models and incentives. And so I can speak for Meta and our view, and, as I said in my prepared remarks, we believe very strongly in open source and open innovation, and that’s something that we believe will not only help improve the quality of the models and improve the safety of the models, but will also help ensure that this isn’t just the domain of a handful of tech companies. When you think about how difficult and expensive it is to train the models, if you’re a small business or a researcher that wants to use a large language model, and the only options that are out there are proprietary models that you have to pay for, then you end up with a situation where there’s potentially a race to the bottom, where people choose cheaper, low-quality models, or maybe they try to build their own models, and there are a lot of challenges there as well. And so one of the things that we think about is that by open sourcing many models and making them available, we’re actually able to help support a lot of good research and good innovation in businesses, by making it possible for people to have access to many high-quality models. And so for us, I think it’s not about gating access to these models. It’s actually about how we enable more people to take advantage of these models, and then to be able to make them better, build on them, and innovate. And when researchers find flaws or issues with the models, those can be fixed, pulled back into the models, and then those fixes can be shared by everyone who’s building on top of those models. So anyway, those are some of my thoughts on the openness piece of it.

Diogo Cortiz da Silva:
I open to all the speakers if you want to follow up on this question. Emily Bender, yeah, you can go, Emily, please.

Emily Bender:
Actually, I have a comment on a different topic, so I’ll wait.

Dominique Hazaël Massieux:
Yeah, I was going to comment on this topic, so if I may. Just reacting to what Ryan was sharing about open source as a potential solution in that space: first, absolutely, the more open source we get on these models, the better in terms of transparency, accountability, research improvements, and, indeed, distribution of the benefits of LLMs. But I think there is a critical aspect of LLMs that makes open source a bit of a mixed story. You get open source access to the code, or to the models that are generated by the training, but you don’t get open access to the training data, which is clearly where much of the value of these models lies. So, really, it’s only half open in that sense. And given all the stakes there are in terms of selection and curation of the data, the fact that, for understandable reasons, those training data are not part of the opening makes it, I think, an imperfect answer to the question of openness. There are discussions that I think need to be had about transparency around training data sources and the curation process that has accompanied these sources, but until we are having this conversation, I don’t think that open sourcing the resulting model is a sufficient answer to this desire for openness.

Diogo Cortiz da Silva:
Thank you, Dominique. Emily, you?

Emily Bender:
Yeah, so on a slightly different topic, I want to say that all of these discussions become clearer if we stop using the phrase artificial intelligence, or AI, because it’s not well defined. We should talk in terms of automation and then talk about what’s being automated. And as we talk about language models in particular, it is, I think, unhelpful to conflate things like the use of language models as a component in automatic transcription or automatic translation systems with their use to generate synthetic media. Those are different tasks. They do happen to rely on the same trained models, but they’re being used very differently, and so from a governance perspective, I think it’s important to keep that straight. While I’m talking, I want to call out the fact that the No Language Left Behind model from Meta is a very colonialist project. I believe that languages belong to their communities, and that means communities should have control over what happens to data in their language. They should have control over what kind of technology is built, and if there’s profit to be made from building that technology, it should be fed back into those communities. I think this is an extremely important point for people from the Global South. It is not right for multinational corporations in the Global North to be profiting off of language technology from Global South communities. Thank you.

Diogo Cortiz da Silva:
Thank you, Emily. Rafael, do you want to comment on something?

Rafael Evangelista:
Just to add to the question made by Teo: I think you said that states don’t have the conditions to compete with those companies, but I think they do, if they are really invested in creating something that can be used by the public. The term open source has been used here, and it’s really hard to define what it means, because a project can use a license that is really free, or a license that is much more restrictive. But my point is, if the states recognize that the web is something they should care for, and that these tools to produce content are something that should be really accessible and controlled by the states or the communities or the public, they can invest not only in training models, but also in servers and infrastructure, because there are a lot of costs. And I think it’s not really realistic to think of Global South companies trying to do that, but the states can, or at least the bigger states of the Global South; we can think of the BRICS countries, etc.

Diogo Cortiz da Silva:
Okay, Wagner, you want to comment on this? Then I move to you.

Vagner Santana:
Yeah, I have a few quick comments around the idea of technology being neither good nor bad. I think that starts the discussion around the neutrality of technology, and it connects a little with the discussion I tried to bring about the contexts of creation and use, and how these differ across different people and different values. And that neutrality is not true, right? It’s not neutral, at least through the lens that we apply to this discussion. And it’s interesting how we’re discussing the content of web pages: if we connect this with other kinds of content, like media or code, we have mechanisms for control, like Creative Commons, to express how it may be used and how it may be redistributed, and for large language models this was just taken for granted when gathering data, right? And when we compare and contrast with search engines, as we discussed, we had a link back, we had ways of finding content; now it’s about generating content, and for content creators we don’t have transparency on that, or on how the stakeholders related to the very content being created are being considered, right? And on the idea of languages, I totally agree with Professor Bender, and there’s the whole discussion on value alignment: who is aligning those models, right? If we’re talking about different languages, different communities, are they participating in those alignments, right? So, yeah, thanks.

Diogo Cortiz da Silva:
Thank you, Wagner. We have one comment here, please.

Audience:
Thank you very much. My name is Peter Bruck, I’m the chairman of the World Summit Awards, and we started in 2003 to look at and show in which ways ICTs are used for the creation of quality content. Over the last 20 years, as part of the WSIS process, the World Summit Awards have created a library of about 12,500 examples of high-quality or higher-quality content projects, products, and initiatives, and about 1,600 winners at that level. I want to first congratulate the organizers of this session, because I think this has been one of the most substantial sessions of the IGF this year, and I want to stress that very much. It has been exceptionally good, and I also want to make sure that you see that the value you give to the different aspects, having somebody from Meta, somebody from the World Wide Web Consortium, different views, and also academia and the technical community, is really valuable. I want to stress a number of points and then come to a question. One of the points is that I really appreciate Emily Bender’s point on looking at colonialism with technology, especially when we look at the effects the platform intermediaries have had on the internet, and I want to just reiterate what I said in some other contexts here at the IGF: with the platform intermediaries, we have actually, through the internet, replaced and cannibalized the editorial intermediaries, and that is something really very, very key to the question of the large language models, which are creating new intermediary structures, and that I think is very important. The other thing is that I thought Wagner’s insistence on bringing up this issue of studying technology in the context of use, and how people repurpose technology in multiple ways, is a very valuable, interesting, let’s say culturalist attitude towards the technology. But then the question is whether he actually has examples of how large language models can do that and how they are structured, and so on; I think there are a lot of interesting aspects in this. My main point would, however, be on the question of looking at the web as a public information infrastructure, and that is something which is only, let’s say, part of the picture underlying the governance imperative for the internet. I would think that the governance imperative, and the goals and aspects, should go towards a public knowledge infrastructure, and that relates very much to the question of how to finance it. When we come to the model of journalism, the model of journalism is actually a model of creation based on, let’s say, the two markets of advertising and subscription, and now we need to look at what the economics of this new public knowledge infrastructure actually are. One of the criticisms which I have of IGF conversations and sessions is that the economic side is, let’s say, very much or largely ignored, and I want to thank Rafael very much here for bringing up this issue of the economics of content creation and how we do this, and I would be happy to engage at other fora on the issue of how to tax this and how to work this.
I think Deutsche Welle is a very good example of somebody who is really moving into the multimedia space in a very interesting way, combining a public broadcasting model with the creation of many different kinds of knowledge. But my question would be: in which way can we continue, within the IGF, this kind of conversation regarding, for instance, creating new economic revenue streams for quality content as part of a governance imperative for the internet? I hope that this has been a clear question. Thank you very much for giving me this space.

Diogo Cortiz da Silva:
Thank you for your comments and your question. Rafael, do you want to start?

Rafael Evangelista:
Yeah, thank you for your comments, which were really insightful. I think we have to recognize that the internet doesn’t live by itself in a separate realm or something like that; we live in a capitalist society, and this is what drives the companies. They can say they have ethical worries and guidelines, but we know that at the end of the day, the thing that is most important is to please the shareholders, etc. And I think we have to look at the internet and the web through the lens of the economic incentives that are at play for content and for the development of technology, and how this drives things. So I think it’s important, and the IGF can be part of that, to build new institutions, or to re-institutionalize the creation of culture, of knowledge, etc., to regain the belief in institutions that can socially discuss some guidelines for this kind of production, and to put much of our resources into this kind of institution. Thank you.

Diogo Cortiz da Silva:
Any other speakers want to comment on this? Because we are running out of time, we do not have time for more questions. So I would like to thank all the speakers and the audience for joining us today. We are at the beginning of a new era, and we are raising new questions, and I’m sure at the next IGF you will be here again discussing maybe the same topics, but with more information, and of course asking new questions. So thank you all, and the session is closed. Thank you.

Speech statistics (speed / length / time)

Audience: 153 words per minute / 1004 words / 395 secs
Diogo Cortiz da Silva: 134 words per minute / 1149 words / 516 secs
Dominique Hazaël Massieux: 133 words per minute / 2253 words / 1016 secs
Emily Bender: 178 words per minute / 1587 words / 536 secs
Rafael Evangelista: 124 words per minute / 1690 words / 818 secs
Ryan Budish: 162 words per minute / 2359 words / 876 secs
Vagner Santana: 160 words per minute / 1878 words / 703 secs
Yuki Arase: 146 words per minute / 735 words / 303 secs