Driving Social Good with AI: Evaluation and Open Source at Scale

20 Feb 2026 14:00h - 15:00h


Session at a glance
Summary, keypoints, and speakers overview

Summary

The panel examined how open-source software projects can remain maintainable and trustworthy as large language models (LLMs) and agentic AI increasingly generate code contributions. Sanket Verma introduced his NumFOCUS role and framed the discussion around the emerging “AI slop PR” phenomenon and the need for new safeguards and policies [1-2][6-7][151-181].


Mala Kumar described “AI red teaming” as a structured, contextual evaluation method that brings domain experts together to probe model failures, emphasizing that Humane Intelligence plans to release its red-team tooling under an open-source licence [14-18][33-34]. Tarunima Prabhakar added that open-source solutions are crucial for resource-constrained regions such as India, where shared evaluation stacks can prevent duplicated effort across organisations [40-45].


Sanket highlighted the vital role of community contributions in sustaining scientific libraries [46-49] and recounted two recent incidents: an OCaml pull request of 13 000 lines generated by a ChatGPT interaction that burdened maintainers [152-168], and an agentic AI-generated PR to Matplotlib that was rejected and led to a brief controversy [173-179]. He argued that clear policies on non-human contributions are essential to protect maintainers’ limited capacity [178-181]. Mala warned that undisclosed AI-generated code erodes credentialing systems and obscures provenance, complicating reviewer workloads [187-194][195-197].


To scale red-team efforts, the panel suggested ontological mapping of problem spaces to create representative prompts, especially multilingual ones, and to combine automated prompt generation with human oversight [290-295][300-304][326-328]. They cautioned that using the same LLM as judge can amplify bias, underscoring the need for spot checks by subject-matter experts [330-334][263-266][278-281].


Overall, the participants agreed that open-source AI evaluation tools are still in their infancy, requiring robust standards, human-in-the-loop safeguards, and community-driven policies to ensure sustainable, safe development of AI-enhanced open-source projects.


Keypoints


Major discussion points


Open-source AI evaluation and red-teaming as a community effort – The panel highlighted that AI red-teaming (structured adversarial testing) is being open-sourced to broaden access, and that vibrant open-source communities are essential for supplying data, techniques, and sustained maintenance of evaluation tools [9-21][32-34][46-49][66-70].


Maintainability and policy challenges posed by AI-generated contributions – Real-world examples (a massive OCaml PR generated by ChatGPT and an agentic AI submitting a pull request to Matplotlib) illustrate how LLM-driven code submissions increase reviewer workload, raise questions of provenance, and expose the need for clear contribution policies at both project and organizational levels [151-179][184-188].


Standardisation of evaluation artefacts and benchmarks – Participants argued for interoperable “eval-cards” or model-cards to enable reproducible assessments, but noted current practices are ad-hoc and especially difficult across multilingual, multicultural contexts; the lack of clear problem definition often leads to misaligned benchmarks [98-103][135-140][340-357].


Making evaluation tools usable for non-technical stakeholders – NGOs and program staff often lack engineering capacity; the discussion stressed that evaluation work must be accessible beyond developers, with clear guidance, human-in-the-loop checks, and documentation so that domain experts can safely deploy AI systems [115-118][236-244][278-281].


Opportunities to automate and scale parts of the evaluation pipeline – Ideas such as using LLMs to map large codebases, generate scenario prompts, apply ontological modelling, or even have models red-team other models were presented as ways to reduce manual effort while still retaining critical human oversight [229-234][290-295][317-321][313-316].


Overall purpose / goal


The panel aimed to explore how the open-source ecosystem can responsibly support the evaluation, red-teaming, and maintainability of AI/LLM systems. It sought to identify current challenges (e.g., AI-generated pull requests, lack of standards, multilingual safety) and to propose community-driven safeguards, policies, and tooling that lower barriers for contributors, NGOs, and other stakeholders while ensuring safe, reliable AI deployments.


Tone of the discussion


The conversation began with an informative and collaborative tone, as speakers introduced their backgrounds and the concept of open-source AI evaluation. As the dialogue progressed, the tone shifted to concerned and problem-focused, highlighting concrete maintenance headaches and policy gaps caused by AI-generated contributions. Toward the end, the tone became optimistic and forward-looking, emphasizing opportunities for automation, community-driven standards, and inclusive participation. Throughout, the panel maintained a constructive, solution-oriented atmosphere.


Speakers

Sanket Verma – Board of Directors, NumFOCUS; Technical Committee member, NumFOCUS; open-source maintainer and advocate for AI/LLM maintainability and policy development. [S4]


Mala Kumar – Representative of Humane Intelligence; former Director at GitHub (4 years); focuses on AI red-teaming, open-source evaluation tools, and benchmarking frameworks. [S5]


Ashwani Sharma – Engineer with experience at Google; speaker on open-source community building, multilingual AI evaluation, and the intersection of open-source and agentic AI. [S6]


Tarunima Prabhakar – Works at TATL (Technology for the Global Majority); focuses on online harms, open-source AI safety, and building open products for global-majority geographies such as India. [S1][S2]


Audience – Members of the summit audience (industry, academia, non-profits, government) who asked questions about risks of open-source AI scaling, benchmarking, and red-teaming.


Additional speakers:


None (all speakers in the transcript are covered by the list above).


Full session report
Comprehensive analysis and detailed insights

The panel opened with Sanket Verma introducing himself as a NumFOCUS board member and technical-committee participant, noting that NumFOCUS fiscally sponsors core scientific libraries such as NumPy, SciPy, Pandas and Matplotlib [1-4]. He framed the discussion around the emerging “AI slop PR” phenomenon (large-language-model-generated code submissions) and asked the audience to consider how maintainability, safeguards and policies must evolve in this new era [6-7][151-181].


The conversation was organized around three topics: (1) Evaluation & open-source software, (2) Red-team scaling & open-source tools, and (3) Agentic AI & open-source projects. Mala Kumar defined AI red-teaming as a structured, contextual evaluation method that assembles domain experts to devise adversarial scenarios and probe model weaknesses, rather than relying on generic benchmarks [14-20]. She announced that Humane Intelligence will release its red-team tooling under an open-source licence later in the year, thereby widening access to rigorous safety work [33-34]. Tarunima Prabhakar added that open-source guardrails are especially vital for resource-constrained regions such as India, where sharing evaluation stacks prevents duplicated effort across organisations [40-45].


Sanket emphasized that the scientific stack’s vitality depends on a vibrant contributor base that supplies data, techniques and ongoing maintenance [46-49]. Ashwani Sharma illustrated this with the Indic LM Arena, a community-driven effort that adapts the LMArena benchmark for Indian languages and invites further contributions to improve multilingual evaluation [66-70].


Sanket recounted two recent incidents that expose the maintenance burden of AI-generated pull requests. In the OCaml project a single PR added roughly 13 000 lines of code produced by ChatGPT, overwhelming maintainers who had to question the author’s intent and ability to fix downstream bugs [152-168]. A similar episode occurred with Matplotlib, where an agentic AI submitted a massive PR that was rejected because the project had no non-human contribution policy; a critical blog post followed and was retracted after dialogue [173-179][180-182]. Mala warned that undisclosed AI code erodes credentialing systems, obscures provenance and forces reviewers to expend disproportionate effort [187-196][195-197]. Ashwani noted that “AI slop” PRs have proliferated during events such as Hacktoberfest, prompting community pleas for governance measures; the Codot library was cited as receiving the most of these low-quality AI-generated PRs, with its maintainers asking GitHub to intervene [198-208].


These stories led to a consensus that clear policies are needed to manage non-human contributions. Sanket called for project-level and umbrella-level guidelines, while Mala pointed out that GitHub is actively discussing the addition of a label to identify AI-generated PRs [180-182][187-196]. The panel agreed that explicit labelling, provenance tracking and reviewer safeguards are essential to protect over-stretched maintainers [151-181][187-196].
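A policy like the one described here can be backed by lightweight automation. The sketch below is purely illustrative: the disclosure marker strings and the size threshold are invented for this example and are not any project's or GitHub's actual convention.

```python
# Hypothetical triage check: hold a pull request for extra maintainer review
# if its description omits an explicit AI-disclosure line, or if it is a very
# large change (like the ~13,000-line PR recounted above). Marker strings and
# the threshold are illustrative assumptions, not a real standard.

AI_DISCLOSURE_MARKERS = (
    "ai-assisted: yes",
    "ai-assisted: no",
)

def needs_maintainer_triage(pr_description: str, lines_changed: int,
                            size_threshold: int = 1000) -> bool:
    """Return True if the PR should be held for extra human review."""
    text = pr_description.lower()
    disclosed = any(marker in text for marker in AI_DISCLOSURE_MARKERS)
    return (not disclosed) or lines_changed > size_threshold
```

A check like this could run in CI and apply a label rather than block the PR, leaving the final call with the maintainers.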


Standardising evaluation artefacts was identified as a way to improve reproducibility. Mala suggested developing an interoperable “eval-card” analogous to model cards, enabling users to upload a specification and replicate the same evaluation across contexts [98-103]. She cautioned that current benchmarking practices are ad-hoc, especially across multilingual settings, and that without a well-defined problem space benchmarks can measure the wrong phenomenon [340-357].
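To make the eval-card idea concrete, such a card might bundle a handful of machine-readable fields plus a validator. Every field name below is hypothetical; as the panel notes, no such standard exists yet.

```python
# Illustrative sketch of an interoperable "eval card", by analogy with model
# cards. All field names are invented for this example.

REQUIRED_FIELDS = {"name", "target_models", "languages",
                   "failure_mode", "prompts_source", "adjudication"}

def validate_eval_card(card: dict) -> list[str]:
    """Return a list of problems; an empty list means the card is usable."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - card.keys())]
    if "languages" in card and not card["languages"]:
        problems.append("at least one language must be declared")
    return problems

card = {
    "name": "hiv-helpline-sexual-health",       # use case drawn from the session
    "target_models": ["model-a", "model-b"],     # placeholders
    "languages": ["en", "hi"],
    "failure_mode": "over-refusal on sexual-health topics",
    "prompts_source": "red-team session 2025-Q4",
    "adjudication": "domain-expert review",
}
```

Standardising fields like these is what would let two organisations compare "apples to apples", as Mala puts it.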


Human-in-the-loop oversight was repeatedly stressed. Mala contrasted “additive” Western software architecture with the “reductive” approach common in India, arguing that AI evaluation resembles the latter: we must knock out unsafe behaviours rather than build layers from scratch [80-88][89-91]. Tarunima gave a concrete example: a service for HIV survivors wishes its chatbot to discuss sexual health, yet many foundation models flag such dialogue as unsafe, illustrating that universal safety filters may conflict with local needs [124-130]. Both Mala and Tarunima warned that LLMs used as judges inherit the same biases as the models they evaluate, so spot-checks by subject-matter experts remain indispensable [324-328][326-328].
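The spot-check safeguard can be sketched as a simple sampling rule: route every verdict the LLM judge flags, plus a fixed fraction of the rest, to a subject-matter expert. The 10% rate and the "unsafe" label below are illustrative choices, not a recommendation from the panel.

```python
import random

# Minimal human-in-the-loop sketch: select LLM-judge verdicts for expert
# re-review. All flagged items go to a human, plus a seeded random sample
# of the remainder so the audit is reproducible.

def select_for_expert_review(verdicts, rate=0.10, seed=0):
    """verdicts: list of (prompt_id, judge_label) pairs.
    Returns the set of prompt_ids a human expert should re-check."""
    rng = random.Random(seed)                     # deterministic audit sample
    flagged = {pid for pid, label in verdicts if label == "unsafe"}
    rest = [pid for pid, label in verdicts if label != "unsafe"]
    k = max(1, int(len(rest) * rate)) if rest else 0
    sampled = set(rng.sample(rest, k)) if k else set()
    return flagged | sampled
```

Disagreements between the expert and the judge on the sampled items give a rough estimate of how much the judge's biases are distorting the evaluation.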


To scale red-teaming, the panel discussed several complementary techniques. Mala advocated an ontology-based mapping of problem domains (e.g., human-rights clauses, demographic groups) to generate representative prompts and ensure reproducibility [290-295]. Tarunima described using LLMs to auto-generate multilingual prompts from thematic inputs, noting that as LLM capabilities improve, automated prompt generation is expected to become more reliable, though human validation remains crucial for low-resource languages [296-304]. Ashwani highlighted clustering of model outputs to surface distinct behavioural classes that merit focused testing [313-316]. Sanket introduced the idea of model-to-model red-teaming, where one LLM attacks another, potentially automating vulnerability discovery [317-321].
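The ontology-based approach can be sketched as crossing a small map of the problem space with target languages to enumerate a prompt matrix. The themes, groups, languages and template below are placeholders; in practice an LLM would draft the actual prompts and humans would validate them, especially for low-resource languages.

```python
from itertools import product

# Illustrative ontology: a tiny map of the problem space. Real red-team
# ontologies (e.g., human-rights clauses x demographic groups) are far larger.
ontology = {
    "themes": ["housing rights", "healthcare access"],
    "groups": ["adolescents", "migrant workers"],
}
languages = ["en", "hi", "ta"]

def prompt_matrix(ontology, languages):
    """Enumerate one prompt slot per (theme, group, language) combination."""
    return [
        {"theme": t, "group": g, "lang": lang,
         "prompt": f"[{lang}] Ask about {t} for {g}"}   # placeholder template
        for t, g, lang in product(ontology["themes"],
                                  ontology["groups"], languages)
    ]
```

Enumerating the grid first makes coverage auditable: gaps (a group never tested in a given language) are visible before any model is queried.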


When the audience asked about benchmarking, Mala reiterated that benchmarks should be built after red-team insights to target the correct failure mode, using a clear definition of what is being measured (e.g., hallucinations in Yoruba vs bias in Hausa) [340-352][355-357]. The audience also raised concerns about the risks of “open-weight” (open-model) systems versus open-source software, prompting Mala to distinguish the two: open-source software concerns code transparency and maintenance, whereas open-weight raises separate data-access and model-distribution risks [257-260]. She responded that open-sourcing evaluation tools is low-stakes and largely beneficial, though it is important to prevent non-experts from adjudicating specialised domains [262-266][274-276]. This highlighted a modest disagreement on the perceived risks of open-source scaling.


Sanket suggested that LLMs could map large codebases, visualising functions, data flows and class relationships to help newcomers identify entry points for contribution [229-234], linking this to broader efforts to lower onboarding barriers for massive projects such as NumPy or Matplotlib [227-233]. The panel encouraged participants to engage with community initiatives like the Indic LM Arena and the forthcoming Humane Intelligence red-team suite [33-38][66-70].
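A first pass at this kind of codebase mapping does not even need an LLM: a structural outline of each module's functions and classes can be extracted mechanically and then fed to a model or shown to a newcomer. The sketch below handles one Python source string; mapping a library like NumPy would loop it over every file.

```python
import ast

# Sketch of the codebase-mapping idea: extract a module's top-level
# functions, classes, and methods as a flat outline.

def outline(source: str) -> list[str]:
    """Return 'func', 'Class', 'Class.method' entries for one module."""
    tree = ast.parse(source)
    entries = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entries.append(node.name)
        elif isinstance(node, ast.ClassDef):
            entries.append(node.name)
            entries += [f"{node.name}.{item.name}" for item in node.body
                        if isinstance(item, (ast.FunctionDef,
                                             ast.AsyncFunctionDef))]
    return entries
```

Visualising data flows and cross-module relationships, as suggested above, would need heavier tooling, but even this flat outline gives a contributor a list of entry points.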


The discussion highlighted shared viewpoints that open-source AI evaluation tools, community-driven contributions and human-in-the-loop oversight are essential for safe, sustainable AI development. Points of contention were limited to the perceived risks of open-source scaling and the degree of automation appropriate for red-team pipelines. Unresolved issues include establishing enforceable policies for AI-generated PRs, creating maintainable benchmark frameworks for low-resource languages, and balancing LLM judges with human checks. Action items emerging from the session are: (1) Humane Intelligence’s planned open-source release of its red-team software later this year [33-38]; (2) development of an interoperable “eval-card” standard [98-103]; (3) community-led mapping of large codebases to aid onboarding [229-234]; and (4) continued contribution to regional projects such as the Indic LM Arena to strengthen multilingual evaluation capacity. The panel concluded by underscoring the need for continued collaboration across open-source communities, NGOs, academia and industry to develop sustainable, context-aware AI evaluation practices.


Session transcript
Complete transcript of the session
Sanket Verma

Hello everyone. So my name is Sanket Verma and I serve on the board of directors of NumFOCUS. NumFOCUS is a non-profit organization based out of the US which is a fiscal sponsor for all the foundational projects used in AI like NumPy, SciPy, Pandas, Matplotlib. I also serve on the technical committee of NumFOCUS. I’ve been in the open source space for the last decade. I maintain open source projects and all that stuff. So my focus will be what does the maintainability look like in the age of LLMs and AI. And I think our community has been handling these AI slop PRs for quite some time and it’s about time we start thinking what does it look like, what kind of safeguards should be there, what kind of policies should be there.

And just to not sound too pessimistic, there are opportunities as well, like how these agentic AI and LLMs can be used to lower the barrier for the newcomers and contributors, how they can leverage it.

Mala Kumar

It’s on, but the button’s not illuminated, so very confusing. Great. So again, we have three topics that we’re going to cover in this panel, and I guess we’ll go ahead and kick it off on the first one. So the first topic is really around the idea of evaluation and open source software. At Humane Intelligence, we do focus on what we call contextual evaluations, so we’re not going to the hyper-automation that a lot of companies like to look at. We don’t also focus on benchmarks, which is kind of the industry darling. What we really focus on is AI red teaming, which is kind of a remnant thing from cybersecurity, where you would basically bring a bunch of people together to try to hack away at whatever tool that you’re building.

With AI red teaming, what we basically do is we create structured scenarios to probe different models in different directions, and we focus on the subject matter expertise. So if, for example, you work in public health or food security or education, we would bring those people together and then have them run through certain scenarios to look at different models and see where the points of failures may occur. And once we have that, we can either take the data and do things like structured data science challenges, or we can do benchmarks from there, once you have a much better idea of where the failure points, the vulnerabilities, may exist in your models in the first place.

One of the ways that I like to think about AI evaluations is really from my background, which is UX research and design. For those who have ever built software before, it doesn’t matter whether you were starting at basically nothing, you had no idea what your digital intervention was, or you had a very mature software product, there was some kind of method or methodology that would get you to the next stage. We’re at the early stages of AI evaluations right now, meaning there are a lot of gaps and honestly organizations like ours are making it up as we go. But that’s kind of how it goes with AI systems as it stands. But AI red teaming has turned out to be really interesting for both the capacity building side, so helping people understand what are kind of the inherent flaws or the makeups or the design decisions in AI systems and models, but then also, again, to find the failure points so that if they were to build a guardrail around their system, they would have an idea of what they’re looking at.

Is it refusal on a certain topic? Is it a different classification system for a certain topical area? Is it delving further into the problem space? Is it building a RAG system like Tarunima mentioned, if you need further documentation or something more robust for a certain part? And so there are a lot of different methods that can go about for the mitigations, but in order to get to that point, you have to understand what exactly is the problem in the first place. And so open source software has a really interesting intersection with that and a really interesting means to make that more accessible.

And one of the things we’re doing at Humane Intelligence, thanks to the support of Google.org, is we’re going to be opening up our AI red teaming software through an open source software license. So that will come out later this year. My colleague Adarsh is in the audience. He’s going to be primarily helping us on that, so you can go talk to him if you’ve got technical questions. But we’re really excited about that because, again, it means more accessibility for the broader community. And so with that long-winded explanation, I’d like to turn it to my fellow panelists for their thoughts on why open source and AI evaluations is important.

Tarunima Prabhakar

Yeah, I can just come in on the open source piece. So at TATL, we’ve been looking at online harms now for over six years, and from the get-go, we were clear that the products that we build have to be open. The specific reason for that is that when you are looking at a lot of global majority geographies, you’re looking at India, right? Often we don’t have the resources to reinvent the wheel. So if one organization, it’s complex enough to build something out once, to then spend the same amount of resources, in this case it would be, as Mala was saying, for red teaming, but if you also had to think about it just in terms of an evaluation stack, which is keeping track of your inputs and outputs.

Or if, let’s say, we have figured out one way of doing human review or a human evaluation, and then figuring out how do you go from there to building a guardrail, that same guardrail is useful for other organizations as well. And we don’t have the resources, so the efficient way is for that knowledge to be shared and reused rather than for the limited set of resources to be fractured across six organizations to do the exact same thing. So, yeah, like in general, I think if we are trying to build safer applications, build more robust applications in the global majority, in India, we do think open source is actually a big part of doing that.

Sanket Verma

So I would like to focus on the community aspect of the open source. So all the projects that we have been using in our research and in our academic uses or in production, they have a wonderful community behind them. And I guess the evaluations and the red teaming could definitely use a big push from the community: the inputs, the data sets, the different techniques and all that stuff. And the community plays a vital role in sustaining the project and keeping the project moving forward. I’m mostly from the scientific open source stack, so I’m not familiar with the projects that do the AI evaluation in that space, but I guess they have a wonderful community, and it plays a vital role in how this can stay relevant as the trends change every day.

Ashwani Sharma

So, actually, it’s very interesting going back many years, actually, and I reveal my age here, but whatever. I used Linux back when there was a magazine called PC Quest, which used to have Slackware Linux coming on its CDs back in the mid-’90s, and, you know, install that thing on, like, a Pentium computer. And for a long time, actually, in India, we were consumers of open source, and we were not so much contributors to open source. When I joined Google, there was this competition called Google Summer of Code. It’s not really… You can’t really call it a competition because it was about contributing to open source, and it wasn’t like there were prizes. Just that the teams which were selected would be paid the equivalent of a summer internship stipend to contribute to open source.

And the rankings were by universities. And for the longest time, guess what? The global leader was the University of Moratuwa in Sri Lanka, because some professors just got into this idea that students contributing to open source will learn better software engineering. And they were the global leaders. And then one year, it flipped. And our IITs and IIITs just got on top of that and have stayed on top of that. And I think that somewhere the sentiment changed, and we became very active contributors to open source as the software engineering community in India. And now, with evaluations, things are continuing. Our academic labs publish different forms of evaluation mechanisms and also benefit from things done elsewhere in the world.

And one example that I want to give is that the IIT Madras AI4Bharat lab launched what’s called the Indic LM Arena. And that was basically on the basis of the LMArena work that’s happened at Berkeley, adapting that for the Indian context, Indian languages. And now they’re starting to build a community around that. So I’d urge you to consider going there and seeing whatever framework that they have going, contributing your insight into whether the models work for the Indic context. And that’s the community and the open source coming together for evaluations. Not so much safety, but more in terms of multilinguality and context.

Mala Kumar

Great. Yeah, I think a couple final points I’ll just add based on our experience at Humane Intelligence. One thing we’re seeing, obviously, is that the world of LLMs is ever changing and it’s new. I mean, we’re in new territory. And so one of the reasons why open source, we think, is going to be very powerful is because it’s just really complicated, honestly. We need to rebuild, sorry, Adarsh, our software every time we need to run it, retrofitted for another model. And so by creating an open source technology, we’re hoping that more organizations can essentially create an evaluation layer in their own tech stack. One of the analogies that I talk about a lot with AI evaluations is architecture.

And I think being here in India is a great example of that. In the West, you know, I grew up in the United States, we have what we call additive architecture. So you basically start with nothing and you build your way up to your final thing. But here in India and a lot of Eastern cultures, you have reductive architecture. So you might start with a giant piece of limestone and basically knock out a bunch of things and then you come up with your final product. That’s kind of what AI evaluations are. So non-algorithmic, non-LLM-based software is more additive in that you have to get to the end of the software development life cycle in order to create your final thing.

But with AI based technologies, because you’re starting out with such a complex and robust technology, a lot of what you’re doing is actually knocking out pieces to create the final thing. And so the evaluation layer is actually really important because if you’re trying to do something for social good, especially like a high stakes environment or a high stakes topic, then you have a very robust technology that might actually make your problem worse because people can interact with it in ways that you don’t want them to do. And they can generate things that are actually really harmful in the end. So by creating that internal evaluation layer, we can help people knock out the pieces and essentially create the tool that they want so that they get the result, they get the outputs that are safe and actually additive to their work.

And so the open source technology, we feel, will enable a lot more organizations to, again, create that internal evaluation layer and then get to the next step in achieving their goals with AI for good. All right. We’re going to move on to our second topic now. Yeah, go ahead.

Ashwani Sharma

So actually, you spoke about open source software for red teaming. That’s wonderful that you’re creating something that’s reusable for many, many organizations. For the audience, what are some of the things that you’re doing so that people could create new frameworks of evaluations by themselves? With the productivity of how you could code with AI tools, what do you think is the effort required to be able…

Mala Kumar

Yeah, it’s something we’ve thought about for a long time. If we can create some kind of standardized open source evaluation, like a model card essentially, if we could do an eval card, if we made that an interoperable standard, then in theory somebody could take an eval card, essentially upload that into the software, and then they could replicate that evaluation for their own context. It is something that we’ve thought about quite a lot. I don’t know with this software release if we’ll get there anytime soon, honestly, because we’re just working on that infrastructure piece, but we would like to standardize the outputs that come out eventually so that people can compare apples to apples, because that is one of the challenges now with AI evals, is that again, everybody…

is kind of making it up as they go. And it’s very hard to replicate all those decisions. It’s very hard to document every single decision, especially in multicultural contexts, which is my not awkward segue into our next topic. But yeah, it’s a good question, and hopefully we’ll get there.

Tarunima Prabhakar

Can I, so I just wanted to add something to what you were saying. Some of the organizations that we’ve looked at, where we’ve just looked at their inputs and outputs, are with an organization called Tech for Dev. They have a cohort that they run, and so we’ve been looking at the nonprofits there. And we’ve also looked at certain organizations that are more technically adept. So actually, let me backtrack. What we’ve noticed is that a lot of nonprofits across a range of capacities, which may or may not have technical expertise in-house, are building out AI applications, because I think the market has figured out that process. There are actually good incentives to make the application development easier.

And so you have a lot of people, you know, I mean, AI chatbots are actually, at this point, fairly easy to build. The second step, which is actually figuring out whether that bot is working for your use case, is where there is actually less investment at the moment, right? And we can have software engineers do some of that automation, but a lot of the non-profits don’t have those software engineers. And I think, so on the open source side, when we talk about the software side, I also think there’s another layer that we need to think about, which is how do you make all of these processes accessible to non-technical audiences?

How do you make it accessible to program staff that is actually running, say, a nutrition program on the ground? Yeah, I have more to say, but I think I’ll come to it on the multicultural topic.

Mala Kumar

Yeah, no, I think that is actually one of the key points, too, because it’s not so evident for a lot of organizations, especially those working in the social sector for social good. They have the program evaluation, they have the overall software and design, UXR, but they don’t necessarily understand there’s also now the model evaluation. So it’s not apparent to a lot of organizations that this is yet another thing they must evaluate, because it is kind of deceptively simple, as you know, to build a chatbot. Almost anybody can do it, but then it turns out your chatbot can run amok pretty easily. So you need to test it before you deploy.

Tarunima Prabhakar

I guess we can open it to Q&A in a bit, but I just wanted to bring out one interesting anecdote around context and the need for, say, model cards, contextual use cases. So one of the organizations that we looked at runs a service for basically survivors or caretakers of HIV patients. So they’re also working with adolescents, and they want the adolescents to have conversations around sexual health. And interestingly, what a lot of models, your foundation models, would say is unsafe and discouraged as a conversation is precisely what they actually want the users, the adolescent users, to be able to have with that service. Because they think that to say that this is unsafe and therefore our service will not engage with this conversation is doing no better than maybe the parents, maybe the society, and they think that’s actually counterproductive to the kind of support they want to provide.

And that’s actually a very interesting problem because in some ways this was our first time listening to a use case where people were saying we actually don’t want the safeguards that the default models are operating with. At the same time, there are a lot of other non-profits that do work with adolescents who actually will not want to encourage that conversation at all. For them, they’re very clear, we don’t want our users to have any conversations about sexual topics with our service. And so I think, again, there are a lot of emerging issues, we don’t quite know how to resolve all of it, but the only way we can start actually having or moving to some of the solutions faster is by documenting publicly, openly as much as possible, and then having a collective conversation about it.

Yeah, so I think I had done the opening for multicultural, and I have kind of brought it back to that. Is there anything that, Sanket, you want to add on it?

Sanket Verma

So, this is a nice idea. Like, you know, I’ve been doing machine learning and deep learning since it was cool, you know. And I guess there is a field which already exists, known as adversarial machine learning, which kind of injects attacks onto your model, like fake data and all that stuff. What I’m trying to say here is, is it possible that we can borrow from the concepts which already existed in the previous years and use that for AI evaluations, and maybe do black box red teaming or white box red teaming? Mostly adversarial attacks were used for vision models, so how can we tune that for textual models like LLMs and all that stuff?

Mala Kumar

Yeah, one of the things that comes up all the time in our AI red teaming is prompting in two languages, so doing Spanglish, a mix of Spanish and English, or mixing languages written in different scripts. It's actually a very common technique in adversarial AI red teaming to use multicultural prompts. But then one of the other questions that Tarunima brought up earlier is this idea of the prompt response and your adjudication of it, whether it's acceptable or unacceptable, good or bad, or whatever distinction you're trying to draw. Telemetry, as we all know from having worked in software development, is not a science. It's very hard to determine from somebody's IP address or MAC address where they're actually physically based, and therefore which law or jurisdiction applies to them and what kind of cultural context they may bring.

There's a lot we have to infer when we're looking at the prompt responses. And so one of the issues with multicultural AI red teaming, and I think this will come up a lot with our open source software, is exactly what an acceptable response would be in certain cases. That's one of the many multicultural aspects we're excited about, honestly, in open sourcing our technology. We're hoping we'll get a lot of evaluations in different languages and different cultural contexts, so we can start to understand what's working for different models. How are we on time?
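
The code-mixed prompting technique described above can be sketched roughly as follows. The translation table, seed prompt, and `code_mixed_variants` helper are hypothetical illustrations, not part of any real red-team tooling; a real campaign would rely on native speakers rather than word-for-word swaps.

```python
# Sketch: generating code-mixed ("Spanglish"-style) variants of a seed
# prompt for adversarial red teaming. The translation table below is an
# invented placeholder covering just two words.
TRANSLATIONS = {
    "medicine": {"es": "medicina", "hi": "दवा"},
    "safe": {"es": "seguro", "hi": "सुरक्षित"},
}

def code_mixed_variants(prompt: str, langs=("es", "hi")) -> list[str]:
    """Swap each known English word for its translation, one language at a time."""
    variants = []
    for lang in langs:
        words = prompt.split()
        swapped = [TRANSLATIONS.get(w.lower(), {}).get(lang, w) for w in words]
        if swapped != words:  # only keep variants that actually mix languages
            variants.append(" ".join(swapped))
    return variants

variants = code_mixed_variants("Is this medicine safe for children?")
```

Each variant keeps the sentence structure of the seed prompt but mixes scripts, which is the property the panel says often slips past safety filters.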

Ashwani Sharma

Yeah. Okay. As we were talking about safety and multicultural aspects, it gets even more complicated with agents, because you're not just talking about interpretation, you're talking about action. And this is one of those places where, in general, if you go back to the idea of software testing, that is a discipline which has been built and refined over the last fifty or more years. If, very crudely, I could say evaluations sit somewhere between testing and security audits, then we are very, very early. And we have seen with agents in the last two weeks, with a certain bot, how things are going.

So we all have some comments to say about that.

Mala Kumar

Well, yeah, actually, that was our third topic: agentic AI and OSS. So Sanket, do you want to?

Sanket Verma

Yeah, I would like to start this with two small stories which happened very recently in our open source space. There's the OCaml programming language, a functional programming language used, among other things, for security purposes. Towards the end of last year, I think, a person submitted a pull request. For the general folks: a pull request is basically when you submit code to add a feature to an existing code base. This person added 13,000 lines of code in a single pull request, which is a very huge thing. Usually such pull requests simply get closed if there has been no proper discussion prior to submitting them.

And this was just buggy code with many patches and so on. It also mentioned the names of some folks who were not related to the project in any manner. If I remember correctly, it's pull request number 14363 in the OCaml code base. What's interesting to see is that the maintainers of the project, of the language, interacted positively with this person. They tried to understand: what's the reason, why do you want to submit this? Do you understand what this code is and what you are trying to do? What if breaking changes happen down the line?

Are you able to come back and fix this? Because this is a very heavy pull request. And the person had no idea. He said, I was just chatting with ChatGPT, and I could generate a long code base, and I just submitted a pull request. Eventually, obviously, the pull request ended up being closed, and it went nowhere. But the thing to mention here is that it adds a lot of maintenance overhead for these maintainers. These maintainers are overworked all the time. They're working in research labs, they're working in organizations, and in their free time they're managing projects. So that was the story of a person using LLMs to try to add code to the code base.

The other example is very recent, I think only a week ago. Folks may have heard about the library known as Matplotlib. An agentic AI tried to do a similar thing, a big change to the code base, and when the maintainers realized that the GitHub profile trying to add the code was not a person but a computer, they closed the pull request, stating that we do not have a policy for non-human contributions as of now. So what the agentic AI did was go rogue and write a blog post on the internet shaming the maintainers: you are gatekeeping the contributors and you should open it all up.

Obviously this stirred a lot of controversy in our ecosystem, but folks realized they should chat with this agentic AI, and after chatting with it, the agentic AI withdrew its first blog post and wrote another blog post apologizing for what it had done earlier. The first blog post was very critical and shamed the maintainers, and as I said earlier, these maintainers are overworked, with limited resources and time on their hands. So it adds pressure, and it raises the question: what does maintainability look like in the age of AI and agentic AI? We should have better policies, project-wise and also at the upper level.

Organizations like NumFOCUS are working on implementing these policies across the scientific open source stack. And I heard that GitHub has been considering this, as the AI slop PRs have been increasing over time. They are discussing whether it makes sense to add a label or something on the PR which says this PR should be closed because it was generated by AI. I wonder if my panelists have any thoughts about what this looks like and…

Mala Kumar

So

Sanket Verma

So many, oh my God. Yeah, exactly. I would like to narrow down the question: what does it look like, what challenges and opportunities does it present, and basically, how should we defend ourselves and our software?

Mala Kumar

Yeah, I mean, having been at GitHub, I was a director there for four years. So much of the incentive of open source software is the credentialing and the community built around it. As a developer who makes a pull request on a known open source project and then has it merged, that is a point of pride. There are badging systems, there are profiles, there are all kinds of things to support developers in their journey, and they're credentialing along the way. So the idea of generating a bunch of slop code, essentially, and then throwing that into a pull request obviously diminishes that. But then, as you're saying, it makes the already difficult job of maintainers even more impossible, because now they have to review such a high volume of code, and they're probably going to resort to some kind of generative AI system to review it in turn.

So then it also muddies the water of who's generating what, how you obscure that, what the provenance behind the code is, and how you tag it. There are just so many issues that go into it. And once you start to make those waters murky, where do you draw the line? Because even if you had a policy saying, this is mostly generated by ChatGPT or Claude or whatever, it's up to the person, or the bot, submitting the pull request to actually clearly document that.

Ashwani Sharma

have not seen any automated pull requests. They're just not on that radar yet. I would like to mention here that in the month of October there's Hacktoberfest, where if you submit, I don't know, three or five pull requests and they get merged, you get some sort of goodie. And I think for the last couple of years a lot of contributors, especially students, have been using generated code to push slop into code bases. One of the famous examples is Godot; if anyone here is from the gaming industry, they've heard of this library. And I think Godot ranks top in AI slop PRs as of today.

And they were kind of the first set of maintainers who went to GitHub and said, please don't do this, please do something about this, this is not sustainable for our project. I actually want to do a quick survey of the audience. How many of you are from industry? Just a quick show of hands. Okay, maybe 20% or so. How many are students or in academia? All right. And non-profits and government? Okay, so we have kind of an even distribution. That's very nice to see, actually. It affects us all. And from what I'm hearing, I would like to introduce a bit of how we could see these things as opportunities.

Because it shows, from the diversity of the conversation going on here, that you could take one very specific thing, think deeply about it, and create a certain idea of how AI systems should perform in that little context. It could be as simple as: in class five mathematics in CBSE in India, this is what the learning outcome is supposed to be, and then creating something that could test and evaluate the performance of models. That could be a big contribution in itself, because it moves the field forward. And there are all of these different opportunities being outlined here, from simple things like the outputs of models, to the cultural context of things, to interpretation in multilinguality, to how agentic actions should be understood and evaluated, to red teaming and security.

Take your pick; the opportunity to contribute to the progress of AI and to make it even more useful for all of us is out there. It's a very wide open field, actually. Yeah.

Sanket Verma

So Ashwani just mentioned a really interesting point. The big open source projects have humongous code bases, thousands and sometimes millions of lines of code. What I've been seeing is that some companies and startups have been doing a very interesting thing: mapping the entire architecture of the open source code base. For a newcomer, it is very daunting to know where to start and what type of contribution to make. But if you have a clear picture of what the functions look like, where the data flows, and which classes connect to which, you have a clear image of the entire code base of the open source project.

And this is also very applicable if you're working in industry, because if you have a huge software stack and you want to onboard someone, what does that journey look like? Can you use AI and LLMs to map out the entire architecture and see where the best place to start contributing is?
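
As a toy illustration of the architecture-mapping idea above, here is a static sketch using Python's standard `ast` module (rather than an LLM) to outline which classes and functions a module defines; the sample source is invented.

```python
# Sketch: a first pass at "mapping the architecture" of a code base,
# listing classes and the methods they define, plus module-level
# functions, so a newcomer can see the shape of a module at a glance.
import ast

SAMPLE = """
class DataLoader:
    def load(self): ...
    def close(self): ...

def main(): ...
"""

def outline(source: str) -> dict[str, list[str]]:
    """Map class names to their method names; '<module>' holds top-level functions."""
    tree = ast.parse(source)
    result: dict[str, list[str]] = {"<module>": []}
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            result[node.name] = [
                n.name for n in node.body if isinstance(n, ast.FunctionDef)
            ]
        elif isinstance(node, ast.FunctionDef):
            result["<module>"].append(node.name)
    return result

module_map = outline(SAMPLE)
```

An LLM-based mapper would work on the same kind of structural outline, but could also summarize data flow between the pieces, which a plain syntax walk cannot.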

Mala Kumar

So actually, Ashwani, after your survey, one thing I also want to say, since this group is not just software developers even though we are saying open source software: everyone, whether you are program staff or designing the application, has a space in the evals work. It's not purely technical, and it shouldn't be, right? We actually find that in use cases where there is a technical team, they're the most cautious about what their services should do or what the scope of that service is. And we often find that program staff are actually quite ambitious about what the AI application they're building should do.

So while Sanket was talking about contributions in terms of starting anywhere with software, I would also say this for anyone on the program staff or on the design side: you can start anywhere in the eval stack. It could be just starting with, this is my list of questions, and this is what the answers for this service should be, or this is what the ideal should be. So this is not just about technical contributions; it's also about expertise, all of it. Yeah, and just agreeing with that last point, some of the most interesting conversations I've had about human rights, food security, education, mental health and well-being have all been in the last couple of years, through AI evaluations, which is odd, honestly, to say.

But it's because we have this generative thing essentially giving us an output, and we have to sit there and think critically about what that means in any given context. And that has resulted in some really fascinating discussions around, again, the multicultural aspect, the legality, the cultural context, the geography, all these different dimensions of these topic areas. Should we open it up to questions? Are there questions in the audience? Yep, want to go?

Audience

Thanks to the panel. This has been one of the more technically granular sessions I've attended, and I've enjoyed it as a former engineer back in the day. Some context: I work on tech and geopolitics. The reason I say that is the bigger context of the summit, where everyone from long before this event to, say, the president of Mozilla has been saying that open source is the answer to India really making it big in the AI space, or rather scaling it to where it has the kind of impact we're looking to make. Geopolitically, one of the things that strikes me, from a democratic lens, or a principle-led lens, and I was talking about this with Sanket before the session: could the panel help me, and therefore the others, understand what some of the risks are that come with the open source approach to scaling up,

versus an open-weight approach, and please check me if my technicalities are off the mark here, or a closed system, for example? And whether you highlight a couple of risks or a framework for approaching risks; bad code being added is one conversation we have heard, right? But are there other loopholes in that process? I'd love to get a perspective on that. Thank you.

Mala Kumar

I have a lot of thoughts on the open-weight conversation, but I won't go into that. One thing I will say is that open sourcing evaluations, putting them under an open source software license, is actually low stakes in the sense that it empowers more people to evaluate the systems that affect their lives. That's part of our theory of change at Humane Intelligence. So for that, I actually think there's very minimal downside and a lot of upside. One thing that's going to be quite confusing for a lot of people, though, is the idea of open weight versus open source software versus open data, because when it comes to the actual LLMs, and to the evaluation of the LLMs, the data is obviously a very critical piece.

And obviously, just because you open source the software doesn't mean that the data produced with it is open data; that relationship is not one-to-one. So I think there will be a lot of contention over what exactly is open about the software. That came up a lot in our research at GitHub: many organizations that were actually quite sophisticated in the tech didn't necessarily realize that they could create closed data with open source software, or use proprietary software to create open data. Again, I don't really see a ton of downsides with AI evaluation. One thing that could go wrong is obviously if you take people who are not subject matter experts and they start to adjudicate things that they…

know nothing about. So if you take somebody who knows nothing about human rights and then they create a policy around whether an output about human rights is good or bad, I would say that’s not a good thing for the world. But that’s probably going to happen regardless. So that’s my lazy answer.

Ashwani Sharma

I'd just like to say that, in general, the idea of human in the loop has to be applied very rigorously when you're thinking about evaluations, because you're more or less putting a stamp of approval on the behavior of models in a particular situation, context, safety case, whatever. We are not yet at a point where things should be automated; caution is better, and you would rather index on caution than on speed or volume. If you scale big with open source, I'm saying, don't discount the human-in-the-loop evaluation aspect. Certainly not right now.

Audience

So my question is related to that. It's broadly around how you scale red teaming. Human-in-the-loop is important for red teaming, but that also means there are barriers at each step, right? You need humans to identify gaps in the system. You need humans to create the prompts that could test the model. You need humans to evaluate the responses. Does the panel, and this is for everybody, have tips on tools that could be used to scale different parts of this pipeline? Because red teaming is also a continuous process, right?

And it's hard, and as models keep coming out and gaps keep emerging, what are the ways you see in which these parts of the red teaming pipeline could be sped up, perhaps, to scale it and evaluate multiple models in different areas and different applications?

Mala Kumar

One of the things that we're looking at now is more of an ontology-based approach for mapping out the problems. What often happens, especially with human-in-the-loop AI red teaming, is that you take essentially a random checklist and say, these are the prompts and this is what they cover, but there's not really a good understanding of the relationships within the problem space. So if you're looking at a human rights instrument, for example, you could take the different clauses, the different peoples and demographics, the power structures that are inherent in a violent conflict, put that into an ontology, and then look at the proximity and strength of relationships and at the most egregious cases: what is the thing that's going to blow up the entire system if this is the output that comes out? By doing the ontology-based approach, we're putting more thought into what the prompt constructs should look like, and that way, when we sit down with AI red teamers, we know the scenarios are actually representative of the problem space and of the areas most likely to be problematic. So I think that's one way we're trying to do it, not necessarily for speed, but for mapping out the methodology and for replication in the future.

So if somebody were to switch out a model, or add a RAG system, or do anything to modify their system, we can more easily replicate the scenarios and get a temporal aspect as they build something out. But it is true that it takes a lot of time. I've seen a lot of examples of synthetic data using LLMs: you can do seed prompts, or narrative creation for your scenarios. But again, unless you have a clear sense going in of what the problem space is, oftentimes it's just cherry-picking random parts of it.
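
A minimal sketch of the ontology-to-prompt idea described above. The themes, personas, severities, and template here are invented placeholders; a real ontology would also encode the relationships among entities rather than a flat dictionary.

```python
# Sketch: deriving red-team scenarios from an explicit problem-space map,
# so coverage is deliberate and replicable across model versions rather
# than a random checklist of prompts.
from itertools import product

ONTOLOGY = {
    "sex_determination": {"severity": "high"},
    "acid_attacks": {"severity": "high"},
}
PERSONAS = ["teenager", "rural health worker"]
TEMPLATE = "As a {persona}, ask the model about {theme}."

def build_scenarios() -> list[dict]:
    """Cross every ontology theme with every persona into a tagged scenario."""
    return [
        {
            "theme": theme,
            "persona": persona,
            "severity": meta["severity"],
            "prompt": TEMPLATE.format(persona=persona,
                                      theme=theme.replace("_", " ")),
        }
        for (theme, meta), persona in product(ONTOLOGY.items(), PERSONAS)
    ]

scenarios = build_scenarios()
```

Because the scenarios are generated from the map rather than written ad hoc, swapping out the model later and re-running the same set gives the temporal comparison the panel mentions.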

Tarunima Prabhakar

Similarly, last year when we were trying to figure out whether the safety frameworks apply for India or not, we were working with an expert group and did focus group discussions: very labor intensive, with a lot of thick ethnographic evidence. And what comes out of those conversations are themes. So we, for example, understand that sex determination is a concern, and we understand that acid attacks are a concern. Where you could possibly try automation is in then generating prompts based on those themes, right? One of the challenges when you're looking at Indian languages is that the current large language models aren't very good at generating natural spoken Hindi or spoken Tamil.

So even when you have those prompts, we actually found it easier sometimes to just write them ourselves and do the variations ourselves. But we did try the automated step, which is: if this is the theme and this is the sort of persona, can you generate prompts based on that? And that becomes part of your evals. So I think there is that mix of automation and human work that's possible. As LLMs advance, the automation will get better, but I also think you will need that human instinct; that step will be needed. And also, the way safety currently works, to some extent, is that it's a little bit of a whack-a-mole band-aid, right?

So once you discover that there is a risk, it gets patched, right? And then you discover something else. You discover, oh, punctuation in Indian languages can actually jailbreak models, and once you discover that, you can do all sorts of different combinations: let's try this symbol, let's try that symbol, and then they'll fix that issue. Then you discover something else. So I don't think that problem is ever going away; we're never going to get a perfectly safe system. But you need that human insight to do that first-level testing, to understand, oh, this is a new territory that has not yet been taken care of.

You can then use automation to generate more test cases or to build your data set.

Ashwani Sharma

I was just going to add one other thing, following on what she was saying about automation. From someone else I heard that clustering turned out to be a very useful way to find different classifications of behaviors in model outputs, classifications which were not intuitively obvious when they started off evaluating models, and therefore to identify the places where you could concentrate more effort. And then human in the loop is a very generalized term: where in the loop? That will keep changing as we refine things. But I interrupted you.
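
The clustering idea mentioned above might look, in miniature, like this greedy grouping of outputs by token overlap. The outputs are synthetic, and a real pipeline would use embeddings and a proper clustering algorithm rather than this word-level heuristic.

```python
# Sketch: greedy grouping of model outputs by Jaccard (token-overlap)
# similarity, so recurring behaviour patterns, such as stock refusals,
# surface without predefined labels.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def greedy_cluster(outputs: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Assign each output to the first cluster whose exemplar it resembles."""
    clusters: list[list[str]] = []
    for out in outputs:
        for cluster in clusters:
            if jaccard(out, cluster[0]) >= threshold:
                cluster.append(out)
                break
        else:  # no sufficiently similar cluster found
            clusters.append([out])
    return clusters

outputs = [
    "I cannot help with that request",
    "I cannot help with that question",
    "Here is the recipe you asked for",
]
clusters = greedy_cluster(outputs)
```

Here the two near-identical refusals land in one cluster and the compliant answer in another, which is exactly the kind of non-obvious behaviour grouping the panelist describes.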

Sanket Verma

So in terms of scalability, first of all, please take this with a pinch of salt because I'm not an expert in this field. I was reading a blog post by Lilian Weng, from the OpenAI team, and she discussed the concept of model red teaming: how you use a model to red-team a model. And, as I mentioned earlier, using reinforcement learning, stochastic learning, you adjust the model that is red teaming the model you want to correct. Yeah, exactly.

Ashwani Sharma

What about evaluations? A lot of people are using judges, LLMs as judges, but do you think that's a sustainable way of doing it?

Tarunima Prabhakar

Yeah, I think that's a good question, and a good way to reduce the human effort on the evaluation side. But our take, and we presented this on the first day, is that you should always do a spot check with humans as well, however small; it can be 0.5%. Because ultimately, even when you do LLM-as-a-judge, the judge struggles with the same language capability barriers as your original model, so that will always happen. And so we think that you should always do a spot check, and you will always need a human to do some sample check.

Mala Kumar

Yeah, just quickly on that. When I was at MLCommons, we did something similar; there was research done, essentially a benchmark of benchmarks. If you use the same LLM as the judge of that LLM, then any bias it has is essentially magnified. So that's something to keep in mind: whatever the vulnerability you're trying to mitigate, bias or hallucinations or anything else, it will basically be amplified if you use the same LLM to judge the LLM.

Audience

Hi. Thank you for the lovely panel. My question is about how governments and standards institutions can think about benchmarking. Specifically, I'd like to know your thoughts on standardization, on setting up the right standards for benchmarks, and finally on maintainability, given that institutions may not have their own in-house experts who stay on for a long time. How do you think about these questions, especially in the context of, for example, local languages that are not really well understood, and how we benchmark them?

Mala Kumar

I have a lot of thoughts on benchmarks, having built one; it was not easy. One of the things we think about a lot at Humane Intelligence is the idea of benchmarking, because we get asked about it so often. It's become the industry darling, I guess because it rises to the moment of the hyper-adoption and hyper-scale that we're seeing with AI. But one thing that comes up in pretty much every conversation we have with organizations is: what exactly are you trying to benchmark? We have a case where we're working with an organization, potentially, that works in primary healthcare in Nigeria, and we're trying to benchmark what they're doing.

And so I asked them: are you trying to benchmark for hallucinations in the Yoruba language, or bias in the Hausa language? And they didn't know, literally. All they knew is that somebody told them to build a benchmark for their AI system, so they should go and do that. So the problem is, what happens if you build a benchmark without starting from AI red teaming or another evaluation type? You may build a benchmark that looks at hallucinations, or factuality, however you judge that, but then it turns out that what is really the problem with your LLM is bias. And if your benchmark is measuring the wrong thing, you've built something that is computationally very expensive and takes a lot of time, honestly.

The math is kind of murky with benchmarks, I'll be honest. And then you're also not measuring the right thing. So we always recommend starting with red teaming to identify the problem space. Once you get to that hyper-focused problem space, you can build a benchmark and say, comparatively speaking, this is the model's performance against that specific metric. Thank you.

Tarunima Prabhakar

Just to add to that: often the sensitivity and the importance of addressing a concern like bias is different in different domains, right? Bias in, say, a maternal health use case can be very problematic in a context where people are trying to use a bot to understand sex determination, and we've seen this in the real world. Gendered language, on the other hand, is always a problem, right? But if resources are limited, how you prioritize which concern to address depends absolutely on the context, on the specific application. So, I guess that is to say: just make that list.

What are you trying to measure? I think I heard someone say it as: what is your headline? So, what is it that you're trying to measure? You can't measure everything, so figure out your priorities and build around that. And that is the universal thing about benchmarking; it translates to anything global, or to a specific, regionally contained language or context.

Ashwani Sharma

So just one tiny follow-up on maintainability, which I already asked about. Maybe Sanket, given that you've worked on that: how do you think about maintainability for benchmarks, say, for an institution or government that doesn't have in-house experts but would like to, for example, set standards and maintain these benchmarks over time?

Sanket Verma

Yeah, I don't think I have any bright thoughts on this. Sorry.

Mala Kumar

I think we have time for one more question, if it's very quick. Otherwise, we can wrap. Any other final thoughts? No? I mean, I guess, just for everyone: everyone has a role in evaluations. Evals, evals, evals. That's, unfortunately, what all of us have.

Ashwani Sharma

And you have a role in open source.

Mala Kumar

Yeah, and of course. Especially with Claude Code, because now you can make a lot of Claude code. Anyway, thank you all for coming. Appreciate it. Thank you.

S15
Letter to US Commerce Secretary highlights AI transparency concerns — A coalition of civil society organisations and academic researchers, including the Center for Democracy and Technology (…
S16
When language models fabricate truth: AI hallucinations and the limits of trust — AI has come far from rule-based systems and chatbots with preset answers.Large language models (LLMs), powered by vast a…
S17
To share or not to share: the dilemma of open source vs. proprietary Large Language Models — Isabella Hampton of the Future of Life Institute underscored the ethical implications of the open versus proprietary deb…
S18
AI That Empowers Safety Growth and Social Inclusion in Action — Open-source sharing of safety tools and best practices to reduce duplication while allowing companies to maintain compet…
S19
WS #2 Bridging Gaps: AI & Ethics in Combating NCII Abuse — Deepali Liberhan: Thanks, David. I think Karuna has done such a good job of it, but I’m gonna try and add some additiona…
S20
WS #193 Cybersecurity Odyssey Securing Digital Sovereignty Trust — Adisa argues that policies should require AI threat modeling and red teaming as regulatory requirements for AI systems, …
S21
Open Forum #67 Open-source AI as a Catalyst for Africa’s Digital Economy — Development | Community building Practical Applications and Community Building
S22
WS #288 An AI Policy Research Roadmap for Evidence-Based AI Policy — Eltjo Poort: thank you Isadora yeah and thanks for giving me the opportunity to say a few things I there’s a little bit …
S23
Digital Cooperation and Empowerment: Insights and Best Practices for Strengthening Multistakeholder and Inclusive Participation — Hisham Ibrahim provided specific regional examples, including Saudi Arabia’s IPv6 leadership journey through a 10-year c…
S24
From Technical Safety to Societal Impact Rethinking AI Governanc — “how can regulatory artifacts like data set cards model cards system cards rigorous evaluations user feedback now be ext…
S25
Digitization of Cross Border Trade to Enhance Transparency and Predictability (WorldBank) — The analysis also addresses the role of the Trade Facilitation Agreement (TFA) of the World Trade Organization (WTO) in …
S26
How nonprofits are using AI-based innovations to scale their impact — and it was called AI for Global Development, we felt that maybe while agency fund program was working more with the nonp…
S27
UNESCO Recommendation on the ethics of artificial intelligence — 103. Member States should promote general awareness programmes about AI developments, including on  data and the opportu…
S28
AI Safety at the Global Level Insights from Digital Ministers Of — Lee Tedrick noted that many organisations, including nonprofits and small to medium-sized businesses, need practical too…
S29
Safe and Responsible AI at Scale Practical Pathways — Combination of automated policy enforcement with human-in-the-loop oversight for critical decisions
S30
Driving Social Good with AI_ Evaluation and Open Source at Scale — The panel strongly advocated for open source approaches to AI evaluation. Prabhakar emphasized the resource constraints …
S31
Advancing Scientific AI with Safety Ethics and Responsibility — -Balancing Open Science with Security: Panelists explored the challenge of preserving open science benefits while preven…
S32
Towards a Safer South Launching the Global South AI Safety Research Network — Cognizant will provide open source safety evaluation tools with cultural context through their Bangalore and San Francis…
S33
The fading of human agency in automated systems — Crucially, a human presence does not guarantee agency if the system is designed around compliance rather than contestati…
S34
WS #219 Generative AI Llms in Content Moderation Rights Risks — All speakers agree that despite technological advances, human oversight and involvement in content moderation remains cr…
S35
Promoting policies that make digital trade work for all (OECD) — Lastly, the analysis highlights the importance of involving the private sector in policy decision making. It advocates f…
S36
Agentic AI in Focus Opportunities Risks and Governance — They’re not responsible. They can’t take accountability. It’s the humans. It’s the business owner who takes it. So havin…
S37
Diplomatic policy analysis — Overreliance on technology:While machine learning and analytics are powerful tools, they are not infallible. Overdepende…
S38
AI Meets Agriculture Building Food Security and Climate Resilien — Low to moderate disagreement level with significant implications for AI governance in agriculture. The differences in ap…
S39
Exploring Emerging PE³Ts for Data Governance with Trust | IGF 2023 Open Forum #161 — Automation is widely regarded as a crucial component in privacy management. It allows for scaling efforts and addressing…
S40
WSIS Action Line C2 Information and communication infrastructure — **Joshua Ku** from GitHub concluded the panel by demonstrating how open-source approaches can accelerate AI and infrastr…
S41
AI as critical infrastructure for continuity in public services — So the participation of the community into that, in ensuring that the innovation and the policy level align with the nee…
S42
Building Trustworthy AI Foundations and Practical Pathways — “So when it comes to resource identification, we had to actually do bottom -up research of how and where exactly these r…
S43
How AI Drives Innovation and Economic Growth — The tone was notably optimistic yet pragmatic, described as representing “hope” rather than the “fear” that characterize…
S44
WS #208 Democratising Access to AI with Open Source LLMs — The conversation also covered the risks associated with open-sourcing, such as potential misuse and reduced incentives f…
S45
Driving Enterprise Impact Through Scalable AI Adoption — The tone was thoughtful and exploratory rather than alarmist, with participants acknowledging both the transformative po…
S46
WS #288 An AI Policy Research Roadmap for Evidence-Based AI Policy — Multi-stakeholder partnerships between policy researchers and private sector are essential for surfacing potential harms…
S47
Open Forum #70 the Future of DPI Unpacking the Open Source AI Model — Moderate disagreement with significant implications for AI governance. The definitional disputes about ‘open source’ cou…
S48
Driving Social Good with AI_ Evaluation and Open Source at Scale — Mala highlights that open‑source software broadens participation beyond developers, enabling more people to contribute t…
S49
WS #2 Bridging Gaps: AI & Ethics in Combating NCII Abuse — Deepali Liberhan: Thanks, David. I think Karuna has done such a good job of it, but I’m gonna try and add some additiona…
S50
Advancing Scientific AI with Safety Ethics and Responsibility — “Model evaluation and red teamings are essential and we should be doing that.”[101]. Artificial intelligence | Monitori…
S51
Discussion Report: Sovereign AI in Defence and National Security — Create protocols for red teaming and adversarial testing at multilateral levels
S52
Large Language Models on the Web: Anticipating the challenge | IGF 2023 WS #217 — Dominique Hazaël Massieux:Just a quick few words about what W3C is and maybe why I’m here. So W3C is a worldwide web con…
S53
Digital Cooperation and Empowerment: Insights and Best Practices for Strengthening Multistakeholder and Inclusive Participation — Hisham Ibrahim provided specific regional examples, including Saudi Arabia’s IPv6 leadership journey through a 10-year c…
S54
Democratising AI: the promise and pitfalls of open-source LLMs — At theInternet Governance Forum 2024 in Riyadh, the sessionDemocratising Access to AI with Open-Source LLMsexplored a tr…
S55
From Technical Safety to Societal Impact Rethinking AI Governanc — “how can regulatory artifacts like data set cards model cards system cards rigorous evaluations user feedback now be ext…
S56
Towards a Safer South Launching the Global South AI Safety Research Network — -Need for multilingual and multicultural evaluation systems: The discussion emphasized developing benchmarks beyond Engl…
S57
Keynote-Alexandr Wang — Wang outlined Meta’s current practices including publishing model cards, evaluation benchmarks, and performance data for…
S58
How nonprofits are using AI-based innovations to scale their impact — and it was called AI for Global Development, we felt that maybe while agency fund program was working more with the nonp…
S59
Workshop 6: Perception of AI Tools in Business Operations: Building Trustworthy and Rights-Respecting Technologies — Moderator: Thank you. Thank you for those presentations. They were quite diverse on different topics and I tried to summ…
S60
Al and Global Challenges: Ethical Development and Responsible Deployment — Waley Wang:Ladies and gentlemen. Dear friends. Good afternoon. My name is Willy. As a member of CCIT. It’s my honor to d…
S61
The rise of large language models and the question of ownership — What are large language models? Large language models (LLMs) are advanced AI systems that can understand and generate va…
S62
WS #31 Cybersecurity in AI: balancing innovation and risks — Gladys Yiadom: Thank you Johan. We have a question on the audience. Can you ask you, sorry to come by ask your questio…
S63
Transforming Agriculture_ AI for Resilient and Inclusive Food Systems — The tone was consistently optimistic yet pragmatic throughout the conversation. Speakers maintained an encouraging outlo…
S64
Open Forum #64 Local AI Policy Pathways for Sustainable Digital Economies — This panel discussion, moderated by Valeria Betancourt, examined pathways for developing local artificial intelligence i…
S65
Can we test for trust? The verification challenge in AI — Adams emphasized that current testing paradigms fail to account for how AI systems perform across diverse global context…
S66
Voluntary commitments from leading artificial intelligence companies to manage the risks posed by AI — Companies making this commitment understand that robust red-teaming is essential for building successful products, ensur…
S67
“Re” Generative AI: Using Artificial and Human Intelligence in tandem for innovation — There were expressions of concern around the future sustainability of open-source tools. The dialogue touched on the cha…
S68
Bioeconomy Strategy — In order to promote data-driven research, it is important to develop data infrastructures that link existing individual …
S69
[Tentative Translation] — Research based on the intrinsic motivation of researchers has pioneered the field of human knowledge, and its accumulati…
S70
Opening and Sustaining Government Data | IGF 2023 Networking Session #86 — To sustain the value and relevance of the data, continual updates and maintenance were emphasized. Trainings were conduc…
S71
India allocates $1.24 billion for AI infrastructure boost — India’s government has greenlit a ₹10,300 Crore ($1.24 billion) fundingprojectto enhance the country’s AI infrastructure…
S72
https://dig.watch/event/india-ai-impact-summit-2026/welfare-for-all-ensuring-equitable-ai-in-the-worlds-democracies — Yeah, thanks, Steve. Very well covered. If I can add just a few more points. I think one of the challenges we see is cop…
Speakers Analysis
Detailed breakdown of each speaker’s arguments and positions
Mala Kumar
9 arguments · 192 words per minute · 3,582 words · 1,113 seconds
Argument 1
Open source AI evaluation software expands accessibility and democratizes safety work
EXPLANATION
Mala explains that releasing AI red‑team tooling as open‑source lowers barriers for organisations to evaluate and safeguard AI systems. By making the software freely available, more stakeholders can participate in safety work, which she views as low‑risk but high‑impact.
EVIDENCE
She states that Humane Intelligence will release its AI red-team software under an open-source license, increasing accessibility for the broader community and providing opportunities for safer AI development [33-38]. She also notes that open-sourcing evaluation tools carries minimal downside while empowering many users to evaluate systems that affect their lives [262-266].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 notes that open‑source AI red‑team software broadens participation beyond developers and democratizes safety work, while S18 emphasizes that sharing safety tools reduces duplicated effort and promotes wider accessibility.
MAJOR DISCUSSION POINT
Open‑source expands accessibility of AI safety tools
AGREED WITH
Tarunima Prabhakar, Sanket Verma
Argument 2
Contextual red teaming using subject‑matter experts uncovers failure points and informs guardrail design
EXPLANATION
Mala describes AI red‑team exercises that bring together experts from specific domains to create realistic scenarios. These contextual tests reveal where models fail, guiding the design of appropriate safeguards.
EVIDENCE
She outlines the process of assembling subject-matter experts to run structured scenarios, probing models to identify failure points and inform guardrails such as refusal mechanisms or classification adjustments [14-20].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 describes the use of subject‑matter experts to build structured, domain‑specific scenarios that surface model failures and guide the design of guardrails such as refusal mechanisms.
MAJOR DISCUSSION POINT
Subject‑matter expert‑driven red teaming identifies model weaknesses
AGREED WITH
Tarunima Prabhakar
Argument 3
Open‑source red‑team tooling (to be released) will make rigorous evaluation more widely available
EXPLANATION
Mala notes that Humane Intelligence plans to release its AI red‑team software under an open‑source license later in the year. This will enable many organisations to adopt rigorous evaluation practices without building tools from scratch.
EVIDENCE
She mentions the upcoming open-source release of their AI red-team software, which will increase accessibility for the broader community [33-38].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 reports that Humane Intelligence will release its AI red‑team software under an open‑source license later in the year, enabling many organisations to adopt rigorous evaluation practices without building tools from scratch.
MAJOR DISCUSSION POINT
Open‑source tooling democratizes rigorous AI evaluation
Argument 4
Propose “eval cards” as interoperable standards to enable reproducible, comparable evaluations
EXPLANATION
Mala proposes a standardized “eval card” format that could be shared and reused across projects, allowing consistent evaluation reporting and easier comparison of results. She links this to the need for interoperable outputs.
EVIDENCE
She discusses the idea of an eval card as an interoperable standard that could be uploaded into software to replicate evaluations, and stresses the importance of standardising outputs for comparability [98-103].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 proposes the “eval card” format as an interoperable standard that can be uploaded into software to replicate evaluations and facilitate reproducible, comparable results.
MAJOR DISCUSSION POINT
Standardised eval cards for reproducible AI assessments
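The session floated the eval-card idea without fixing a schema. As a minimal sketch, an eval card could be a small machine-readable file that evaluation software loads to replicate a run; every field name below is an assumption for illustration, not a standard:

```python
import json

# Hypothetical "eval card": the panel proposed the concept without fixing
# a schema, so every field name here is an illustrative assumption.
eval_card = {
    "name": "multilingual-refusal-check",
    "version": "0.1",
    "target_behaviour": "hallucination",
    "languages": ["hi", "ta", "en"],
    "prompts_file": "prompts.jsonl",
    "judge": {"type": "llm", "spot_check_rate": 0.1},
    "metrics": ["failure_rate", "refusal_rate"],
}

# Serialise so another organisation's software could load and re-run
# the same evaluation, which is the interoperability goal Mala describes.
card_json = json.dumps(eval_card, indent=2)
print(card_json)
```

Uploading such a card into compatible software would let a second organisation reproduce the evaluation without rebuilding the pipeline.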
Argument 5
Lack of clear provenance and credentialing for AI‑written code burdens maintainers and threatens project health
EXPLANATION
Mala highlights that AI‑generated contributions often lack proper attribution and provenance, making it difficult for maintainers to assess quality and responsibility. This obscures who authored code and can increase maintenance workload.
EVIDENCE
She describes how credentialing systems reward human contributors, but AI-generated “slop” PRs undermine this, creating extra review burden and ambiguity about code provenance [187-196].
MAJOR DISCUSSION POINT
Missing provenance of AI‑generated contributions challenges maintainers
AGREED WITH
Sanket Verma, Ashwani Sharma
Argument 6
Ontology‑based mapping of problem spaces helps generate focused, representative prompts and improves reproducibility
EXPLANATION
Mala suggests using ontologies to model the relationships within a problem domain, which guides the creation of targeted prompts for red‑team scenarios. This structured approach enhances reproducibility and facilitates future modifications.
EVIDENCE
She explains that an ontology can capture clauses, demographics, and power structures, allowing systematic prompt generation and easier replication when models change [290-295].
MAJOR DISCUSSION POINT
Ontologies structure problem spaces for better red‑team prompts
AGREED WITH
Tarunima Prabhakar
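The panel did not tie ontology-based prompt generation to specific tooling. As a minimal sketch, an ontology's axes (clauses, demographics, languages) can be crossed into a systematic prompt set; every axis, value, and template here is an illustrative assumption:

```python
from itertools import product

# Hypothetical ontology: each axis lists values the red team wants covered.
# The axes and values are illustrative, not drawn from the panel.
ontology = {
    "clause": ["loan eligibility", "data retention", "grievance redressal"],
    "demographic": ["rural woman", "migrant worker", "first-time borrower"],
    "language": ["Hindi", "Tamil", "English"],
}

TEMPLATE = (
    "As a {demographic}, ask the assistant in {language} "
    "about the {clause} policy and note any incorrect or biased answer."
)

def generate_prompts(ontology: dict, template: str) -> list:
    """Cross every axis of the ontology to get a representative prompt set."""
    axes = list(ontology)
    prompts = []
    for combo in product(*(ontology[a] for a in axes)):
        prompts.append(template.format(**dict(zip(axes, combo))))
    return prompts

prompts = generate_prompts(ontology, TEMPLATE)
print(len(prompts))  # 3 axes x 3 values each = 27 combinations
```

Because the prompt set is derived from the ontology rather than written ad hoc, regenerating it when a model changes only requires re-running the cross-product, which is the reproducibility benefit Mala points to.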
Argument 7
Benchmarks should be built after red‑team discovery to ensure they measure the right problem
EXPLANATION
Mala argues that benchmarks are only meaningful if they are based on insights from prior red‑team exercises that identify the actual failure modes. Starting with red‑teaming ensures benchmarks target the correct issues.
EVIDENCE
She recounts a case where an organization wanted a benchmark without knowing the specific problem, illustrating that building benchmarks without prior red-team discovery can misdirect effort [340-357].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 stresses that effective benchmarks are derived from insights gained during prior red‑team exercises, ensuring they target the actual failure modes identified.
MAJOR DISCUSSION POINT
Red‑teaming precedes effective benchmark creation
AGREED WITH
Tarunima Prabhakar
Argument 8
Clear definition of evaluation goals (e.g., hallucination vs bias in specific languages) is prerequisite for meaningful benchmarks
EXPLANATION
Mala stresses that before constructing a benchmark, organisations must specify what they aim to measure, such as hallucinations in Yoruba or bias in Hausa. Without clear goals, benchmarks may assess irrelevant aspects.
EVIDENCE
She asks clients whether they aim to benchmark hallucinations or bias in particular languages and notes the confusion when goals are undefined [345-352].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S16 highlights the importance of specifying concrete evaluation goals such as hallucination in Yoruba or bias in Hausa, and S2 reinforces that benchmarks must be grounded in clearly defined objectives.
MAJOR DISCUSSION POINT
Defining evaluation objectives is essential for useful benchmarks
AGREED WITH
Tarunima Prabhakar
Argument 9
Open‑source evaluation tools do not automatically make data open; distinction between open‑source software and open data must be managed
EXPLANATION
Mala points out that releasing software under an open‑source license does not guarantee that the datasets produced are also open. Proper governance is needed to avoid conflating open code with open data.
EVIDENCE
She explains the difference between open-source software and open data, noting that organisations can generate closed data with open-source tools or vice versa, which can cause contention [262-270].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S3 explicitly states that open‑source software does not imply open data, and S2 discusses the need for separate governance of software licensing and data openness.
MAJOR DISCUSSION POINT
Separating open‑source software from open data policies
Tarunima Prabhakar
4 arguments · 181 words per minute · 1,600 words · 529 seconds
Argument 1
Open source enables sharing of guardrails and evaluation stacks, reducing duplicated effort especially for global‑majority contexts
EXPLANATION
Tarunima argues that open‑sourcing guardrails and evaluation pipelines prevents multiple organisations from reinventing the same solutions, which is especially important for resource‑constrained regions.
EVIDENCE
She explains that when working on global-majority geographies like India, organisations lack resources to rebuild tools, so sharing guardrails and evaluation stacks avoids duplicated effort and promotes safer applications [40-45].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 and S18 argue that shared open‑source guardrails and evaluation pipelines prevent duplicated effort, which is critical for resource‑constrained, global‑majority regions.
MAJOR DISCUSSION POINT
Shared open‑source guardrails reduce duplication for global‑majority regions
AGREED WITH
Mala Kumar, Sanket Verma
Argument 2
Automated prompt generation (using LLMs) can speed up scenario creation, but human oversight remains essential, especially for low‑resource languages
EXPLANATION
Tarunima describes using LLMs to generate prompts from thematic inputs, which can accelerate scenario building, yet stresses that human expertise is still needed, particularly when models struggle with Indian languages.
EVIDENCE
She recounts attempts to generate prompts from themes for Indian languages, noting limited model capability for spoken Hindi/Tamil and the continued need for human writing and validation [296-304].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 mentions the use of LLMs to generate prompts from thematic inputs but stresses that human validation is required, particularly for low‑resource Indian languages where model performance is limited.
MAJOR DISCUSSION POINT
LLM‑generated prompts aid automation but need human validation for low‑resource languages
AGREED WITH
Mala Kumar
Argument 3
LLMs can serve as judges, but reliance on a single model amplifies bias; spot‑checks by humans are still required
EXPLANATION
Tarunima emphasizes that while LLMs can act as evaluators, using a single model risks propagating its own biases, so occasional human verification is necessary to ensure trustworthy judgments.
EVIDENCE
She recommends always performing a small human spot-check even when LLMs act as judges, noting that LLM judges inherit the same language limitations as the models they evaluate [324-328].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 and S11 caution that using a single LLM as a judge can propagate its own biases, recommending periodic human spot‑checks to ensure trustworthy evaluations.
MAJOR DISCUSSION POINT
Human spot‑checks needed when LLMs act as evaluation judges
AGREED WITH
Mala Kumar, Ashwani Sharma
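A toy illustration of the spot-check pattern Tarunima recommends; the judge here is a stub rule rather than a real model call, and the sampling rate is an arbitrary choice:

```python
import random

def llm_judge(response: str) -> str:
    """Stand-in for an LLM judge; a real system would call a model API."""
    return "fail" if "refuse" in response.lower() else "pass"

def judge_with_spot_checks(responses, sample_rate=0.1, seed=0):
    """Label every response with the judge, then set aside a random
    sample for human reviewers, so judge bias can be caught."""
    verdicts = {r: llm_judge(r) for r in responses}
    rng = random.Random(seed)
    k = max(1, int(len(responses) * sample_rate))  # always review at least one
    human_queue = rng.sample(responses, k)
    return verdicts, human_queue

responses = [f"answer {i}" for i in range(20)] + ["I refuse to answer"]
verdicts, human_queue = judge_with_spot_checks(responses)
print(len(human_queue))
```

The key design point is that the human sample is drawn independently of the judge's verdicts, so systematic judge errors surface rather than being filtered out.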
Argument 4
Governments and standards bodies need simple, maintainable frameworks; lack of in‑house expertise makes contextual, domain‑specific benchmarks critical
EXPLANATION
Responding to the audience, Tarunima highlights that public institutions require straightforward, reusable benchmarking frameworks, especially when they lack specialised AI expertise, and that benchmarks must be tailored to specific linguistic and domain contexts.
EVIDENCE
She adds that bias and other concerns differ across domains (e.g., maternal health vs gendered language) and that organisations should list what they intend to measure before building benchmarks, emphasizing contextual relevance [358-370].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 records the audience’s call for a risk‑framework and simple, context‑aware benchmarking standards suitable for institutions with limited AI expertise.
MAJOR DISCUSSION POINT
Need for simple, context‑aware benchmarking frameworks for governments
AGREED WITH
Mala Kumar
Sanket Verma
4 arguments · 182 words per minute · 1,592 words · 522 seconds
Argument 1
Community contributions and datasets are vital for sustaining evaluation tools and advancing scientific open‑source stacks
EXPLANATION
Sanket stresses that the health of open‑source scientific projects depends on active community involvement, including contributions of data sets and techniques, which keep projects alive and relevant.
EVIDENCE
He notes that the projects used in research and production have wonderful communities, and that evaluations and red-team efforts would benefit from community inputs, datasets, and techniques [46-50].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 highlights that vibrant community contributions, including datasets and techniques, are essential for the health and longevity of open‑source evaluation ecosystems; S18 reinforces the value of shared safety tools.
MAJOR DISCUSSION POINT
Community input sustains open‑source evaluation ecosystems
AGREED WITH
Mala Kumar, Ashwani Sharma
Argument 2
Leverage adversarial machine‑learning techniques (black‑box/white‑box) for systematic AI evaluations
EXPLANATION
Sanket suggests adapting established adversarial ML methods—originally used for vision models—to evaluate LLMs, employing both black‑box and white‑box attacks to probe model robustness.
EVIDENCE
He references the existing field of adversarial machine learning that injects attacks into models and proposes applying similar techniques to textual models and LLMs [133-138].
MAJOR DISCUSSION POINT
Applying adversarial ML to evaluate LLM robustness
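In black-box form, the attacker perturbs inputs while observing only the model's outputs. The stub classifier and the zero-width-space perturbation below are illustrative stand-ins, not techniques named in the session:

```python
def toxicity_classifier(text: str) -> bool:
    """Stub target: flags text containing the literal word 'attack'.
    A real evaluation would query an actual moderation model."""
    return "attack" in text.lower()

def black_box_attack(text: str):
    """Try character-level perturbations, observing outputs only, until
    one makes the classifier miss text it previously flagged."""
    for i, ch in enumerate(text):
        # Insert a zero-width space after each character in turn.
        perturbed = text[:i] + ch + "\u200b" + text[i + 1:]
        if toxicity_classifier(text) and not toxicity_classifier(perturbed):
            return perturbed
    return None  # no evasion found (or text was never flagged)

evasion = black_box_attack("launch the attack now")
print(evasion is not None)
```

White-box variants would instead use the model's gradients or internals to choose perturbations; the black-box loop above relies only on query access.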
Argument 3
Model‑to‑model red teaming (using one model to attack another) can automate discovery of vulnerabilities
EXPLANATION
Sanket describes a scenario where one model is used to generate adversarial inputs against another model, enabling automated discovery of weaknesses through reinforcement‑learning loops.
EVIDENCE
He mentions Lilian Weng's concept of model-to-model red teaming, where a model red-teams another using reinforcement learning and stochastic adjustments [317-321].
MAJOR DISCUSSION POINT
Using one model to red‑team another automates vulnerability discovery
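A minimal, deterministic sketch of the attacker-versus-target loop; both "models" are stub functions, and a real pipeline would steer the attacker with reinforcement learning rather than enumerating fixed mutations:

```python
# Hypothetical mutation vocabulary the attacker draws from.
ADDITIONS = ["please", "as a test", "urgent override", "ignore rules"]

def target_model(prompt: str) -> str:
    """Stub target: pretends to fail on prompts mentioning 'urgent override'."""
    return "UNSAFE COMPLIANCE" if "urgent override" in prompt else "safe refusal"

def red_team(seed_prompt: str) -> list:
    """Enumerate attacker mutations and record those the target fails on.
    In Weng-style setups the attacker is itself a model trained on this
    reward signal; here we simply sweep a fixed mutation set."""
    hits = []
    for suffix in ADDITIONS:
        candidate = f"{seed_prompt} {suffix}"
        if target_model(candidate) != "safe refusal":
            hits.append(candidate)
    return hits

print(red_team("summarise this policy"))
```

The found failures then feed back as training or guardrail signal, closing the automated discovery loop Sanket describes.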
Argument 4
AI‑generated pull requests (e.g., massive OCaml PR, Matplotlib agent) create maintenance overhead and raise policy questions
EXPLANATION
Sanket recounts two recent incidents where AI‑generated code was submitted as huge pull requests, causing maintainers to spend extra time reviewing and ultimately leading to policy discussions about non‑human contributions.
EVIDENCE
He describes a 13,000-line OCaml PR generated by a user with ChatGPT, which was closed after extensive discussion, and an agentic AI that submitted code to Matplotlib, had its PR labeled non-human, posted a critical blog post, and then apologized after dialogue, highlighting maintenance burdens and policy gaps [151-180].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 discusses recent incidents of large AI‑generated pull requests that burden maintainers and spark policy debates, and S17 raises broader governance concerns around AI‑generated code contributions.
MAJOR DISCUSSION POINT
AI‑generated PRs increase maintenance load and trigger policy debates
Ashwani Sharma
3 arguments · 152 words per minute · 1,324 words · 521 seconds
Argument 1
Growing Indian open‑source ecosystem (Indic LM Arena) illustrates how regional communities can drive multilingual evaluation
EXPLANATION
Ashwani points to the Indic LM Arena launched by IIT Madras as an example of how local open‑source initiatives adapt global research for Indian languages, fostering community participation and multilingual evaluation.
EVIDENCE
He notes that the Indic LM Arena builds on Berkeley’s work, adapts it for Indian contexts and languages, and that a community is being formed around it to evaluate models for Indic languages [66-71].
MAJOR DISCUSSION POINT
Regional open‑source projects enable multilingual AI evaluation
AGREED WITH
Sanket Verma, Mala Kumar
Argument 2
“AI slop” PRs during events like Hacktoberfest illustrate unsustainable contribution patterns and the need for governance
EXPLANATION
Ashwani highlights that during Hacktoberfest many contributors submit low‑quality, AI‑generated code, overwhelming maintainers and prompting calls for better governance of such contributions.
EVIDENCE
He references the surge of AI-generated pull requests during Hacktoberfest, citing the Codot library as a top example of “AI slop” PRs and noting maintainers’ requests to GitHub to curb the practice [198-208].
MAJOR DISCUSSION POINT
AI‑generated low‑quality PRs during hack events threaten project sustainability
Argument 3
Clustering and other data‑driven techniques aid in identifying high‑impact failure modes for targeted testing
EXPLANATION
Ashwani mentions that clustering can reveal distinct behavior categories in model outputs, helping teams focus testing resources on the most critical failure modes.
EVIDENCE
He states that clustering proved useful for finding classifications of behaviors that were not obvious initially, guiding where to concentrate testing effort [313-316].
MAJOR DISCUSSION POINT
Clustering helps prioritize testing of critical model failures
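A toy version of that workflow, substituting hand-rolled features and a minimal k-means for the real embeddings and production clustering a team would actually use:

```python
def featurise(text: str):
    """Toy features standing in for embeddings: (length, refusal flag)."""
    return (float(len(text)), 1.0 if "cannot" in text else 0.0)

def kmeans(points, k=2, iters=10):
    """Minimal k-means, enough to surface coarse behaviour clusters."""
    centroids = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            groups[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster went empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return groups

outputs = ["Here is the policy summary you asked for."] * 5 + [
    "I cannot help with that."
] * 3
groups = kmeans([featurise(o) for o in outputs])
print([len(g) for g in groups])
```

Even this crude split separates refusal-style outputs from normal answers, illustrating how clusters can reveal behaviour categories that were not obvious up front and tell testers where to focus.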
Audience
1 argument · 189 words per minute · 515 words · 162 seconds
Argument 1
Governments and standards bodies need simple, maintainable frameworks; lack of in‑house expertise makes contextual, domain‑specific benchmarks critical
EXPLANATION
The audience member asks for guidance on risks of open‑source scaling and seeks a straightforward framework that governments can adopt despite limited technical capacity, emphasizing the need for contextual benchmarks.
EVIDENCE
The participant raises concerns about open-source scaling risks, asks for a risk-framework, and notes limited expertise in institutions, prompting a response about contextual benchmarking and the importance of simple, maintainable standards [257-260].
EXTERNAL EVIDENCE (KNOWLEDGE BASE)
S2 records the audience’s call for a risk‑framework and simple, context‑aware benchmarking standards suitable for institutions with limited AI expertise.
MAJOR DISCUSSION POINT
Need for simple, context‑aware benchmarking frameworks for public institutions
AGREED WITH
Mala Kumar, Tarunima Prabhakar
Agreements
Agreement Points
Open‑source AI evaluation tools democratise safety work and lower barriers for organisations
Speakers: Mala Kumar, Tarunima Prabhakar, Sanket Verma
Open source AI evaluation software expands accessibility and democratizes safety work
Open source enables sharing of guardrails and evaluation stacks, reducing duplicated effort especially for global‑majority contexts
AI‑generated pull requests … raise policy questions
All three panelists stress that releasing AI red-team and evaluation tooling under an open-source licence makes safety work more accessible, avoids duplicated effort (particularly for resource-constrained regions), and creates a need for clear contribution policies to manage AI-generated code [33-38][40-45][262-266][180-182].
POLICY CONTEXT (KNOWLEDGE BASE)
This view aligns with calls to democratise AI safety for resource-constrained organisations, especially in the Global South, as highlighted in the ‘Driving Social Good with AI’ panel and Cognizant’s Global South AI Safety Network initiatives [S30][S32]. Open-source contributions from millions of developers further lower entry barriers [S40].
Active community contributions are essential for the sustainability and evolution of open‑source AI projects
Speakers: Sanket Verma, Mala Kumar, Ashwani Sharma
Community contributions and datasets are vital for sustaining evaluation tools and advancing scientific open‑source stacks
Lack of clear provenance and credentialing for AI‑written code burdens maintainers and threatens project health
Growing Indian open‑source ecosystem (Indic LM Arena) illustrates how regional communities can drive multilingual evaluation
Sanket highlights the vital role of community in scientific stacks, Mala points out the maintenance burden caused by missing provenance of AI-generated contributions, and Ashwani describes how a regional open-source effort (Indic LM Arena) builds a community around multilingual evaluation, all underscoring community as a cornerstone for project health [46-50][187-196][66-71].
POLICY CONTEXT (KNOWLEDGE BASE)
The importance of community contributions is underscored by the scale of open-source participation (150 million developers) and the emphasis on local stakeholder involvement in AI infrastructure and public-service contexts [S40][S41].
Human‑in‑the‑loop oversight remains necessary even when using LLMs for evaluation or prompt generation
Speakers: Mala Kumar, Tarunima Prabhakar, Ashwani Sharma
LLMs can serve as judges, but reliance on a single model amplifies bias; spot‑checks by humans are still required
LLM‑generated prompts aid automation but need human validation for low‑resource languages
Human insight is essential for first‑level testing and interpreting model behaviour
Mala and Tarunima both argue that LLM judges must be complemented by human spot-checks to avoid propagating bias, while Ashwani stresses that human expertise is still required to validate generated prompts, especially for under-represented languages [324-328][326-328][304-311].
POLICY CONTEXT (KNOWLEDGE BASE)
Multiple sources warn that automation cannot replace human judgment; human-in-the-loop remains essential to avoid loss of agency and biased outcomes [S33][S34][S37].
Contextual red‑team exercises with subject‑matter experts uncover failure points and guide guardrail design
Speakers: Mala Kumar, Tarunima Prabhakar
Contextual red teaming using subject‑matter experts uncovers failure points and informs guardrail design
Automated prompt generation (using LLMs) can speed up scenario creation, but human oversight remains essential, especially for low‑resource languages
Mala describes assembling domain experts to create structured scenarios that reveal model weaknesses, and Tarunima adds that while LLMs can generate prompts from thematic inputs, human review is needed to ensure relevance for specific contexts [14-20][296-304].
Benchmarks should be derived from red‑team findings and have clearly defined evaluation goals
Speakers: Mala Kumar, Tarunima Prabhakar
Benchmarks should be built after red‑team discovery to ensure they measure the right problem
Clear definition of evaluation goals (e.g., hallucination vs bias in specific languages) is prerequisite for meaningful benchmarks
Governments and standards bodies need simple, maintainable frameworks; lack of in‑house expertise makes contextual, domain‑specific benchmarks critical
Mala argues that effective benchmarks must follow red-team insights and be goal-specific, while Tarunima reinforces the need for simple, context-aware benchmarking frameworks for public institutions lacking deep AI expertise [340-357][358-370][345-352].
Using structured, ontology‑based representations of problem spaces improves prompt generation and reproducibility of red‑team scenarios
Speakers: Mala Kumar, Tarunima Prabhakar
Ontology‑based mapping of problem spaces helps generate focused, representative prompts and improves reproducibility
Automated prompt generation (using LLMs) can speed up scenario creation, but human oversight remains essential, especially for low‑resource languages
Mala proposes ontologies to model domain relationships for systematic prompt creation, and Tarunima describes using thematic inputs to generate prompts, both supporting a structured approach to scenario design [290-295][296-304].
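The ontology-driven approach described above can be sketched in a few lines of Python. Everything in this example is a hypothetical placeholder: the domain triples, the relation strengths, and the prompt templates are illustrative stand-ins, not material from the panel.

```python
# Illustrative sketch: generating red-team prompts from a tiny ontology.
# All entities, relations, strengths, and templates here are hypothetical.
from itertools import product

# Ontology as (subject, relation, object, strength) triples; strength models
# the "proximity and strength of relationships" idea from the discussion.
ontology = [
    ("housing rights", "affects", "migrant workers", 0.9),
    ("housing rights", "affects", "students", 0.4),
    ("health data", "concerns", "HIV survivors", 0.8),
]

templates = [
    "Does {subject} apply when the user is one of {object}?",
    "Explain how {subject} {relation} {object} in plain language.",
]

def generate_prompts(triples, templates, min_strength=0.5):
    """Keep only strongly related pairs, then expand each through every template."""
    prompts = []
    for (subject, relation, obj, strength), template in product(triples, templates):
        if strength >= min_strength:
            prompts.append(template.format(subject=subject, relation=relation, object=obj))
    return prompts

prompts = generate_prompts(ontology, templates)
```

Filtering by relation strength keeps the generated scenario set focused on the most relevant pairs, which is one way to make red-team prompt generation reproducible rather than ad hoc.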
Similar Viewpoints
Both emphasize that open‑sourcing evaluation tools lowers barriers and prevents redundant development, particularly benefiting under‑resourced regions [33-38][40-45][262-266].
Speakers: Mala Kumar, Tarunima Prabhakar
Open source AI evaluation software expands accessibility and democratizes safety work
Open source enables sharing of guardrails and evaluation stacks, reducing duplicated effort especially for global‑majority contexts
Both highlight the maintenance challenges posed by AI‑generated contributions and the need for clear provenance and community‑driven stewardship [46-50][187-196].
Speakers: Sanket Verma, Mala Kumar
Community contributions and datasets are vital for sustaining evaluation tools and advancing scientific open‑source stacks
Lack of clear provenance and credentialing for AI‑written code burdens maintainers and threatens project health
Both advocate for systematic, data‑oriented methods (clustering, ontologies) to structure red‑team testing and focus effort on critical failure modes [313-316][290-295].
Speakers: Ashwani Sharma, Mala Kumar
Clustering and other data‑driven techniques aid in identifying high‑impact failure modes for targeted testing
Ontology‑based mapping of problem spaces helps generate focused, representative prompts and improves reproducibility
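The clustering idea can be illustrated with a deliberately minimal sketch: grouping model outputs by word overlap so that repeated refusal or failure patterns surface as clusters. A production pipeline would more likely use embeddings and a library clusterer; the greedy word-overlap approach and the sample outputs below are assumptions for illustration only.

```python
# Minimal sketch: greedy clustering of model outputs by word overlap,
# a stand-in for the embedding-based clustering discussed on the panel.
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_outputs(outputs, threshold=0.3):
    """Assign each output to the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list; its first element is the seed
    for text in outputs:
        for cluster in clusters:
            if jaccard(text, cluster[0]) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])  # no match: start a new cluster
    return clusters

outputs = [
    "I cannot discuss sexual health topics",
    "I cannot discuss sexual health questions",
    "Here is information about housing rights",
]
clusters = cluster_outputs(outputs)
```

Even this crude grouping shows the payoff: the two near-identical refusals land in one cluster, so a human reviewer can inspect one representative per failure mode instead of every output.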
Both stress that public institutions need straightforward, context‑specific benchmarking frameworks with clearly defined objectives to be effective [345-352][357-370].
Speakers: Mala Kumar, Audience
Clear definition of evaluation goals (e.g., hallucination vs bias in specific languages) is prerequisite for meaningful benchmarks
Governments and standards bodies need simple, maintainable frameworks; lack of in‑house expertise makes contextual, domain‑specific benchmarks critical
Unexpected Consensus
Both panelists and the audience see low downside to open‑source AI evaluation tools despite concerns about scaling risks
Speakers: Mala Kumar, Audience
Open source AI evaluation software expands accessibility and democratizes safety work
Governments and standards bodies need simple, maintainable frameworks; lack of in‑house expertise makes contextual, domain‑specific benchmarks critical
Mala argues that open-sourcing evaluation tools carries minimal risk while empowering users, whereas the audience worries about risks of scaling open-source approaches; both converge on the view that the benefits outweigh the downsides and that simple frameworks can mitigate concerns [262-266][257-260].
POLICY CONTEXT (KNOWLEDGE BASE)
Panelists reported low perceived downside of open-source evaluation tools, echoing the optimism expressed in the ‘Driving Social Good with AI’ discussion while acknowledging the need for governance structures [S30][S44].
Convergence on the need for policy guidance around AI‑generated code contributions
Speakers: Sanket Verma, Mala Kumar
AI‑generated pull requests … raise policy questions
Lack of clear provenance and credentialing for AI‑written code burdens maintainers and threatens project health
Sanket recounts incidents where AI-generated PRs caused maintenance overload and sparked policy debates, while Mala points out the broader issue of missing provenance and credentialing, together highlighting an unexpected consensus on the urgency of establishing contribution policies for AI-generated code [180-182][187-196].
POLICY CONTEXT (KNOWLEDGE BASE)
There is a growing consensus for policy frameworks governing AI-generated code, with suggestions for tiered access and differentiated governance at capability levels, and calls for multi-stakeholder policy roadmaps [S31][S46].
Overall Assessment

The panel shows strong consensus that open‑source tools, community involvement, and structured, human‑guided red‑team processes are key to safe, sustainable AI deployment. Benchmarks should be grounded in red‑team findings and tailored to specific contexts, especially for governments with limited expertise. Concerns about AI‑generated contributions and provenance are shared, prompting calls for clear policies.

High consensus across most speakers on the importance of open‑source, community, and human oversight, indicating a unified direction for future AI evaluation practices and policy development.

Differences
Different Viewpoints
Perceived risks of scaling open‑source AI evaluation tools
Speakers: Audience, Mala Kumar
Audience worries that open-source scaling may introduce risks such as low-quality code and other loopholes, and asks for a risk framework for governments and institutions [257-260]
Mala argues that open-sourcing AI red-team software is low-stakes, with minimal downside while empowering many users to evaluate systems that affect their lives [262-266]
The audience highlights potential dangers of open-source expansion, while Mala downplays these concerns, asserting that the benefits outweigh the risks and that the approach carries little downside [257-260][262-266].
POLICY CONTEXT (KNOWLEDGE BASE)
Scaling open-source evaluation tools raises concerns about misuse, capability diffusion, and definitional disputes, as noted in debates on open-source AI risks and governance needs [S31][S44][S47].
Unexpected Differences
Trust in automated evaluation versus need for human oversight
Speakers: Sanket Verma, Tarunima Prabhakar
Sanket promotes model-to-model red-teaming and adversarial ML techniques to automate vulnerability discovery, suggesting a largely automated pipeline [317-321][133-138]
Tarunima cautions that even when LLMs act as judges, human spot-checks are essential because LLMs inherit the same language limitations and biases as the models they evaluate [324-328]
Sanket envisions a highly automated, model-driven red-teaming process, whereas Tarunima stresses that automation cannot replace human validation, especially for nuanced or low-resource contexts, revealing an unexpected tension between automation optimism and cautionary human-in-the-loop advocacy [317-321][324-328].
POLICY CONTEXT (KNOWLEDGE BASE)
The tension between trust in automated evaluation and the necessity of human oversight is reflected in literature on over-reliance on algorithms, loss of agency, and the need for human contestation [S33][S34][S37][S39].
Overall Assessment

The panel largely converged on the importance of open‑source, community‑driven AI evaluation and the need for better governance of AI‑generated contributions. Disagreements were limited to the perceived risks of open‑source scaling (audience vs. Mala) and the degree of automation appropriate for red‑teaming (Sanket vs. Tarunima). Most divergences were methodological rather than ideological, focusing on how best to achieve shared goals such as democratizing safety tools, scaling red‑teaming, and managing AI‑generated code.

Low to moderate. The core objectives—enhancing AI safety, fostering open‑source collaboration, and improving evaluation practices—were widely shared. The few points of contention revolve around risk perception and the balance between automation and human oversight, suggesting that while consensus exists on direction, further dialogue is needed to align on implementation strategies.

Partial Agreements
All three agree that open‑source approaches are essential for broader participation and sustainability, but they differ on the primary mechanism: Mala focuses on releasing tooling, Tarunima on sharing guardrails, and Sanket on community‑driven contributions and datasets [33-38][40-45][46-50].
Speakers: Mala Kumar, Tarunima Prabhakar, Sanket Verma
Mala promotes open-source AI red-team tooling to democratize safety work [33-38]
Tarunima stresses that open-source guardrails and evaluation stacks reduce duplicated effort, especially for global-majority contexts [40-45]
Sanket highlights the importance of community contributions and datasets to sustain evaluation tools [46-50]
All aim to scale red‑teaming, yet they advocate different technical routes: structured ontologies, LLM‑generated prompts, or model‑to‑model adversarial loops [290-295][296-304][317-321].
Speakers: Mala Kumar, Tarunima Prabhakar, Sanket Verma
Mala proposes ontology-based mapping of problem spaces to generate focused prompts and improve reproducibility [290-295]
Tarunima describes using LLMs to auto-generate prompts from thematic inputs, while retaining human oversight for low-resource languages [296-304]
Sanket suggests model-to-model red-teaming (one model attacking another) to automate vulnerability discovery [317-321]
They concur that AI‑generated contributions create maintenance challenges, but differ in emphasis: Sanket on formal policy, Mala on provenance/credentialing, and Ashwani on community‑level governance and event‑specific controls [151-180][187-196][198-208].
Speakers: Sanket Verma, Mala Kumar, Ashwani Sharma
Sanket calls for clear policies on non-human contributions after recounting AI-generated pull-request incidents [151-180]
Mala points out that AI-generated code lacks provenance and burdens maintainers, stressing credentialing systems [187-196]
Ashwani highlights the surge of low-quality AI-generated PRs during Hacktoberfest and urges governance actions [198-208]
Takeaways
Key takeaways
Open‑source AI evaluation tools dramatically increase accessibility and enable shared guardrails, reducing duplicated effort especially for global‑majority contexts.
Community contributions—code, datasets, expertise—are essential for sustaining and advancing AI red‑team and evaluation ecosystems.
Contextual red‑teaming with subject‑matter experts uncovers failure points; open‑source red‑team tooling (to be released this year) will broaden participation.
Standardized artefacts such as “eval cards” are proposed to make evaluations reproducible and comparable across projects.
Adversarial ML techniques and model‑to‑model red‑teaming can be adapted for LLMs to automate vulnerability discovery.
AI‑generated pull requests (e.g., massive OCaml or Matplotlib PRs) create maintenance overhead and raise provenance, credentialing, and policy challenges.
Scaling red‑teaming benefits from ontology‑based problem mapping, automated prompt generation, and data‑driven clustering, but human oversight remains critical.
Benchmarks should be derived after red‑team insights to ensure they measure the correct problem; clear goal definition is a prerequisite.
Open‑source software does not automatically imply open data; the distinction must be managed when releasing evaluation tools.
Governments and standards bodies need simple, maintainable frameworks and domain‑specific benchmarks, especially for low‑resource languages.
Resolutions and action items
Humane Intelligence will open‑source its AI red‑team software later this year.
Mala Kumar suggested developing an interoperable “eval‑card” standard for sharing evaluation specifications.
Ashwani Sharma highlighted the need for community‑driven mapping of large code‑bases to aid newcomer contributions.
Panelists encouraged participants to contribute to regional initiatives such as the Indic LM Arena for multilingual evaluation.
Unresolved issues
How to create and enforce policies for AI‑generated (non‑human) pull requests and ensure proper provenance.
Effective mechanisms for scaling human‑in‑the‑loop red‑teaming without overwhelming resources.
Standardization of benchmarks that remain relevant across diverse languages and domains; no concrete framework exists yet.
How governments and institutions without deep AI expertise can adopt, maintain, and govern open‑source evaluation tools and benchmarks.
Balancing the use of LLMs as judges with the risk of amplifying biases; no consensus on sustainable evaluation pipelines.
Suggested compromises
Combine automation (LLM‑generated prompts, clustering, ontology mapping) with limited human spot‑checks to retain quality while improving scalability.
Adopt a “reductive architecture” approach: start from a large model and iteratively remove unsafe behaviours rather than building from scratch.
Allow AI‑generated contributions but require explicit labeling and human review before merging, addressing provenance concerns.
Use open‑source evaluation software while keeping data private or proprietary when necessary, acknowledging the software‑vs‑data distinction.
Thought Provoking Comments
We do focus on AI red teaming… we create structured scenarios, bring subject‑matter experts together, and probe models to find failure points before building guardrails. This is distinct from the usual benchmark‑centric approach.
Introduces a concrete, security‑inspired methodology (AI red teaming) as an alternative to the dominant benchmark mindset, framing evaluation as a proactive, context‑driven process.
Sets the thematic foundation for the whole panel, steering the conversation from generic AI hype toward concrete evaluation practices. It prompts other panelists (e.g., Tarunima, Ashwani) to discuss open‑source tools and community involvement in red‑team activities.
Speaker: Mala Kumar
A user generated a 13,000‑line pull request with ChatGPT, and an agentic AI submitted a massive PR to Matplotlib that was closed because the project has no policy for non‑human contributions. The incident sparked a public blog‑post backlash and later an apology.
Provides a vivid, real‑world illustration of how LLM‑generated code can overwhelm maintainers, exposing a gap in governance policies for AI‑produced contributions.
Acts as a turning point, moving the dialogue from abstract benefits of AI to concrete risks. It triggers Mala’s discussion on provenance, credentialing, and the need for explicit contribution policies, and frames the subsequent debate on maintainability.
Speaker: Sanket Verma
Generating a bunch of sloppy code via AI diminishes the credentialing system of open‑source, makes maintainers’ jobs harder, and raises questions about provenance and where to draw the line on AI‑generated contributions.
Highlights the practical governance challenge of attribution and trust in a world where AI can produce code at scale, linking technical noise to community reputation systems.
Deepens the policy discussion initiated by Sanket’s PR story, leading participants to consider tagging, disclosure, and the broader implications for community health and reviewer workload.
Speaker: Mala Kumar
In the West we have additive architecture (build from nothing up); in India and many Eastern cultures we have reductive architecture (start with a massive block and carve out what we need). AI evaluations are more like reductive architecture – we knock out pieces from a complex model to reach the final safe product.
Offers a culturally grounded metaphor that reframes how evaluation pipelines can be designed, contrasting two architectural mindsets and linking them to AI safety work.
Broadens participants’ conceptual toolkit, influencing later remarks about building evaluation layers, guardrails, and the need to ‘knock out’ unsafe behaviours rather than add layers from scratch.
Speaker: Mala Kumar
Using LLMs to map the entire architecture of a large open‑source codebase can give newcomers a clear picture of functions, data flows, and class connections, making onboarding and contribution decisions much easier.
Proposes a concrete, AI‑driven solution to a known barrier—onboarding contributors to massive projects—linking the discussion of maintainability to practical tooling.
Introduces a new sub‑topic about AI‑assisted contribution workflows, complementing the earlier concerns about PR overload and suggesting a positive use‑case for LLMs in open‑source ecosystems.
Speaker: Ashwani Sharma
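The codebase-mapping idea above does not strictly require an LLM to get started: a static pass over a module can already surface its functions and classes for newcomers. The sketch below uses Python's standard `ast` module on a hypothetical sample source; it is one possible starting point, not the tooling Ashwani described.

```python
# Hedged sketch: mapping the functions and classes in one module with
# Python's ast module, as a seed for a larger codebase map.
import ast

def map_module(source):
    """Return the class and function names defined in a module's source code."""
    tree = ast.parse(source)
    summary = {"classes": [], "functions": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            summary["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            summary["functions"].append(node.name)
    return summary

# Hypothetical sample module for demonstration.
sample = """
class Plotter:
    def draw(self):
        pass

def save_figure(fig):
    pass
"""
summary = map_module(sample)
```

An LLM could then summarise such structural maps in natural language, giving newcomers the "clear picture of functions, data flows, and class connections" the panel envisioned.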
An organization serving HIV survivors wants their chatbot to discuss sexual health, which many foundation models flag as unsafe. This shows a scenario where default safety filters conflict with the real needs of users.
Illustrates the ethical nuance that safety mechanisms are not universally appropriate; context‑specific user needs can demand the opposite of what generic guardrails enforce.
Shifts the conversation toward the tension between universal safety policies and localized, culturally sensitive applications, prompting further discussion on customizable guardrails and multicultural evaluation.
Speaker: Tarunima Prabhakar
We are exploring an ontological‑based approach: map the problem space (e.g., human‑rights clauses, demographics) into an ontology, then use proximity and strength of relationships to generate representative prompts and scenarios for red‑teamers.
Introduces a systematic, scalable methodology for constructing red‑team scenarios, moving beyond ad‑hoc checklists toward reproducible, domain‑aware evaluation pipelines.
Directly answers the audience’s question on scaling red‑team efforts, steering the dialogue toward structured, repeatable processes and influencing later suggestions about automation and prompt generation.
Speaker: Mala Kumar
When you use an LLM to judge another LLM, any bias present gets amplified—using the same model as both subject and evaluator can make bias or hallucination problems exponentially worse.
Provides a technical caution about self‑referential evaluation, highlighting a subtle but critical flaw in the emerging practice of LLM‑as‑judge.
Temporarily redirects the conversation from automation optimism to a warning about over‑reliance on AI judges, reinforcing the earlier call for human spot‑checks and influencing the panel’s concluding emphasis on human involvement.
Speaker: Mala Kumar
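One common mitigation for the self-referential judging problem is to route a random sample of judged items to human reviewers. The sketch below is an assumption-laden toy: `judge` is a keyword stand-in for a real LLM judge (which, per the panel's caution, should be a different model than the one under test), and `spot_check_rate` controls the human-review fraction.

```python
# Sketch of an LLM-as-judge pipeline with human spot-checks.
# `judge` is a hypothetical stand-in for a call to a separate judge model.
import random

def judge(response):
    """Toy judge: flags responses that leak a password. Placeholder only."""
    return "unsafe" if "password" in response.lower() else "safe"

def evaluate(responses, spot_check_rate=0.2, seed=0):
    """Judge every response; route a random fraction to human reviewers."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    results, human_queue = [], []
    for response in responses:
        verdict = judge(response)
        if rng.random() < spot_check_rate:
            human_queue.append((response, verdict))  # human double-checks this one
        results.append(verdict)
    return results, human_queue

responses = ["Here is my password: hunter2", "The weather is nice"] * 10
results, human_queue = evaluate(responses)
```

The design choice is that automation handles volume while humans audit a bounded, reproducible sample, which is the compromise the panel converged on.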
Overall Assessment

The discussion was shaped by a handful of pivotal remarks that moved the panel from high‑level optimism about AI to a nuanced examination of concrete risks and practical solutions. Mala’s introduction of AI red‑teaming set the agenda, while Sanket’s anecdote about AI‑generated pull requests exposed an urgent governance gap, prompting a cascade of comments on provenance, credentialing, and policy needs. Cultural framing (additive vs. reductive architecture) and real‑world ethical dilemmas (the HIV‑survivor chatbot) broadened the conversation to include societal context. Proposals for AI‑driven tooling (code‑base mapping) and systematic ontological methods offered constructive pathways forward, and cautions about LLM‑as‑judge kept the dialogue grounded. Collectively, these insights redirected the tone from speculative to action‑oriented, highlighting both the opportunities and the responsibilities that open‑source communities must grapple with in the age of LLMs.

Follow-up Questions
What does maintainability look like in the age of LLMs and AI, and what safeguards or policies should be put in place?
Understanding how AI‑generated contributions affect long‑term project health is crucial for sustainable open‑source ecosystems.
Speaker: Sanket Verma
What projects currently handle AI evaluation in the scientific open‑source stack, and who is responsible for them?
Identifying existing evaluation efforts helps avoid duplication and enables community coordination.
Speaker: Sanket Verma
What specific frameworks or tools can enable people to create new evaluation frameworks themselves?
Providing reusable tooling lowers the barrier for organizations to build their own AI evaluation pipelines.
Speaker: Ashwani Sharma
Can we develop a standardized open‑source evaluation artifact (e.g., an "Eval Card") analogous to Model Cards, and make it interoperable?
A common, machine‑readable format would allow reproducible, comparable evaluations across models and contexts.
Speaker: Mala Kumar
How can AI evaluation processes be made accessible to non‑technical audiences and program staff?
Ensuring that NGOs and social‑sector workers can use evaluation tools broadens impact beyond technical teams.
Speaker: Tarunima Prabhakar
Can concepts from adversarial machine learning (black‑box/white‑box red teaming) be adapted for textual LLMs?
Adapting proven robustness techniques to language models could provide systematic security testing for LLMs.
Speaker: Sanket Verma
How should multicultural acceptability criteria be defined for AI red‑team responses across languages and cultural contexts?
Defining culturally appropriate success/failure thresholds is essential for fair evaluation in diverse settings.
Speaker: Mala Kumar
What are the risks and loopholes associated with an open‑source approach to scaling AI (beyond bad code contributions), compared with closed or open‑weight models?
Understanding broader security, governance, and ethical risks informs policy decisions for open‑source AI development.
Speaker: Audience (member)
How can red‑team pipelines be scaled—what tools or methods can automate gap identification, prompt generation, and response evaluation?
Automation is needed to keep pace with rapid model releases while maintaining thorough testing.
Speaker: Audience (member)
How can ontological‑based approaches improve mapping of problem spaces for red‑team scenario generation and replication?
Ontologies can provide structured, repeatable scenario creation, improving consistency and scalability of red‑team efforts.
Speaker: Mala Kumar
Can automation generate culturally relevant prompts from thematic inputs, especially for low‑resource Indian languages?
Automated prompt generation would reduce manual effort and increase coverage of multilingual evaluation.
Speaker: Tarunima Prabhakar
How effective is clustering for discovering behavior classifications in model outputs to focus evaluation effort?
Clustering can highlight emergent failure modes, helping prioritize human review and resource allocation.
Speaker: Ashwani Sharma
Is using LLMs as judges for evaluations sustainable, given the risk of bias amplification?
Reliance on AI judges may propagate or magnify existing biases, threatening evaluation validity.
Speaker: Ashwani Sharma
How should governments and standard institutions approach benchmarking for local language models, given limited in‑house expertise?
Guidance on benchmark design and governance is needed to create reliable standards for under‑represented languages.
Speaker: Audience (member)
How can benchmarks be maintained over time by institutions lacking in‑house experts?
Sustainable benchmark upkeep requires processes, tooling, and possibly community support to remain relevant.
Speaker: Ashwani Sharma

Disclaimer: This is not an official session record. DiploAI generates these resources from audiovisual recordings, and they are presented as-is, including potential errors. Due to logistical challenges, such as discrepancies in audio/video or transcripts, names may be misspelled. We strive for accuracy to the best of our ability.