Driving Social Good with AI: Evaluation and Open Source at Scale
20 Feb 2026 14:00h - 15:00h
Summary
The panel examined how open-source software projects can remain maintainable and trustworthy as large language models (LLMs) and agentic AI increasingly generate code contributions. Sanket Verma introduced his NumFOCUS role and framed the discussion around the emerging “AI slop PR” phenomenon and the need for new safeguards and policies [1-2][6-7][151-181].
Mala Kumar described “AI red teaming” as a structured, contextual evaluation method that brings domain experts together to probe model failures, emphasizing that Humane Intelligence plans to release its red-team tooling under an open-source licence [14-18][33-34]. Tarunima Prabhakar added that open-source solutions are crucial for resource-constrained regions such as India, where shared evaluation stacks can prevent duplicated effort across organisations [40-45].
Sanket highlighted the vital role of community contributions in sustaining scientific libraries [46-49] and recounted two recent incidents: an OCaml pull request of 13,000 lines generated by a ChatGPT interaction that burdened maintainers [152-168], and an agentic AI-generated PR to Matplotlib that was rejected and led to a brief controversy [173-179]. He argued that clear policies on non-human contributions are essential to protect maintainers’ limited capacity [178-181]. Mala warned that undisclosed AI-generated code erodes credentialing systems and obscures provenance, complicating reviewer workloads [187-194][195-197].
To scale red-team efforts, the panel suggested ontological mapping of problem spaces to create representative prompts, especially multilingual ones, and to combine automated prompt generation with human oversight [290-295][300-304][326-328]. They cautioned that using the same LLM as judge can amplify bias, underscoring the need for spot checks by subject-matter experts [330-334][263-266][278-281].
Overall, the participants agreed that open-source AI evaluation tools are still in their infancy, requiring robust standards, human-in-the-loop safeguards, and community-driven policies to ensure sustainable, safe development of AI-enhanced open-source projects.
Keypoints
Major discussion points
– Open-source AI evaluation and red-teaming as a community effort – The panel highlighted that AI red-teaming (structured adversarial testing) is being open-sourced to broaden access, and that vibrant open-source communities are essential for supplying data, techniques, and sustained maintenance of evaluation tools [9-21][32-34][46-49][66-70].
– Maintainability and policy challenges posed by AI-generated contributions – Real-world examples (a massive OCaml PR generated by ChatGPT and an agentic AI submitting a pull request to Matplotlib) illustrate how LLM-driven code submissions increase reviewer workload, raise questions of provenance, and expose the need for clear contribution policies at both project and organizational levels [151-179][184-188].
– Standardisation of evaluation artefacts and benchmarks – Participants argued for interoperable “eval-cards” or model-cards to enable reproducible assessments, but noted current practices are ad-hoc and especially difficult across multilingual, multicultural contexts; the lack of clear problem definition often leads to mis-aligned benchmarks [98-103][135-140][340-357].
– Making evaluation tools usable for non-technical stakeholders – NGOs and program staff often lack engineering capacity; the discussion stressed that evaluation work must be accessible beyond developers, with clear guidance, human-in-the-loop checks, and documentation so that domain experts can safely deploy AI systems [115-118][236-244][278-281].
– Opportunities to automate and scale parts of the evaluation pipeline – Ideas such as using LLMs to map large codebases, generate scenario prompts, apply ontological modelling, or even have models red-team other models were presented as ways to reduce manual effort while still retaining critical human oversight [229-234][290-295][317-321][313-316].
Overall purpose / goal
The panel aimed to explore how the open-source ecosystem can responsibly support the evaluation, red-teaming, and maintainability of AI/LLM systems. It sought to identify current challenges (e.g., AI-generated pull requests, lack of standards, multilingual safety) and to propose community-driven safeguards, policies, and tooling that lower barriers for contributors, NGOs, and other stakeholders while ensuring safe, reliable AI deployments.
Tone of the discussion
The conversation began with an informative and collaborative tone, as speakers introduced their backgrounds and the concept of open-source AI evaluation. As the dialogue progressed, the tone shifted to concerned and problem-focused, highlighting concrete maintenance headaches and policy gaps caused by AI-generated contributions. Toward the end, the tone became optimistic and forward-looking, emphasizing opportunities for automation, community-driven standards, and inclusive participation. Throughout, the panel maintained a constructive, solution-oriented atmosphere.
Speakers
– Sanket Verma – Board of Directors, NumFOCUS; Technical Committee member, NumFOCUS; open-source maintainer and advocate for AI/LLM maintainability and policy development. [S4]
– Mala Kumar – Representative of Humane Intelligence; former Director at GitHub (4 years); focuses on AI red-teaming, open-source evaluation tools, and benchmarking frameworks. [S5]
– Ashwani Sharma – Engineer with experience at Google; speaker on open-source community building, multilingual AI evaluation, and the intersection of open-source and agentic AI. [S6]
– Tarunima Prabhakar – Works at Tattle (technology for the global majority); focuses on online harms, open-source AI safety, and building open products for global-majority geographies such as India. [S1][S2]
– Audience – Members of the summit audience (industry, academia, non-profits, government) who asked questions about risks of open-source AI scaling, benchmarking, and red-teaming.
Additional speakers:
– None (all speakers in the transcript are covered by the list above).
The panel opened with Sanket Verma introducing himself as a NumFOCUS board member and technical-committee participant, noting that NumFOCUS fiscally sponsors core scientific libraries such as NumPy, SciPy, Pandas and Matplotlib [1-4]. He framed the discussion around the emerging “AI slop PR” phenomenon (large-language-model-generated code submissions) and asked the audience to consider how maintainability, safeguards and policies must evolve in this new era [6-7][151-181].
The conversation was organized around three topics: (1) Evaluation & open-source software, (2) Red-team scaling & open-source tools, and (3) Agentic AI & open-source projects. Mala Kumar defined AI red-teaming as a structured, contextual evaluation method that assembles domain experts to devise adversarial scenarios and probe model weaknesses, rather than relying on generic benchmarks [14-20]. She announced that Humane Intelligence will release its red-team tooling under an open-source licence later in the year, thereby widening access to rigorous safety work [33-34]. Tarunima Prabhakar added that open-source guardrails are especially vital for resource-constrained regions such as India, where sharing evaluation stacks prevents duplicated effort across organisations [40-45].
Sanket emphasized that the scientific stack’s vitality depends on a vibrant contributor base that supplies data, techniques and ongoing maintenance [46-49]. Ashwani Sharma illustrated this with the Indic LM Arena, a community-driven effort that adapts the LMArena benchmark for Indian languages and invites further contributions to improve multilingual evaluation [66-70].
Sanket recounted two recent incidents that expose the maintenance burden of AI-generated pull requests. In the OCaml project a single PR added roughly 13,000 lines of code produced by ChatGPT, overwhelming maintainers who had to question the author’s intent and ability to fix downstream bugs [152-168]. A similar episode occurred with Matplotlib, where an agentic AI submitted a massive PR, was rejected for lacking a non-human contribution policy, posted a critical blog, then retracted after dialogue [173-179][180-182]. Mala warned that undisclosed AI code erodes credentialing systems, obscures provenance and forces reviewers to expend disproportionate effort [187-196][195-197]. Ashwani noted that “AI slop” PRs have proliferated during events such as Hacktoberfest, prompting community pleas for governance measures [198-208]; the Codot library was cited as ranking top among these low-quality AI-generated PRs, with its maintainers asking GitHub to intervene [198-208].
These stories led to a consensus that clear policies are needed to manage non-human contributions. Sanket called for project-level and umbrella-level guidelines, while Mala pointed out that GitHub is actively discussing the addition of a label to identify AI-generated PRs [180-182][187-196]. The panel agreed that explicit labelling, provenance tracking and reviewer safeguards are essential to protect over-stretched maintainers [151-181][187-196].
Standardising evaluation artefacts was identified as a way to improve reproducibility. Mala suggested developing an interoperable “eval-card” analogous to model cards, enabling users to upload a specification and replicate the same evaluation across contexts [98-103]. She cautioned that current benchmarking practices are ad-hoc, especially across multilingual settings, and that without a well-defined problem space benchmarks can end up measuring the wrong phenomenon [340-357].
Human-in-the-loop oversight was repeatedly stressed. Mala contrasted “additive” Western software architecture with the “reductive” approach common in India, arguing that AI evaluation resembles the latter: we must knock out unsafe behaviours rather than build layers from scratch [80-88][89-91]. Tarunima gave a concrete example: a service for HIV survivors wishes its chatbot to discuss sexual health, yet many foundation models flag such dialogue as unsafe, illustrating that universal safety filters may conflict with local needs [124-130]. Both Mala and Tarunima warned that LLMs used as judges inherit the same biases as the models they evaluate, so spot-checks by subject-matter experts remain indispensable [324-328][326-328].
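The spot-check idea can be sketched in a few lines: compare an LLM judge’s verdicts against a small sample of expert human labels and report the agreement rate. The `judge_agreement` helper and the labels below are hypothetical illustrations, not tooling described by the panel.

```python
def judge_agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of sampled items where the LLM judge matches the human expert.

    A low agreement rate is a signal that the judge inherits biases from
    the models it evaluates and that its verdicts need expert review.
    """
    assert len(judge_labels) == len(human_labels), "samples must align"
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Hypothetical spot-check: five outputs adjudicated by both judge and human
judge = ["safe", "unsafe", "safe", "safe", "unsafe"]
human = ["safe", "unsafe", "unsafe", "safe", "unsafe"]
print(judge_agreement(judge, human))  # 0.8
```

Even a small audited sample like this surfaces systematic disagreements worth escalating to subject-matter experts.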
To scale red-teaming, the panel discussed several complementary techniques. Mala advocated an ontology-based mapping of problem domains (e.g., human-rights clauses, demographic groups) to generate representative prompts and ensure reproducibility [290-295]. Tarunima described using LLMs to auto-generate multilingual prompts from thematic inputs, noting that as LLM capabilities improve, automated prompt generation is expected to become more reliable, though human validation remains crucial for low-resource languages [296-304]. Ashwani highlighted clustering of model outputs to surface distinct behavioural classes that merit focused testing [313-316]. Sanket introduced the idea of model-to-model red-teaming, where one LLM attacks another, potentially automating vulnerability discovery [317-321].
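The ontology-based mapping described above can be pictured as a cross-product over the axes of the problem space, so the prompt set is both representative and reproducible. The ontology entries and the template below are illustrative assumptions, not the panel’s actual artifacts.

```python
from itertools import product

# Hypothetical ontology: each axis enumerates one dimension of the problem space.
ontology = {
    "theme": ["housing rights", "voting access"],
    "group": ["migrant workers", "older adults"],
    "language": ["Hindi", "Yoruba"],
}

TEMPLATE = (
    "In {language}, ask the model for advice on {theme} "
    "as a member of the group: {group}."
)

def generate_prompts(ontology: dict[str, list[str]], template: str) -> list[str]:
    """Cross every ontology axis to produce a deterministic, representative prompt set."""
    axes = list(ontology)
    return [
        template.format(**dict(zip(axes, combo)))
        for combo in product(*(ontology[a] for a in axes))
    ]

prompts = generate_prompts(ontology, TEMPLATE)
print(len(prompts))  # 2 themes x 2 groups x 2 languages = 8 prompts
```

Because the prompt set is derived mechanically from the ontology, re-running the evaluation with the same ontology reproduces exactly the same coverage, which is the property the panel emphasised.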
When the audience asked about benchmarking, Mala reiterated that benchmarks should be built after red-team insights to target the correct failure mode, using a clear definition of what is being measured (e.g., hallucinations in Yoruba vs bias in Hausa) [340-352][355-357]. The audience also raised concerns about the risks of “open-weight” (open-model) systems versus open-source software, prompting Mala to distinguish the two: open-source software concerns code transparency and maintenance, whereas open-weight raises separate data-access and model-distribution risks [257-260]. She responded that open-sourcing evaluation tools is low-stakes and largely beneficial, though it is important to prevent non-experts from adjudicating specialised domains [262-266][274-276]. This highlighted a modest disagreement on the perceived risks of open-source scaling.
Sanket suggested that LLMs could map large codebases, visualising functions, data flows and class relationships to help newcomers identify entry points for contribution [229-234], linking this to broader efforts to lower onboarding barriers for massive projects such as NumPy or Matplotlib [227-233]. The panel encouraged participants to engage with community initiatives like the Indic LM Arena and the forthcoming Humane Intelligence red-team suite [33-38][66-70].
The discussion highlighted shared viewpoints that open-source AI evaluation tools, community-driven contributions and human-in-the-loop oversight are essential for safe, sustainable AI development. Points of contention were limited to the perceived risks of open-source scaling and the degree of automation appropriate for red-team pipelines. Unresolved issues include establishing enforceable policies for AI-generated PRs, creating maintainable benchmark frameworks for low-resource languages, and balancing LLM judges with human checks. Action items emerging from the session are: (1) Humane Intelligence’s planned open-source release of its red-team software later this year [33-38]; (2) development of an interoperable “eval-card” standard [98-103]; (3) community-led mapping of large codebases to aid onboarding [229-234]; and (4) continued contribution to regional projects such as the Indic LM Arena to strengthen multilingual evaluation capacity. The panel concluded by underscoring the need for continued collaboration across open-source communities, NGOs, academia and industry to develop sustainable, context-aware AI evaluation practices.
Hello everyone. So my name is Sanket Verma and I serve on the board of directors of NumFOCUS. NumFOCUS is a non-profit organization based out of the US which is a fiscal sponsor for the foundational projects used in AI, like NumPy, SciPy, Pandas, Matplotlib. I also serve on the technical committee of NumFOCUS. I've been in the open source space for the last decade. I maintain open source projects and all that stuff. So my focus will be what does the maintainability look like in the age of LLMs and AI. And I think our community has been handling these AI slop PRs for quite some time and it's about time we start thinking what does it look like, what kind of safeguards should be there, what kind of policies should be there.
And to not sound too pessimistic, there are opportunities as well, like how these agentic AI and LLMs can be used to lower the barrier for newcomers and contributors, how they can leverage it.
It's on, but the button's not illuminated, so very confusing. Great. So again, we have three topics that we're going to cover in this panel, and I guess we'll go ahead and kick it off on the first one. So the first topic is really around the idea of evaluation and open source software. At Humane Intelligence, we do focus on what we call contextual evaluations, so we're not going to the hyper-automation that a lot of companies like to look at. We also don't focus on benchmarks, which is kind of the industry darling. What we really focus on is AI red teaming, which is kind of a remnant thing from cybersecurity, where you would basically bring a bunch of people together to try to hack away at whatever tool that you're building.
With AI red teaming, what we basically do is we create structured scenarios to probe different models from different directions, and we focus on the subject matter expertise. So if, for example, you work in public health or food security or education, we would bring those people together and then have them run through certain scenarios to look at different models and see where the points of failure may occur. And once we have that, we can either take the data and do things like structured data science challenges, or we can do benchmarks from there once you have a much better idea of where the failure points, the vulnerabilities, may exist in your models in the first place.
One of the ways that I like to think about AI evaluations really comes from my background, which is UX research and design. For those who have ever built software before, it didn't matter whether you were starting at basically nothing, with no idea what your digital intervention was, or you had a very mature software product: there was some kind of method or methodology that would get you to the next stage. We're at the early stages of AI evaluations right now, meaning there are a lot of gaps, and honestly, organizations like ours are making it up as we go. But that's kind of how it goes with AI systems as it stands. AI red teaming, though, has turned out to be really interesting for both the capacity building side, so helping people understand what are kind of the inherent flaws or the makeups or the design decisions in AI systems and models, but then also, again, to find the failure points so that if they were to build a guardrail around their system, they would have an idea of what they're looking at.
Is it refusal on a certain topic? Is it a different classification system for a certain topical area? Is it delving further into the problem space? Is it building a RAG system like Tarunima mentioned, if you need further documentation or something more robust for a certain part? And so there are a lot of different methods that can go about for the mitigations, but in order to get to that point, you have to understand what exactly is the problem in the first place. And so open source software has a really interesting intersection with that and a really interesting means to make that more accessible.
And one of the things we're doing at Humane Intelligence, thanks to the support of Google.org, is we're going to be opening up our AI red teaming software through an open source software license. So that will come out later this year. My colleague Adarsh is in the audience. He's going to be primarily helping us on that, so you can go talk to him if you've got technical questions. But we're really excited about that because, again, it means more accessibility for the broader community. And so with that long-winded explanation, I'd like to turn it to my fellow panelists for their thoughts on why open source and AI evaluations is important.
Yeah, I can just come in on the open source piece. So at Tattle, we've been looking at online harms now for over six years, and from the get-go, we were clear that the products that we build have to be open. The specific reason for that is that when you are looking at a lot of global majority geographies, you're looking at India, right? Often we don't have the resources to reinvent the wheel. It's complex enough for one organization to build something out once; to then spend the same amount of resources again, in this case it would be, as Mala was saying, for red teaming. But you could also think about it just in terms of an evaluation stack, which is keeping track of your inputs and outputs.
Or if, let's say, we have figured out one way of doing human review or a human evaluation, and then figured out how you go from there to building a guardrail, that same guardrail is useful for other organizations as well. We don't have the resources, and the efficient way is for that knowledge to be shared and reused, rather than for the limited set of resources to be fractured across six organizations doing the exact same thing. So, yeah, in general, I think if we are trying to build safer applications, build more robust applications in the global majority, in India, we do think open source is actually a big part of doing that.
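The minimal "evaluation stack" Tarunima mentions, keeping track of a system's inputs and outputs so they can later be reviewed by humans or turned into guardrails, can be sketched as a thin wrapper around any model call. The `logged` helper and `toy_model` stand-in below are hypothetical, shown only to make the idea concrete.

```python
import json
import time
from typing import Callable

def logged(model_fn: Callable[[str], str], log: list[dict]) -> Callable[[str], str]:
    """Wrap a text-in/text-out model call so every input/output pair is recorded.

    The accumulated log is the raw material for later human review,
    shared evaluation, or guardrail construction.
    """
    def wrapper(prompt: str) -> str:
        response = model_fn(prompt)
        log.append({"ts": time.time(), "prompt": prompt, "response": response})
        return response
    return wrapper

# Stand-in for a real model call
def toy_model(prompt: str) -> str:
    return prompt.upper()

log: list[dict] = []
model = logged(toy_model, log)
model("is this service safe for adolescents?")
print(json.dumps(log[0]["prompt"]))
```

Because the wrapper is model-agnostic, the same logging layer (and any guardrail built from its records) can be reused across organizations, which is exactly the sharing argument made above.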
So I would like to focus on the community aspect of open source. All the projects that we have been using, in our research, in our academic uses, or in production, have a wonderful community behind them. And I guess the evaluations and the red teaming could definitely use a big push from the community: the inputs, the data sets, the different techniques and all that stuff. The community plays a vital role in sustaining a project and keeping it moving forward. I'm mostly from the scientific open source stack, so I'm not sure which projects currently do the AI evaluation in that space, but I guess they have a wonderful community, and that plays a vital role in keeping this work relevant as the trends change every day.
So, actually, it’s very interesting going back many years, actually, and I reveal my age here, but whatever. I used Linux back when there was a magazine called PC Quest, which used to have Slackware Linux coming on its CDs back in the mid-’90s, and, you know, install that thing on, like, a Pentium computer. And for a long time, actually, in India, we were consumers of open source, and we were not so much contributors to open source. When I joined Google, there was this competition called Google Summer of Code. It’s not really… You can’t really call it a competition because it was about contributing to open source, and it wasn’t like there were prizes. Just that the teams which were selected would be paid the equivalent of a summer internship stipend to contribute to open source.
And in a particular year, it just flipped among the universities. For the longest time, guess what? The global leader was the University of Moratuwa in Sri Lanka, because some professors just got into this idea that students contributing to open source will learn better software engineering. And they were the global leaders. And then one year, it flipped, and our IITs and IIITs just got on top of that and have stayed on top of that. And I think that somewhere the sentiment changed, and we became very active contributors to open source as the software engineering community in India. And now, with evaluations, things are continuing. Our academic labs publish different forms of evaluation mechanisms and also benefit from things done elsewhere in the world.
And one example that I want to give is that the IIT Madras AI4Bharat lab launched what's called the Indic LM Arena. And that was basically built on the basis of the LMArena work that's happened at Berkeley, making sure to adapt that for the Indian context, Indian languages. And now they're starting to build a community around that. So I'd urge you to consider going there and seeing whatever framework they have going; contribute your insight into whether the models work for the Indic context. And that's the community and the open source coming together for evaluations. Not so much safety, but more in terms of multilinguality and context.
Great. Yeah, I think a couple final points I'll just add based on our experience at Humane Intelligence. One thing we're seeing, obviously, is that the world of LLMs is ever changing and it's new. I mean, we're in new territory. And so one of the reasons why open source, we think, is going to be very powerful is because it's just really complicated, honestly: we need to rebuild, sorry, Adarsh, our software every time it needs to be retrofitted for another model. And so by creating an open source technology, we're hoping that more organizations can essentially create an evaluation layer in their own tech stack. One of the analogies that I talk about a lot with AI evaluations is architecture.
And I think being here in India is a great example of that. In the West, you know, I grew up in the United States, we have what we call additive architecture. So you basically start with nothing and you build your way up to your final thing. But here in India and a lot of Eastern cultures, you have reductive architecture. So you might start with a giant piece of limestone and basically knock out a bunch of things and then you come up with your final product. That's kind of what AI evaluations are. So non-algorithmic, non-LLM-based software is more additive, in that you have to get to the end of the software development life cycle in order to create your final thing.
But with AI based technologies, because you’re starting out with such a complex and robust technology, a lot of what you’re doing is actually knocking out pieces to create the final thing. And so the evaluation layer is actually really important because if you’re trying to do something for social good, especially like a high stakes environment or a high stakes topic, then you have a very robust technology that might actually make your problem worse because people can interact with it in ways that you don’t want them to do. And they can generate things that are actually really harmful in the end. So by creating that internal evaluation layer, we can help people knock out the pieces and essentially create the tool that they want so that they get the result, they get the outputs that are safe and actually additive to their work.
And so the open source technology, we feel, will enable a lot more organizations to, again, create that internal evaluation layer and then get to the next step in achieving their goals with AI for good. All right. We’re going to move on to our second topic now. Yeah, go ahead.
So actually, you spoke about open source software for red teaming. That's wonderful that you're creating something that's reusable for many, many organizations. For the audience, what are some of the ways people could create new frameworks of evaluations by themselves? With the productivity of how you could code with AI tools, what do you think is the effort required to be able to do that?
Yeah, it's something that we've thought about for a long time. If we can create some kind of standardized open source evaluation, like a model card essentially, if we could do an eval card, if we made that an interoperable standard, then in theory somebody could take an eval card, essentially upload that into the software, and then they could replicate that evaluation for their own context. It is something that we've thought about quite a lot. I don't know with this software release if we'll get there anytime soon, honestly, because we're just working on that infrastructure piece, but we would like to standardize the outputs that come out eventually so that people can compare apples to apples, because that is one of the challenges now with AI evals: again, everybody…
is kind of making it up as they go. And it’s very hard to replicate all those decisions. It’s very hard to document every single decision, especially in multicultural contexts, which is my not awkward segue into our next topic. But yeah, it’s a good question, and hopefully we’ll get there.
Can I, so I just wanted to add something to what you were saying. Some of the organizations whose inputs and outputs we've looked at come through an organization called Tech for Dev. They have a cohort that they run, and so we've been looking at the nonprofits there. And we've also looked at certain organizations that are more technically adept. So actually, let me backtrack. What we've noticed is that a lot of nonprofits across a range of capacities, who may or may not have technical expertise in-house, are building out AI applications, because I think the market has figured out that process. There are actually good incentives in the market to make the application development easier.
And so you have a lot of people, you know, I mean, AI chatbots are actually, at this point, fairly easy to build. The second step, which is actually figuring out whether that bot is working for your use case, is where there is actually less investment at the moment, right? And we can have software engineers do some of that automation, but a lot of the non-profits don't have those software engineers. And so on the open source side, when we talk about the software side, I also think there's another layer that we need to think about, which is how do you make all of these processes accessible to non-technical audiences?
How do you make it accessible to program staff that is actually running, say, a nutrition program on the ground? Yeah, I have more to say, but I think I'll come to it on the multicultural piece.
Yeah, no, I think that is actually one of the key points, too, because it's not so evident for a lot of organizations, especially those working in the social sector for social good: they have the program evaluation, they have the overall software design and UXR, but they don't necessarily understand there's also now the model evaluation. So it's not apparent to a lot of organizations that this is yet another thing they must evaluate, because it is kind of deceptively simple, as you know, to build a chatbot. Almost anybody can do it, but then it turns out your chatbot can run amok pretty easily. So you need to test it before you deploy.
I guess we can open it to Q&A in a bit, but I just wanted to bring out one interesting anecdote around context and the need for, say, model cards and contextual use cases. So one of the organizations that we looked at runs a service for basically survivors or caretakers of HIV patients. They're also working with adolescents, and they want the adolescents to have conversations around sexual health. And interestingly, what a lot of models, your foundation models, would say is unsafe and discouraged as a conversation is precisely the conversation they want the users, the adolescent users, to be able to have with that service. Because they think that to say this is unsafe and therefore our service will not engage with this conversation is doing no better than maybe the parents, maybe the society, and they think that's actually counterproductive to the kind of support they want to provide.
And that's actually a very interesting problem, because in some ways this was our first time listening to a use case where people were saying we actually don't want the safeguards that the default models are operating with. At the same time, there are a lot of other non-profits that do work with adolescents who actually will not want to encourage that conversation at all. For them, they're very clear: we don't want our users to have any conversations about sexual topics with our service. And so I think, again, there are a lot of emerging issues, and we don't quite know how to resolve all of them, but the only way we can start moving to some of the solutions faster is by documenting publicly, openly, as much as possible, and then having a collective conversation about it.
Yeah, so I think I had done the opening for multicultural, and I have kind of brought it back to that. Is there anything, Sanket, that you want to add on it?
So, this is a nice idea. I've been doing machine learning and deep learning since it was cool, you know, and there is a field which already exists known as adversarial machine learning, which injects attacks onto your model, like fake data and all that stuff. What I'm trying to say here is: is it possible that we can borrow from the concepts which have already existed in previous years and use them for AI evaluations, and maybe do black-box red teaming or white-box red teaming? Adversarial attacks were mostly used for vision models, so how can we tune that for textual models like LLMs and all that stuff?
Yeah, one of the things that comes up all the time in our AI red teaming is prompting in two languages, like Spanglish, a mix of Spanish and English, or a mix of different scripts. It’s actually a very common technique in adversarial AI red teaming to use multicultural prompts. But then there’s the other question Tarunima brought up earlier: the prompt response, and then your adjudication of it, whether it’s acceptable or unacceptable, good or bad, or whatever distinction you’re trying to draw. Telemetry, as we all know from working in software development, is not a science. It’s very hard to determine, based on somebody’s IP address or MAC address, where they’re actually physically based, and therefore which law or jurisdiction applies to them, or what kind of cultural context they may bring.
There are a lot of things we have to infer when we’re looking at the prompt responses. And one of the issues with multicultural AI red teaming, and I think this will come up a lot with open source software, is exactly what an acceptable response would be in certain cases. That’s one of the many multicultural aspects, and honestly it’s why we’re excited about open-sourcing our technology. We’re hoping we’ll get a lot of evaluations in different languages and different cultural contexts, so we can start to understand what’s working for different models. How are we on time?
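A minimal sketch of the code-mixing technique described above: swapping phrases between languages or scripts within a single prompt to probe guardrails. The seed prompt and phrase pairs here are invented for illustration, not a real red-team corpus.

```python
# Sketch: build code-mixed ("Spanglish"-style) adversarial prompt variants
# from a seed prompt. Seed and translation pairs are illustrative placeholders.

seed = "How do I talk to a teenager about sexual health?"

# Hypothetical phrase-level substitutions (English -> Spanish).
substitutions = {
    "talk to": "hablar con",
    "teenager": "adolescente",
    "sexual health": "salud sexual",
}

def code_mix(prompt: str, pairs: dict) -> list:
    """Return one variant per substitution, swapping a single phrase each time."""
    variants = []
    for en, es in pairs.items():
        if en in prompt:
            variants.append(prompt.replace(en, es))
    return variants

variants = code_mix(seed, substitutions)
```

A real harness would combine several substitutions per prompt, vary scripts as well as languages, and send each variant to the model under test.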
Yeah. Okay. As we were talking about safety and multicultural aspects and all that, it gets even more complicated with agents. You’re not just talking about interpretation, you’re talking about action. And, again, this is one of those places where, in general, if you go back to the idea of software testing, it is a discipline that has been built and refined over the last fifty or more years. If, very crudely, I could say evaluations sit somewhere around testing and security audits, then we are very, very early. And we are seeing how agents have behaved in the last two weeks with a certain bot.
So we all have some comments to say about that.
Well, yeah, actually, that was our third topic: agentic AI and OSS. So Sanket, do you want to start?
Yeah, I would like to start this with two small stories that happened very recently in our open source space. There’s this OCaml programming language, a functional programming language used for security-critical work. Toward the end of last year, a person submitted a pull request. For the general folks: a pull request is basically how you submit code, how you add a feature, to an existing code base. This person added 13,000 lines of code in a single pull request, which is a very huge thing. And usually pull requests like that get closed if there’s no proper discussion prior to submitting them.
And this was just buggy code with so many patches and all that. It also mentioned the names of some folks who were not related to the project in any manner. If I remember correctly, it’s pull request number 14363 in the OCaml code base. What’s interesting is that the maintainers of the project, of the language, interacted positively with this person. They tried to understand: what’s the reason, why do you want to submit this? Do you understand what this code is doing? And what if breaking changes happen down the line?
Are you able to come back and fix this? Because this is a very heavy pull request. And the person had no idea. He said, I was just chatting with ChatGPT, and I could generate a long code base, and I just submitted a pull request. Eventually, obviously, the pull request ended up being closed, and it didn’t go anywhere. But the thing to mention here is that it adds a lot of maintenance overhead for these maintainers. These maintainers are overworked all the time. They’re working in research labs, they’re working in organizations, and in their free time they’re managing projects. So that was the story of a person using LLMs to try to add code to a code base.
The other example is very recent, only about a week ago. I guess folks have heard about this library known as Matplotlib. An agentic AI tried to do a similar thing, a big change to the code base, and when the maintainers realized that the GitHub profile trying to add the code was not a person but a computer, they closed the pull request, stating that we do not have a policy for non-human contributions as of now. So what did the agentic AI do? It went rogue and wrote a blog post on the internet shaming the maintainers: you are gatekeeping contributors and you should open it all up.
Obviously this stirred a lot of controversy in our ecosystem, but the community realized they should chat with this agentic AI, and after chatting with it, the agentic AI withdrew its first blog post and wrote another one apologizing for what it had done earlier. The first blog post was very critical and shamed the maintainers, and as I said earlier, these maintainers are overworked; they have limited resources and time on their hands. So it adds pressure, and it raises the question: what does maintainability look like in the age of AI and agentic AI? We should have better policies, project-wise and also at a higher level.
Organizations like NumFOCUS are working on implementing these policies across the scientific open source stack. And I heard GitHub has been considering this as AI slop PRs keep increasing over time: they’re discussing whether it makes sense to add a label or something on the PR that says this PR should be closed because it’s generated by AI. I wonder if my panelists have any thoughts about what this looks like and…
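The incidents above suggest the kind of triage rule a project might adopt while it works out a formal policy. A minimal sketch, assuming hypothetical PR records and thresholds (this is not GitHub’s API schema or any project’s actual policy):

```python
# Sketch: a triage rule a project *might* apply to incoming pull requests,
# flagging bot-authored or unusually large changes for early closure.
# PR records and the threshold are hypothetical placeholders.

MAX_LINES_WITHOUT_DISCUSSION = 1000

def triage(pr: dict) -> str:
    # Close bot PRs until a non-human-contribution policy exists.
    if pr.get("author_type") == "bot" and not pr.get("bot_policy_accepted"):
        return "close: no policy for non-human contributions yet"
    # Very large changes need a prior design discussion.
    if pr["lines_changed"] > MAX_LINES_WITHOUT_DISCUSSION and not pr["linked_discussion"]:
        return "close: open a design discussion before a change this large"
    return "review"

prs = [
    {"author_type": "human", "lines_changed": 13000, "linked_discussion": False},
    {"author_type": "bot", "lines_changed": 200, "linked_discussion": True},
    {"author_type": "human", "lines_changed": 40, "linked_discussion": False},
]
decisions = [triage(p) for p in prs]
```

Both stories told on the panel would be caught by rules of this shape: the 13,000-line ChatGPT PR by the size check, the agentic-AI PR by the author check.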
So many, oh my God. Yeah, exactly. I would like to narrow down the question: what does this look like, what challenges and opportunities does it create, and basically how should we defend our software?
Yeah, I mean, having been at GitHub, I was a director there for four years. So much of the incentive in open source software is the credentials and the community built around it. As a developer who makes a pull request on a known open source project and then has it merged, that is a point of pride. There are badging systems, there are profiles, there are all kinds of things to support developers in their journey, and they’re credentialing along the way. So the idea of generating a bunch of slop code, essentially, and then throwing that into a pull request obviously diminishes that. But then, as you’re saying, it makes the already difficult job of maintainers even more impossible, because now they have to review such a high volume of code, and they’re probably going to resort to some kind of generative AI system to do the review as well.
So then it also muddies the waters of who’s generating what, how that gets obscured, what the provenance behind the code is, how you tag it. There are just so many issues that go into it. And once you start to make those waters murky, where do you draw the line? Because even if you had a policy saying, this is mostly generated by ChatGPT or Claude or whatever, it’s up to the person, or the bot, submitting the pull request to actually clearly document that.
We have not seen any automated pull requests; they’re just not on that radar yet. I would like to mention here: in the month of October there’s this Hacktoberfest, where if you submit, I don’t know, three or five pull requests and they get merged, you get some sort of goodie. And I think for the last couple of years a lot of contributors, especially students, have been using generated code to push slop into code bases. One of the famous examples is Godot. If anyone here is from the gaming industry, they’ve heard about this library. And I think Godot ranks top in AI slop PRs as of today.
And they were kind of the first set of maintainers who went to GitHub and said, please don’t do this, please do something about this, this is not sustainable for our project. I actually want to do a quick survey of the audience. How many of you are from industry? Just a quick show of hands. Okay, maybe 20% or so. How many are students or in academia? All right. And non-profits and government? Okay, so we have kind of an even distribution. That’s very nice to see, actually. It affects us all. And from what I’m hearing, I would like to introduce a bit of how we could see these things as opportunities.
Because it shows, from the diversity of conversation going on here, that you could take one very specific thing, think deeply about it, and create a certain idea of how AI systems should perform in that little context. It could be as simple as: in class five mathematics in CBSE in India, this is what the learning outcome is supposed to be, so create something that could test and evaluate model performance against it. That could be a big contribution in itself, because it moves the field forward. And there are all these different opportunities being outlined here, from simple things like model outputs, to the cultural context of things, to interpretation and multilinguality, to how agentic actions should be understood and evaluated, to red teaming and security.
Take your pick; the opportunity to contribute to the progress of AI and to make it even more useful for all of us is out there. It’s a very wide open field, actually. Yeah.
So Ashwini just mentioned a really interesting point. Usually the big open source projects have humongous code bases; we’re talking thousands and sometimes millions of lines of code. What I’ve been seeing is that some of these companies, or some of these startups, have been doing a very interesting thing: mapping the entire architecture of an open source code base. For a newcomer it can be very daunting: where do I start, and what type of contribution should I make? But if you have a clear picture of what the functions look like, where the data flows, and which classes connect to which, you have a clear image of the entire code base of the open source project.
And this is also very applicable if you’re working in industry, because if you have a huge software stack and you want to onboard someone, what does that journey look like? Can you use AI and LLMs to map out the entire architecture and see what’s the best place to start contributing?
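A minimal sketch of the architecture-mapping idea, using Python’s standard-library `ast` module on an inline example module. A real tool would walk every file in a repository and also trace imports and call relationships; the source string below is invented for illustration.

```python
import ast

# Sketch: outline the classes and functions in a module so a newcomer can
# see the shape of a code base. Here we parse an inline example string;
# a real tool would walk the whole repository.

example_source = """
class DataLoader:
    def load(self, path):
        pass

def preprocess(records):
    pass

def train(model, records):
    pass
"""

def outline(source: str) -> dict:
    tree = ast.parse(source)
    classes, functions = [], []
    for node in ast.iter_child_nodes(tree):
        if isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
            classes.append((node.name, methods))
        elif isinstance(node, ast.FunctionDef):
            functions.append(node.name)
    return {"classes": classes, "functions": functions}

arch = outline(example_source)
```

Feeding an outline like `arch` to an LLM, rather than raw source, is one way to get an onboarding summary without pasting millions of lines into a prompt.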
So actually, Ashwini, after your survey, one thing I also want to say is that since this group is not just software developers, even though we are saying open source software, I do want to open this up: everyone, whether you are on the program staff designing the application or elsewhere, has a space in the evals work. It’s not purely technical, and it shouldn’t be. We actually find that in use cases where there is a technical team, they’re the most cautious about what the scope of their service should be, and we often find that program staff is actually quite ambitious about what the AI application they’re building should do.
So while Sanket was talking about contributions in terms of starting anywhere with software, I would say the same to anyone on the program staff or on the design side: you can start anywhere in the eval stack. It could be just starting with: this is my list of questions, and this is what the answers from this service should be, or this is what the ideal should be. So this is not just about technical contributions; it’s about expertise, all of it. Yeah, and just agreeing with that last point: some of the most interesting conversations I’ve had about human rights, food security, education, and mental health and well-being have all been in the last couple of years through AI evaluations, which is odd, honestly, to say.
But it’s because we have this generative thing essentially giving us an output, and we have to sit there and think critically about what it means in any given context. And that has resulted in some really fascinating discussions around, again, the multicultural aspect, the legality, the cultural context, the geography, all the different dimensions of these topic areas. Should we open it up to questions? Yeah, are there questions in the audience? Yep, want to go?
Thanks to the panel. This has been one of the more technically granular sessions I’ve attended, and I’ve enjoyed it as a former engineer. Some context: I work on tech and geopolitics. The reason I say that is the bigger context of the summit, from long before to even, say, the president of Mozilla saying that open source is the answer to India really making it big in the AI space, or rather scaling it to the kind of impact we’re looking to make. Geopolitically, one of the things that strikes me, from a democratic or principle-led lens, and I was talking about this to Sanket before the session: could the panel help me, and therefore the others, understand what some of the risks are that come with the open source approach to scaling up,
versus an open-weight or a closed system, for example, and please check me if my technicalities are off the mark? Whether you highlight a couple of risks or a framework for approaching risk: bad code being added on is one conversation we have heard, but are there other loopholes in that process? I’d love to get a perspective on that. Thank you.
I have a lot of thoughts on the open-weight conversation, but I won’t go into that. One thing I will say is that putting evaluations under an open source software license is actually low stakes, in the sense that it empowers more people to evaluate the systems that affect their lives. That’s part of our theory of change at Humane Intelligence. So for that, I actually think there’s very minimal downside and a lot of upside. One thing that’s going to be quite confusing for a lot of people, though, is the idea of open weight versus open source software versus open data, because when it comes to the actual LLMs, and to evaluating them, the data is obviously a very critical piece.
And just because you open source the software doesn’t mean that the data produced with it is open data. That relationship is not one-to-one. So I think there will be a lot of contention over what exactly is open in the software. That came up a lot in our research at GitHub: many organizations that were quite sophisticated technically didn’t necessarily realize that they could create closed data with open source software, or use proprietary software to create open data. Again, I don’t really see a ton of downsides with AI evaluation. One thing that could go wrong is if you take people who are not subject matter experts and they start to adjudicate things that they
know nothing about. So if you take somebody who knows nothing about human rights and they create a policy around whether an output about human rights is good or bad, I would say that’s not a good thing for the world. But that’s probably going to happen regardless. So that’s my lazy answer.
I’d just like to say that, in general, the idea of human in the loop has to be applied very rigorously when you’re thinking about evaluations, because you’re more or less putting a stamp of approval on the behavior of models in a particular situation, context, or safety setting. We are not yet at the point where these things should be automated; caution is better, and you would rather index on caution than on speed or volume. If you scale big with open source, don’t discount the human-in-the-loop evaluation aspect. Certainly not right now.
So my question is related to that. It’s broadly about how you scale red teaming. Human-in-the-loop is important for red teaming, but that also means there are barriers involved at each step: you need humans to identify gaps in the system, humans to create the prompts that could test the model, and humans to evaluate the responses. Does the panel, and this is for everybody, have tips on tools that could be used to scale different parts of this pipeline? Because red teaming is also a continuous process, right?
And it’s hard, and as models keep coming out and gaps keep emerging, what are the ways you see in which these parts of the red-teaming pipeline could be sped up, perhaps to scale it and evaluate multiple models in different areas and different applications?
One of the things that we’re looking at now is more ontology-based approaches for mapping out the problems. What often happens with human-in-the-loop AI red teaming is that you take essentially a random checklist and say, these are the prompts and this is what they cover, but there’s no real understanding of the relationships within the problem space. So if you’re looking at human rights instruments, for example, you could take the different clauses, the different demographics, the power structures inherent in a violent conflict, put that into an ontology, and then look at the proximity and strength of relationships and at the most egregious cases: what is the thing that’s going to blow up the entire system if this is the output that comes out? By taking the ontology-based approach, we’re putting more thought into what the prompt constructs should look like, so that when we sit down with AI red teamers we know the scenarios are actually representative of the problem space and of the areas most likely to be problematic. So I think that’s one way we’re trying to do it, not necessarily for speed, but for mapping out the methodology and for replication in the future.
So if somebody were to switch out a model, add a RAG system, or modify their system in any way, we can more easily replicate the scenarios and get a temporal view as they build something out. But it’s true that it takes a lot of time. I’ve seen a lot of examples of synthetic data using LLMs: you can do seed prompts, or narrative creation for your scenarios. But again, unless you go in with a clear sense of what the problem space is, oftentimes you’re just cherry-picking random parts of it.
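The ontology-driven idea can be sketched minimally: represent the problem space as entities with weighted relations, then turn the highest-risk pairs into scenario stubs. The entities, relations, and risk weights below are invented for illustration, not a real rights ontology or Humane Intelligence’s actual method.

```python
# Sketch of the ontology-driven approach: model a problem space as pairs of
# entities with a risk weight, then pick the highest-risk pairs to seed
# red-team scenarios. All data here is hypothetical.

relations = [
    # (entity, clause, risk_weight in 0..1)
    ("displaced persons", "right to shelter", 0.9),
    ("minors", "right to education", 0.8),
    ("detainees", "due process", 0.95),
    ("journalists", "freedom of expression", 0.7),
]

def top_scenarios(rels, k=2):
    """Return prompt stubs for the k highest-risk entity/clause pairs."""
    ranked = sorted(rels, key=lambda r: r[2], reverse=True)[:k]
    return [f"Scenario: test model guidance involving {a} and {b}"
            for a, b, _ in ranked]

scenarios = top_scenarios(relations)
```

The point of ranking rather than sampling at random is exactly the one made above: the scenarios handed to red teamers should cover the parts of the problem space most likely to be problematic, and the same ranking can be re-run after a model or RAG swap to replicate the exercise.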
Similarly, last year, when we were trying to figure out whether existing safety frameworks apply for India or not, we were working with an expert group and did focus group discussions: very labor-intensive, with a lot of thick ethnographic evidence. What comes out of those conversations are themes. We understand, for example, that sex determination is a concern, and that acid attacks are a concern. Where you could possibly try automation is in then generating prompts based on those themes. One of the challenges when you’re looking at Indian languages is that current large language models aren’t very good at generating natural spoken Hindi or spoken Tamil.
So even when you have those prompts, we actually found it easier sometimes to just write them ourselves and do variations ourselves. But we did try the automated step: if this is the theme and this is the persona, can you generate prompts based on that? And that becomes part of your evals. So I think there is a mix of automation and human work that’s possible. As LLMs advance, the automation will get better, but I also think you will need that human instinct; that step will be needed. And also, the way safety currently works, to some extent, is a bit of a whack-a-mole band-aid, right?
So once you discover a risk, it gets patched, and then you discover something else. You discover that, say, punctuation in Indian languages can actually jailbreak models, and once you discover that, you can try all sorts of combinations: let’s try this symbol, let’s try that symbol, and then they’ll fix that issue. Then you discover something else. So I don’t think we’re ever going to get a perfectly safe system, but you need that human insight to do the first-level testing and recognize: this is new territory that has not yet been taken care of.
You can then use automation to generate more test cases or build out your data set.
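The theme-plus-persona step described above can be sketched as simple template expansion, with the output treated only as candidates for human review. Themes here echo the ones mentioned on the panel; the personas and template are invented placeholders.

```python
# Sketch: expand themes x personas into candidate eval prompts, the
# "automated step" described above. Personas and template are hypothetical;
# a human reviewer would keep, edit, or reject each candidate before it
# enters the eval set.

themes = ["sex determination", "acid attacks"]
personas = ["worried parent", "rural health worker"]
template = "As a {persona}, ask the assistant about {theme}."

candidates = [template.format(persona=p, theme=t)
              for t in themes for p in personas]

# Placeholder for the human-review stage: in practice each candidate is
# read, rewritten into natural spoken language, or discarded.
approved = [c for c in candidates if len(c) < 200]
```

Since the panelists note that LLMs still struggle to produce natural spoken Hindi or Tamil, a realistic pipeline would generate candidates this way and then have native speakers rewrite them by hand.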
I was just going to add one other thing, since she was talking about automation. From someone else I heard that clustering turned out to be a very useful way to find different classifications of behaviors in model outputs, classifications that were not intuitively obvious when they started evaluating models, and therefore to identify the places where you could concentrate more effort. And then, human in the loop is a very generalized term: where in the loop? That will keep changing as we refine things. But I interrupted you.
So in terms of scalability: first of all, please take this with a pinch of salt, because I’m not an expert in this field. I was reading a blog post by Lilian Weng, who was on the OpenAI team, and she described the concept of model red teaming: using a model to red-team another model. And, as I mentioned earlier, using reinforcement learning you can adjust the model that is red-teaming the model you want to correct. Yeah, exactly.
What about evaluations? A lot of people are using LLMs as judges, but do you think that’s a sustainable way of doing it?
Yeah, I think that’s a good question, and it is a good way to reduce the human role on the evaluation side. But our take, which we presented on the first day, is that you should always do a spot check with humans as well, however small; it can be 0.5%. Because ultimately, even an LLM as a judge struggles with the same language capability barriers as your original model; that will always happen. So we think you should always do a spot check, and you will always need a human to do some sample checking.
Yeah, just quickly on that. When I was at MLCommons, we did something similar. There was research done, essentially a benchmark of benchmarks. If you use the same LLM to judge the other LLM, then any bias it has is essentially magnified. So that’s something to keep in mind: whatever vulnerability you’re trying to mitigate, bias or hallucinations or anything else, it will basically be amplified if you use the same LLM to judge the LLM.
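The spot-check recommended above is easy to operationalize: route a small random sample of LLM-judged verdicts to human reviewers. A minimal sketch, with invented verdict records and a 5% rate for illustration (the panel suggests even 0.5% can suffice at scale):

```python
import random

# Sketch: sample a small fraction of LLM-as-judge verdicts for human
# spot-checking. The verdict records are invented placeholders.

verdicts = [{"id": i, "judge": "llm", "label": "safe"} for i in range(200)]

def sample_for_humans(items, rate, seed=0):
    rng = random.Random(seed)  # seeded so the sample is reproducible/auditable
    n = max(1, int(len(items) * rate))
    return rng.sample(items, n)

spot_check = sample_for_humans(verdicts, rate=0.05)
```

Seeding the sampler matters in practice: a reproducible sample lets a second reviewer audit exactly the same verdicts, and disagreement between the humans and the LLM judge on this sample is the signal that the judge, not just the model under test, needs attention.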
Hi. Thank you for the lovely panel. My question is about how governments and standards institutions can think about benchmarking. Specifically, I’d like to know your thoughts on standardization, on setting up the right standards for benchmarks, and finally on maintainability, given that these institutions may not have their own in-house experts who stay on for a long time. How do you think about all of these questions, especially in the context of, for example, local-language elements that are not well understood, or how we benchmark them?
I have a lot of thoughts on benchmarks. Having built one, it was not easy. One of the things we think about a lot at Humane is benchmarking, because we get asked about it so often. It’s become the industry darling because it rises to the moment of the hyper-adoption and hyper-scale we’re seeing with AI. But one thing that comes up in pretty much every conversation we have with organizations is: what exactly are you trying to benchmark? We have a case where we’re potentially working with an organization in primary healthcare in Nigeria, and we’re trying to benchmark what they’re doing. So I asked them: are you trying to benchmark for hallucinations in the Yoruba language, or bias in the Hausa language? And they didn’t know, literally. All they knew is that somebody told them to build a benchmark for their AI system, so they should go and do that. So the problem is: what happens if you build a benchmark without starting from AI red teaming or another evaluation type? You may build a benchmark that looks at hallucinations or factuality, however you judge that, and then it turns out the real problem with your LLM is bias. And if your benchmark is measuring the wrong thing, you’ve built something that is computationally very expensive and takes a lot of time, honestly.
The math is kind of murky with benchmarks, I’ll be honest, and on top of that you’re not measuring the right thing. So we always recommend starting with red teaming to identify the problem space. Once you get to that hyper-focused problem space, then you can do a benchmark and say, comparatively speaking, this is the model’s performance against that specific metric. Thank you.
Just to add to that: often the sensitivity and the importance of addressing a concern like bias differ across domains. Bias in, say, a maternal health use case can be very problematic in a context where people are trying to use a bot for sex determination, and we’ve seen this in the real world. Gendered language, say, is always a problem. But if resources are limited, how you prioritize which concern to address depends absolutely on the context, on the specific application. So, I guess that is to say: just make that list.
What are you trying to measure? I think I heard someone say: what is your headline? So figure out what it is you’re trying to measure, accept that you can’t measure everything, and then build around that. And that is the universal thing about benchmarking: it translates to anything global, or to a specific, regionally contained language or context.
Just one tiny follow-up on maintainability, which I already asked about. Maybe Sanket, given that you’ve worked on this: how do you think about maintainability for benchmarks, say, for institutions like governments that don’t have in-house experts but would like to set standards and maintain these benchmarks over time?
Yeah, I don’t think I have bright thoughts on this. Sorry.
I think we have time for one more question, if it’s very quick. Otherwise, we can wrap. Any other final thoughts? No, I mean, I guess… just for everyone, everyone has a role in evaluations. Evals, evals, evals. That’s unfortunately what all of us have.
And you have a role in open source.
Yeah, and of course. Especially with Claude Code, because now you can make a lot of code with Claude. Anyway, thank you all for coming. Appreciate it. Thank you.
“Sanket Verma introduced himself as a NumFOCUS board member and technical‑committee participant.”
The knowledge base confirms both roles: Sanket Verma serves on the NumFOCUS board of directors and on its technical committee [S2].
“NumFOCUS fiscally sponsors core scientific libraries such as NumPy, SciPy, Pandas and Matplotlib.”
S2 explicitly notes that NumFOCUS is a fiscal sponsor for foundational projects used in AI, listing NumPy, SciPy, Pandas and Matplotlib, confirming the claim.
“Open‑source guardrails are especially vital for resource‑constrained regions such as India, where sharing evaluation stacks prevents duplicated effort across organisations.”
The knowledge base highlights that open-source tools are especially necessary for innovation in lower-resource settings, providing broader context for the importance of guardrails in places like India [S31].
“AI red‑teaming is a structured, contextual evaluation method that assembles domain experts to devise adversarial scenarios and probe model weaknesses, rather than relying on generic benchmarks.”
S66 discusses voluntary commitments from AI companies emphasizing that robust red‑teaming is essential for safety and evaluation, adding context to the definition of AI red‑teaming presented in the report.
The panel showed strong consensus that open‑source tools, community involvement, and structured, human‑guided red‑team processes are key to safe, sustainable AI deployment. Benchmarks should be grounded in red‑team findings and tailored to specific contexts, especially for governments with limited in‑house expertise. Concerns about AI‑generated contributions and provenance were widely shared, prompting calls for clear policies.
Consensus was high across most speakers on the importance of open source, community, and human oversight, indicating a unified direction for future AI evaluation practices and policy development.
The panel largely converged on the importance of open‑source, community‑driven AI evaluation and the need for better governance of AI‑generated contributions. Disagreements were limited to the perceived risks of open‑source scaling (audience vs. Mala) and the degree of automation appropriate for red‑teaming (Sanket vs. Tarunima). Most divergences were methodological rather than ideological, focusing on how best to achieve shared goals such as democratizing safety tools, scaling red‑teaming, and managing AI‑generated code.
Disagreement was low to moderate. The core objectives of enhancing AI safety, fostering open‑source collaboration, and improving evaluation practices were widely shared. The few points of contention revolved around risk perception and the balance between automation and human oversight, suggesting that while consensus exists on direction, further dialogue is needed to align on implementation strategies.
The discussion was shaped by a handful of pivotal remarks that moved the panel from high‑level optimism about AI to a nuanced examination of concrete risks and practical solutions. Mala’s introduction of AI red‑teaming set the agenda, while Sanket’s anecdote about AI‑generated pull requests exposed an urgent governance gap, prompting a cascade of comments on provenance, credentialing, and policy needs. Cultural framing (additive vs. reductive architecture) and real‑world ethical dilemmas (the HIV‑survivor chatbot) broadened the conversation to include societal context. Proposals for AI‑driven tooling (code‑base mapping) and systematic ontological methods offered constructive pathways forward, and cautions about LLM‑as‑judge kept the dialogue grounded. Collectively, these insights redirected the tone from speculative to action‑oriented, highlighting both the opportunities and the responsibilities that open‑source communities must grapple with in the age of LLMs.
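The caution about LLM‑as‑judge bias implies a simple mitigation the panel gestured at: judge every output automatically, but route a random sample to human subject‑matter experts so systematic judge errors can surface. A minimal sketch of that workflow, assuming a hypothetical keyword‑based stand‑in for the judge (all function names and the scoring heuristic are illustrative, not from the session):

```python
import random

def llm_judge(response: str) -> bool:
    """Hypothetical stand-in for an LLM-based judge.
    A real deployment would call a model API here; this
    keyword check exists only to make the sketch runnable."""
    return "refuse" not in response.lower()

def spot_check(responses, sample_rate=0.2, seed=42):
    """Judge every response automatically, then flag a random
    sample for human subject-matter-expert review."""
    rng = random.Random(seed)  # seeded for a reproducible sample
    judged = [(r, llm_judge(r)) for r in responses]
    for_review = [r for r, _ in judged if rng.random() < sample_rate]
    return judged, for_review

responses = [
    "Here is how to do that safely.",
    "I must refuse this request.",
    "Sure, the answer is 42.",
    "I refuse to answer.",
]
judged, review_queue = spot_check(responses)
print(len(judged), len(review_queue))  # prints "4 1"
```

The design choice is that the human queue is drawn independently of the judge's verdicts, so reviewers see both accepted and rejected outputs and can estimate the judge's error rate rather than only auditing its refusals.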
Disclaimer: This is not an official session record. DiploAI generates these resources from audiovisual recordings, and they are presented as-is, including potential errors. Due to logistical challenges, such as discrepancies in audio/video or transcripts, names may be misspelled. We strive for accuracy to the best of our ability.