Safe and Responsible AI at Scale: Practical Pathways
20 Feb 2026 16:00h - 17:00h
Session at a glance
Summary
This panel discussion focused on making data “AI-ready” to bridge the gap between valuable information trapped in documents and the potential of artificial intelligence to make it accessible and useful. Shalini Kapoor, the moderator, opened by highlighting how enterprises and organizations possess a wealth of information stuck in PDFs and documents that people are reluctant to share with AI because of trust and safety concerns. She emphasized the need to make data interoperable, safe, and trusted so that AI systems can use it effectively.
Rohit Bardawaj from India’s Ministry of Statistics stressed the importance of establishing a uniform definition and framework for AI readiness, noting that many people don’t understand what it takes to make data truly AI-ready. He outlined key requirements including cataloging data in machine-readable formats, providing proper metadata, creating context files, and structuring data with defined dimensions and attributes. Prem Ramaswami from Google’s Data Commons project discussed their open-source approach to creating knowledge graphs from multiple datasets, emphasizing that data should be federated and governed locally rather than centralized. He advocated for using AI as a tool to supplement human intelligence rather than replace it.
Ashish Srivastava brought a practitioner’s perspective, highlighting three critical challenges: data interoperability across fragmented systems, contextualization with domain-specific vocabularies, and data verification rather than relying solely on declared information. The panelists agreed that AI readiness requires combining structured knowledge graphs with large language models, implementing proper governance frameworks, and creating incentive models for data sharing. They concluded that while the technology shows promise, success depends on establishing trust, maintaining data sovereignty, and building collaborative frameworks between institutions and industry to create a sustainable data economy.
Keypoints
Major Discussion Points:
– Data Fragmentation and Silos: The discussion highlighted how valuable information remains trapped in PDFs, documents, and isolated systems across enterprises and government organizations. This creates an “information divide” where entrepreneurs and citizens cannot access relevant data (like government schemes or compliance information) that could benefit them, even when using AI tools.
– AI-Ready Data Framework and Standards: The panelists emphasized the need for a unified framework to define what makes data “AI-ready.” This includes creating machine-readable catalogs, proper metadata, context files, business glossaries, and standardized codes. The discussion stressed moving beyond PDF-based documentation to structured, interoperable formats.
– Trust, Safety, and Data Governance: A central theme was balancing data accessibility with security and trust concerns. The conversation explored federated data models where organizations maintain control over their data while making it AI-accessible, and the importance of verifiable versus declared data for decision-making.
– Practical Implementation and Tools: The panel showcased real-world solutions like Data Commons (open-source platform for statistical data), MCP servers for data interoperability, and the concept of “data boarding passes” for B2B data access. These tools aim to make data accessible without requiring users to leave their existing workflows.
– Business Models and Incentive Structures: The discussion addressed sustainability of data platforms through various funding models, from government-funded public data to commercial licensing. The panelists introduced the “GIVE” model (Guaranteed trust, Incentives, Value, Exchangeability) as a framework for creating viable data economies.
Overall Purpose:
The discussion aimed to address the challenge of making vast amounts of existing data (particularly in government and enterprise settings) accessible and usable for AI applications, while maintaining data sovereignty, trust, and creating sustainable business models for data sharing.
Overall Tone:
The tone was collaborative and solution-oriented, with industry experts and government representatives working together to identify practical approaches. While acknowledging significant challenges (data silos, trust issues, technical complexity), the conversation remained optimistic about the potential for creating AI-ready data infrastructure. The tone was technical but accessible, with speakers using real-world examples to illustrate complex concepts. There was a sense of urgency about the need to start building these systems now, despite imperfections, rather than waiting for perfect solutions.
Speakers
Speakers from the provided list:
– Shalini Kapoor: Panel moderator, works on AI and data initiatives, mentioned working on Amul AI and Bharat Vistar projects, associated with People Plus AI website
– Rohit Bardawaj: From MOSPI (Ministry of Statistics and Programme Implementation), statistician, works on AI readiness of data and has published papers on the topic
– Prem Ramaswami: From Google, works on Data Commons project, previously worked on Google Search, focuses on making public data more accessible through open source platforms
– Ashish Srivastava: Industry practitioner and solution builder with three decades of experience, currently heading A4I lab (AI Innovation for Inclusion Initiative) – a collaboration between Microsoft and IIIT Bangalore, previously headed a Gen AI company, works on AI for social problems
– Audience: Multiple audience members who asked questions during the Q&A session
– Speaker 1: Asked a question about setting up Data Commons instances
Additional speakers:
None identified beyond the provided speaker names list.
Full session report
This panel discussion, moderated by Shalini Kapoor, brought together experts from government, industry, and technology sectors to address the challenge of transforming existing data repositories into “AI-ready” formats. The conversation took place against the backdrop of significant AI developments, including the Prime Minister’s launch of Amul AI that same morning.
Framing the Problem: The Information Divide
Kapoor opened by highlighting a fundamental challenge in today’s data landscape. She illustrated this with a concrete example: an entrepreneur in Nagpur seeking information about biotechnology subsidies available through government schemes. Despite the existence of relevant programmes offering substantial support for women in biotechnology, this information remains buried in government notifications that are neither discoverable through standard search engines nor accessible via current AI systems.
This scenario exemplifies what Kapoor termed the “information divide”—where valuable information exists digitally but remains trapped in organisational silos, locked away in PDFs, documents, and legacy systems across enterprises and government institutions.
Government Perspective: Building Technical Infrastructure
Rohit Bardawaj from MOSPI (Ministry of Statistics and Programme Implementation) challenged the panel’s fundamental assumptions by conducting an audience poll that revealed no uniform understanding of what constitutes “AI-ready” data. He argued that this definitional gap represents the primary barrier to progress, shifting the conversation from technical solutions to foundational framework development.
Bardawaj outlined key requirements for AI-ready data: machine-readable cataloguing systems (preferably JSON or XML rather than PDFs), comprehensive metadata, context files, business glossaries for domain-specific terminology, and structured databases with clearly defined attributes. His ministry has implemented these principles through their Model Context Protocol (MCP) server, which he described as creating a “universal socket” that enables any large language model to access verified government statistical data.
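Bardawaj’s cataloguing requirements lend themselves to a concrete illustration. The sketch below shows what a machine-readable catalog entry of this kind might look like, with a toy “readiness” check; the field names and schema are invented for illustration and are not a published MOSPI format.

```python
import json

# Hypothetical machine-readable catalog entry of the kind described above:
# JSON rather than PDF, with metadata, a context note, and defined
# dimensions/attributes. All field names are illustrative assumptions.
catalog_entry = {
    "dataset_id": "cpi_rural_monthly",        # stable identifier
    "title": "Consumer Price Index (Rural), monthly",
    "format": "csv",
    "metadata": {
        "publisher": "National statistics office (example)",
        "frequency": "monthly",
        "unit": "index, base year = 100",
    },
    "context": "Prices collected from rural markets; see methodology note.",
    "dimensions": ["state", "month"],          # what each row is keyed by
    "attributes": ["index_value"],             # the measured values
}

REQUIRED = {"dataset_id", "title", "format", "metadata", "dimensions", "attributes"}

def is_ai_ready(entry: dict) -> bool:
    """A toy 'AI-readiness' check: every required descriptor is present."""
    return REQUIRED.issubset(entry)

print(is_ai_ready(catalog_entry))              # True for the entry above
print(json.dumps(catalog_entry, indent=2)[:40])
```

The point of the sketch is only that a machine (or an LLM acting through an MCP-style server) can parse and validate such an entry mechanically, which it cannot do with a scanned PDF.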
The success of this approach was demonstrated through creative applications, with Bardawaj’s favourite example being the analysis of grain price inflation based on references in Tamil songs. This illustrated both the technical feasibility and unexpected potential of properly structured AI-ready data.
Open Source Approaches to Data Access
Prem Ramaswami from Google’s Data Commons project provided a complementary perspective on making public data accessible through open-source, federated approaches. His work addresses the tension between data accessibility and data sovereignty by enabling organisations to maintain local control while participating in broader interoperability networks.
Ramaswami explained that Data Commons combines structured knowledge graphs with large language models to create AI search engines capable of quickly accessing and analysing multiple datasets simultaneously. He emphasised that this addresses a fundamental limitation of human cognition—our difficulty processing multi-dimensional problems that require computational assistance.
His vision extends to democratising data analysis capabilities, particularly for India’s 74 million micro, small, and medium enterprises, enabling them to access sophisticated analysis without expensive data science teams. The federated model allows organisations to maintain governance over their information while contributing to broader knowledge networks, with successful implementations including work with the UN Statistical Department managing data from the WHO, ILO, and other agencies.
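The grounding pattern underlying this approach can be sketched in a few lines: answer from the knowledge graph where a verified fact exists, and fall back to (and flag) a model guess where it does not. The triples and the `llm_guess` stub below are invented for illustration, not Data Commons internals.

```python
# Toy illustration of grounding answers in a small knowledge graph of
# statistical facts before falling back to a language model (stubbed here).
# The triples and llm_guess are illustrative assumptions.
TRIPLES = {
    ("India", "population_2011"): "1.21 billion (Census 2011)",
    ("India", "states"): "28",
}

def llm_guess(subject: str, predicate: str) -> str:
    """Stand-in for a real LLM call; a real system would query a model."""
    return f"[unverified model guess about {subject} {predicate}]"

def grounded_answer(subject: str, predicate: str) -> tuple[str, bool]:
    """Return (answer, grounded): prefer the knowledge graph, else the model."""
    fact = TRIPLES.get((subject, predicate))
    if fact is not None:
        return fact, True
    return llm_guess(subject, predicate), False

print(grounded_answer("India", "population_2011"))  # grounded fact, True
print(grounded_answer("India", "gdp_2030"))         # fallback, flagged False
```

The design choice worth noting is the explicit flag: downstream consumers can distinguish a verified statistic from a model-generated gap-filler.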
Industry Implementation Challenges
Ashish Srivastava from IIIT Bangalore brought a practitioner’s perspective, identifying three fundamental problems that AI-ready data must address: interoperability across fragmented systems, contextualisation for domain-specific applications, and verification of data quality.
His work in women and child health illustrated the interoperability challenge, where critical information about child development is split between different government departments—nutrition data managed by Women and Child Development, while birth and immunisation data resides with Health and Family Welfare departments.
On contextualisation, Srivastava noted that while large language models are improving at general tasks, they consistently fail with specialised terminology. His team creates comprehensive glossaries that work alongside LLMs to provide accurate domain-specific translations, requiring significant upfront investment but proving essential for reliable performance.
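The glossary-plus-LLM pattern described here can be sketched roughly as follows: protect domain terms with placeholders, translate the remaining text with a general model, then substitute the curated renderings back in. The glossary entries and the `translate_stub` placeholder are illustrative assumptions, not the team’s actual implementation.

```python
import re

# Minimal sketch of glossary-assisted translation: curated domain terms
# bypass the general translator. GLOSSARY and translate_stub are invented
# for illustration.
GLOSSARY = {  # English term -> curated Marathi rendering (illustrative)
    "photosynthesis": "प्रकाशसंश्लेषण",
    "refraction": "अपवर्तन",
}

def translate_stub(text: str) -> str:
    """Stand-in for an LLM translation call; identity here, so the
    glossary substitution is visible in the output."""
    return text

def translate_with_glossary(text: str) -> str:
    placeholders = {}
    for i, term in enumerate(GLOSSARY):
        token = f"__TERM{i}__"
        text, n = re.subn(rf"\b{re.escape(term)}\b", token, text,
                          flags=re.IGNORECASE)
        if n:
            placeholders[token] = GLOSSARY[term]
    out = translate_stub(text)          # general translation of the rest
    for token, curated in placeholders.items():
        out = out.replace(token, curated)
    return out

print(translate_with_glossary("Explain photosynthesis and refraction."))
```

The upfront cost is building the glossary itself; at inference time the domain terms never reach the general model, so a mistranslation of specialised vocabulary cannot occur.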
Srivastava also highlighted the verification problem in public data, where survey-based information often relies on self-declared responses rather than verified facts. He referenced a conversation with an MIT mathematician about LLMs being inherently probabilistic systems, meaning they can never achieve perfect consistency, making external guardrails and human oversight essential.
Economic Models and Sustainability
The discussion revealed important insights about the economic realities of data infrastructure. Bardawaj noted that “open data is not free data,” explaining that MOSPI operates under a tiered model where research use remains free while commercial applications require compensation, reflecting substantial costs in data collection and maintenance.
Kapoor introduced the concept of “data boarding passes” as a standardised approach to B2B data access, providing efficient onboarding processes for organisations to access AI-ready data systems. She also touched on the “give data, give model” concept and the importance of creating proper incentives for data sharing.
Addressing AI Limitations and Governance
A significant portion focused on current AI system limitations. Bardawaj presented evidence that identical prompts applied to the same datasets can produce different results, highlighting reliability concerns. Kapoor mentioned her team’s ongoing benchmarking work to address consistency issues across different AI models.
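A consistency benchmark of the kind described can be approximated simply: repeat the same prompt many times and score how often the most common answer appears. The `flaky_model` stub below stands in for a real LLM call; only the benchmark logic is the point.

```python
import random
from collections import Counter

# Sketch of an answer-stability benchmark. flaky_model is an illustrative
# stand-in that returns "answer A" roughly 3 times out of 4.
def flaky_model(prompt: str, rng: random.Random) -> str:
    return rng.choice(["answer A", "answer A", "answer A", "answer B"])

def consistency_score(prompt: str, runs: int = 100, seed: int = 0) -> float:
    """Fraction of runs that agree with the modal answer (1.0 = stable)."""
    rng = random.Random(seed)
    answers = [flaky_model(prompt, rng) for _ in range(runs)]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / runs

score = consistency_score("What was grain price inflation that year?")
print(f"stability: {score:.2f}")  # roughly 0.75 for this stub
```

A real benchmark would also vary phrasing and compare across models, as the panel suggests, but the same modal-agreement metric applies.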
The panel revealed that making data AI-ready is fundamentally a governance challenge rather than merely technical. The audience poll demonstrated that while technical solutions are important, primary barriers involve coordination, standardisation, and institutional alignment across different organisations.
Practical Applications and Future Directions
The discussion included practical examples and audience questions about business models and coordination challenges, such as road construction projects spanning multiple districts. Speakers repeatedly referenced booth demonstrations where attendees could see these solutions in action.
Srivastava emphasised that AI comprises only 10-15% of effective solutions, with the remaining 85% consisting of supporting infrastructure. This reframes expectations about AI deployment and highlights the importance of comprehensive system design.
Key Takeaways
The panel concluded that creating AI-ready data infrastructure requires hybrid solutions combining government frameworks, open-source approaches, and practical implementation experience. Success depends on institutions’ ability to coordinate effectively, establish shared standards, and maintain long-term commitments to data quality and accessibility.
The emphasis on federated governance models, sustainable economic frameworks, and appropriate safeguards provides a roadmap for developing systems that serve both local needs and broader development objectives. However, as the panellists acknowledged, this represents a long journey requiring sustained effort and collaboration across sectors.
Session transcript
Deep work on fragmented data silos. As you all know, AI thrives on data. And today, most of the LLMs, what they have done is, they’ve definitely scraped the internet and they’re doing really well. But the value of the answer an LLM would give is based on what it can fetch from the actual data, which means in enterprises and organizations, there’s a wealth of information. There’s a wealth of information stuck in PDFs, stuck in documents, which people have a fear of giving to AI. So there is a fear, there’s a lack of trust today, and that data stays where it is, just digitized. So, for example, there could be an entrepreneur, say in Nagpur, wanting to know about the scheme that applies to the biotechnology plant that she wants to put up in Nagpur.
Now, if you see, the MSME industry has a scheme for her, for women, for biotechnology. And, you know, it’s a very good subsidy that’s available. But where is it stuck? It’s stuck in a government notification which came out, which she’s not aware of. And what she is doing is she’s actually going to LLMs and asking that question, and she’s not getting it. She’s also searching in various places. She doesn’t get it. So that’s the divide, the information divide, which is existing. And the information which is there, stuck in documents or even in digitized form, has to be AI-ready, so that in a safe and trusted manner, and these two are very important, safe and trusted, the data can be linked, made useful and then made available.
Now, this is a long journey. It’s not an easy journey, because the data journey is about how you clean the data, you make it ready, you link it, you make it relevant, you make it useful, and then present it in a manner so that you have a choice, and you want to have a choice of various elements. I mean, we live in the age of choice, right? We don’t want to be locked into anything in particular. So that’s the data problem that we have in front of us. The opportunity is humongous. I’ll give you an example: I’m talking to an organization which works with 3,000 entities, and those 3,000 entities actually manage 5 million new compliances in a year. They have those kinds of queries, 5 million queries on new compliances. Forget existing compliances, because there are new compliances which get generated by the government, by various bodies, and then they have to search. So the problem is humongous, and it can be bridged. It can be bridged, but we have to think about how to make data interoperable, useful and AI-ready. So with that background, I’d like to get into our panel and talk to some of the experts that we have today. My first question is to Rohitji, who is from MOSPI. So India generates a vast amount of statistical and administrative data.
MOSPI, actually, for all of you, calculates the GDP for India. They have the source of all the data at village and taluka level, so the data is there. But as you think about making data AI-ready, what do you think is the responsibility of institutions, and yours is an institution, to make the data trusted, safe and available to all?
Thank you, Shalini ji. Good morning, everyone. So, trusted, safe and AI-ready for everyone. I would like all of you to take a step back on this, and just let us understand: do we have a uniform definition of what AI readiness is at this point in time? Do we have one? And I’ll not say that it’s not there in the ecosystem, it’s there in the ecosystem, but do we have an agreement about it? So there are two issues we need to understand when we talk about AI readiness of data. One is that, so let me just go back to today’s conversation I had with one of my colleagues over a WhatsApp group, you know, we all are very active there.
So one of my papers has just been accepted at one of the largest conferences, and it’s about AI readiness of data, and he asked me, what’s so great about it? So I asked, why, what is not so great about it? So he told me, I put Bangla into ChatGPT and it completely understands, so what’s new you are doing? So the point I’m trying to make is, people are not aware what it takes to make data AI-ready. We all understand, and then he said that, no, but it’s not understanding, and he talked about some of the dialects of this country, and we have a huge number of dialects, and Shalini ji, he asked me, how do I train ChatGPT on this dialect?
I said, it’s not my job, it’s Sam Altman’s job. So the issue here is that we don’t know. And that is the biggest responsibility of our institutions, like MOSPI, to make people aware what AI readiness is all about. And then, AI readiness means, if I start, you know, talking about there should be a context file, there should be semanticity, there should be metadata, then for many of us, sorry about that, for many of us it would not make sense. So the first idea is to create a framework, an agreed framework, say, people, not only me, it’s not about my way or the highway, all of us work together, create that framework, put it up for people to know.
The first thing I would do, and I plan to do it literally, is try to create a slide deck showing what AI can see and what a human can see. So if my folder has 10 versions of a budget, 1, 2, 3, 4, 5, 6, and if I ask a question from that folder, some answer will come from budget one, some answer will come from budget two, because unlike a human, where I am focused on this question, AI is designed to scan the entire thing available. So it’s a big difference between human and AI: I can be focused; AI, once I give a thing to it, will just scan everything it has in its domain. So I would say, and just, you know, not taking much of your time, the starting point should be: let us create this framework, let us have a shared understanding, let us have a core AI readiness part and an aspirational AI readiness part, and work on it.
Yeah, I think that’s very relevant, because you cannot leapfrog into everything. I mean, you can have the aspirational, but the foundation is very, very important, and everybody joining that foundation exercise is really important. I’ll go to you, Prem. Let’s talk about data. Data Commons aims to make public data more accessible and usable. You’re from Google, and you have put all this in open source. You’ve been working on US Census data being available. Tell us some more about your experiments and how Data Commons is ready or prepared to work on this challenge.
Thank you for having me here on this panel today. I think one of the areas I’ll start with is the importance of coming to that understanding on AI-ready data, but understanding that the field itself is moving quite quickly at the same time. So whatever agreements we come to today, in six months it feels like we’re dealing with a brand new technological landscape that we’re staring down. What Data Commons tried to do was say: if we can get our data in that machine-readable format, which means structured, which means machine-readable metadata also, and a format where that format specification is not stuck behind a 500-page PDF, right? Can we make that in a way that the machine can understand it, interpret it, and then use it?
Our theory behind this is that the idea of a knowledge graph from that data, combined with a large language model, gives you a much better chance of success in answering your question. So at Data Commons, what we try to do is bring multiple datasets globally together in a common knowledge graph and then put an AI search engine on top of it so that you can quickly access that data. You can play with this yourself at datacommons.org. But what we did is we open-sourced the entire stack, because this idea that the data is centralized with one source is also the dangerous part, and it shouldn’t be, right? The data should be federated.
It should be located at every organization and governed locally by the organizations that are using it. And so one of the things we’ve done by open-sourcing that stack is allowed, for example, the United Nations Statistical Department to use Data Commons as their back end. And so, you know, UN SDGs, WHO data, ILO data, and so on and so forth, is all stored in this common interoperable database now, where instead of a data analyst spending 80% of their time renaming column headers, they can actually focus on the data analysis, so that we can get the impact and the outcomes we want to see. Hope that helped answer the question.
Yes, yes, no, absolutely. I’ll poke you a little bit more to understand on data commons, what’s a vision you have?
So a very simple vision, right, which is: make data-aware decision making the easy answer to take. Today, right now, the majority of the world is flying blind. Whether you’re one of those 74 million MSMEs in India, you can’t afford a bevy of computer scientists and data scientists that you can hire; you pay a tax to play with any data. If you’re a policymaker thinking about climate change, poverty, education, health, these are holistic problems. It’s no longer, I can go to one ministry, pull one spreadsheet and solve poverty. I need to endemically understand how does education, how do health outcomes, how do income and the economy, how do all of these affect poverty locally, right? And that’s the problem we have today, that the world is a multi-dimensional problem. The other problem is our brains are not inherently multi-dimensional. Our brains are great in three dimensions. You add a fourth dimension, which is time, and we’re okay, right? Like, look at climate change: you add time, and it’s greater than our lifetime, we can’t think about it, which is why we’re not solving it, right? But the majority of problems are 50- or 60-dimensional problems. Machines are really good at this, by the way.
And humans are good at using tools that are good at doing things we’re not. And this is where we have to approach AI as a tool we can use. Not as the answer, but as a tool we can use to derive the answer, to supplement our brains in the areas we’re
I’ll poke you a little bit more, but later on.
Shalini ji, I just want to take a second stab on that, just a quick interjection. I’m a statistician, so I’ll be very happy if some of my work can be done by AI, you know, all those large language models. I just read a paper this morning. It’s been written by two undergraduates from a Canadian university, and they said, and they proved it, that if you give the same prompt to AI with the same data set, it gives you two types of analysis. So this is something I just wanted to flag: that we should not be really gung-ho about things which are still untested. But yes, I would be the first to accept and adopt AI and use it for my work, but it needs to be, as you rightly put it, trustworthy.
Yeah, I’ll just comment on this, the stability of an answer, that’s what you’re talking about. We are actually working to create a benchmark on this, because we are doing the same thing. Like, Amul AI was launched this morning by the Prime Minister, and the same thing applies to Bharat Vistar, and we are actually working to see that if you ask the same question multiple times across LLMs, and also to one LLM many times by different farmers, both options, you get different answers. And can we make that a benchmark? That’s what we are working at also, because this is a benchmark which is really needed on the ground, right? So that’s a part, so I wanted to comment.
I’ll go to Mr. Ashish. You’re from the industry, and you work with IIIT Bangalore. Tell us more about the research in the data area, plus how institutions can help build it all together.
Right. So I think my perspective is more as a practitioner, because for the last almost three decades I’ve been a solution builder. So I have seen data not from the data side, but from the solution side, trying to exploit it, trying to use it for the solutions. And I’ll come to the institution part of it. But, you know, when I look at the data and the challenges associated with it, then for the last 10, 12 years I’ve been working on AI for social problems, or digital, like women and child health, where I worked for almost a decade. Now, one of the problems that I realized is that the world is fast moving to where you don’t manage a transaction.
You manage a journey. Okay, and that is the agentic AI and all those things that we are talking about. Now, when I was working a few years back on women and child data, I realized how fragmented it is. If you look at the two main data sets for a child’s health: the anthropometric data, the nutrition data, is with Women and Child Development through their Anganwadi program; if you look at the birth data, the immunization data and a lot of other data, it is with the Health and Family Welfare department. And if you have to have integrated decision making for that child, for what needs to be done, then you have to look at both data sets, but that burden of orchestration comes on to the person who is building the solution; the data does not by itself flow through the workflow. And that is one of the biggest problems that we have to solve: we look at data sets in isolation, but we don’t look at how data flows through the process. The second thing, as I said, is contextualization. We have all read the book, at least some of us, that raw data is an oxymoron. Data always resides in a particular context and with some standardization associated with it, so that you can make some sense out of it.
Now, with education, where we have been working recently, we realized that LLMs are becoming increasingly good, at least with the main languages, not with all the dialects, at translation. The moment they hit any domain-specific vocabulary, that’s when they start failing. Even a class 6 physics question, all these frontier models are not able to properly translate. So we came up with a solution of using a glossary combined with the LLM, so that it does a decent job in terms of overall translation, and the contextualization is transparent to the user. And the third thing which I faced a lot is that when we talk of public data, a lot of it is declared data and not verified.
Not verifiable data. Especially when a lot of planning depends on surveys, and a lot of survey data is actually declared data. Whether you have hypertension or not, yes, no; whether you have this problem, yes, no. What is the verification? No doctor has actually verified that, and you are going to make a decision based on that. So in my opinion, AI-ready data has to solve these three big problems: it has to be interoperable, it has to be contextual, and, the third problem that I was mentioning, it should be verifiable, and governable as an extension of that.
Very relevant. I think you have posed the right challenge. So Prem, I am going to come to you. Let’s just pick one of them, which is contextualization, because I am increasingly seeing that domain information is needed and people are creating these glossaries to add it. Like even in agri, when we had to roll out Mahavista, we actually created a glossary of 5,000 terms, which is in Marathi, it has to be in Marathi, and those terms are being used. And I know we did some experiments and we have created a sandbox environment. You have done it for India, so why don’t you explain how contextualization and domain can be added to Google Data Commons and how it can be helpful?
I think this idea of contextualization and localization is very important. At the end of the day, these are large language models, language being the key word there; they’re not data models. And so, to what Mr. Bhardwaj said earlier, what you want to be able to do is use them to write code to manipulate data, because code is language, but you don’t necessarily want them to be producing data on their own. And one of the problems you have today is also that those large language models are essentially created largely off the web, which has its own biases inherent in it, both language- and locality-wise. And then on top of that, there is the example you used of the full folder of all the budgets, right?
The example I like to use for this is actually, if you ask a large language model about a celebrity that recently had a breakup, it’ll tell you they’re together, because it doesn’t know what just happened over the last month, right? It’s very sad. And so this is where you can use, though, the combination of, you know, you called it a glossary, I always call it a knowledge graph. What is that factual basis of information that I can put together? Now, it’s always going to be a subset of the whole, right? I might be able to cover maybe 0.1% of the world’s information with a knowledge graph. But if I can ground it in those facts, can I then utilize the intelligence of the large model to help me produce some knowledge from those facts, or fill in the gaps in those facts?
And so this, I think, is an opportunity that we actually have in the technology to move it forward. This is one of the areas that we’re actively working on as a team. But again, to do that, you first need that glossary of facts, right? This is where having that knowledge graph of statistical data, even if imperfect at this moment, because it is survey collected. It is dependent on the quality of the question asked, the error bar shown, the quality of that metadata, so on and so forth. But it is a starting point from which you can get more information and use that intelligence to potentially even find those outliers or areas that don’t match what you might be hearing on the ground.
So that’s the opportunity I think that we have.
Because I absolutely agree with you, but I will say it in more direct terms. Sometimes we feel that LLMs, or in a previous version, the AI models, are the solution. They are not the solution. They are only one of the inputs to the solution, and they comprise 10%, 15% of what you’re trying to do. It is what the rest of the 85% is doing. Yes, the LLM will give different answers; how are you compensating, with guardrails, human in the loop, risk assessment? These are the tools which are available today. So, if you have to build, because at the end of it, it’s a probabilistic model, okay, come what may. And I was talking to a mathematician from MIT, and he explained why it will never become perfect. That fact is grounded in mathematics: it cannot ever become as perfect, as every-time consistent, as we are wanting it to be, ever, because then you are taking the main source of its creativity away from it. So what you have to focus on is outside, not inside. That’s all I ever wanted to say.
I agree with you completely, and I started by saying it’s a tool, right? And we use tools to supplement ourselves, not to replace ourselves; to supplement our knowledge, not to replace our knowledge. So I do agree with you, it’s a tool. But we have to be careful not to throw the baby out with the bathwater here, in the sense that that tool now makes things available to the average person. It upskills the average person in a way that they couldn’t themselves before.
So if we immediately go to put guardrails, prevent access, things like that, we’re shutting out a large part of society. And I’ll say, as somebody who worked on Google Search for many years, there were many arguments in Google Search that we, for example, shouldn’t put health information on Search, because the average person isn’t smart enough to deduce information about their own health from Google. But the average person can’t afford a doctor either, right? There are endemic problems in society that prevent you from doing that. So does the answer to that question cause suffering, or does it do less harm and give people a pathway they can learn from? That’s an important question to ask ourselves here as we think about AI: yes, it is imperfect at this moment.
Can we understand? Can we educate? Can we work inside the system that exists? But we can’t ignore it either. We can’t say it made one mistake, therefore I will not use it. And I will also call out that the imperfection of us as humans is very much there too, right? There are many times we look at these systems, at, say, a Waymo autonomous vehicle, and we say, look, it had six accidents last year. But there are 30,000 deaths from car accidents in the U.S. a year, right? So statistically speaking, this is still much safer. These are the sorts of examples we have to look at to understand where to apply it, how to apply it, and what the overall societal good is from using it.
Yeah. No, thanks. I think this is a very relevant discussion we are having, and there’s always a fight between whether we should have a RAG architecture or just give everything to the LLM to do, because it has more capacity and more GPU. But either-or is not possible. And then there is so much of the world’s data you may not want to give away; you may want to keep the data, and sovereignty comes in a lot. This has been a discussion over the last two days in most of the panels I have been on: you want to keep your data.
Countries want to keep the data with themselves, and they actually don’t want to train on it, because with the choice of LLMs you want a lot of options and you want to use them here, there, everywhere. So I’ll come back to you, Rohitji. We talked about administrative data, and you talked about a framework. My question is: how can alternate data, secondary data beyond administrative data, also be brought in? And the foundational framework you talked about, if that framework is adopted by industry: one, is it possible? And two, what kind of data economy can it start?
So this is early morning. Let me take an audience poll on it. How many of you think that what Shalini asked is a governance issue? Just raise your hand if you feel it’s a governance issue. Anyone? And how many of you feel it’s a technological issue? What she asked, how to make alternative data ready for AI, that was the question. So how many of you feel it’s a technology issue? There are no prizes and no punishment, so feel free to raise your hand the way you think. Technology, okay. So I am with that gentleman: I feel it’s a governance issue.
And I’ll also work on it. So what are we talking about? We are talking about data generated from different sources, be it alternative data sources or administrative data sources. My co-panelist just talked about getting data from different sources that are not aligned to each other. So it’s a governance issue which we need to understand first. And of course I completely agree with Shalini when she said that we need a federated model; perhaps Prem said that. There cannot be one sole owner for the data of this country, or for that matter for any country. Somebody needs to play the role of data steward; somebody needs to orchestrate this data ecosystem. Being from the NSO I have my own biases, so I’ll say the NSO can do it, but of course that’s for the people to decide. Now let’s understand what we need for AI-ready data. First, a cataloging of it; I’m just going to take one minute on this. You should have everything cataloged, in any industry, in any government organization: this is my data set, these are the indicators, these are the definitions, and so on. You need a catalog of your data. Second, that catalog should not be a PDF. It should be, as she was saying, machine-readable: a JSON file, probably. There are many other formats, but let’s say a JSON file.
Second point, you should have metadata for it. If you don’t have metadata for it... I mean, the other day I was on another panel with Prem, and I said the thing which irritates me the most is lack of metadata. Without it I’m driving blind. I don’t know what the word “frequency” means; it may mean hundreds of things. So you should have metadata, and again not in a PDF. Whatever I’m talking about, JSON or XML, there are many formats; let’s just say machine-readable. Third, you should have a context file. The machine has now read the metadata, but it wants to know: where do I find the meaning of “frequency”?
So the machine should have a context file where the source is written: go there and you will find the meaning of “frequency”. The metadata will not carry the meaning of “frequency”; it will only say that frequency means quarterly. The machine then needs to understand what that frequency means. So that’s what she was talking about, and Prem again was talking about. That brings us to the point that we need a business glossary. He also talked about a knowledge graph, which is just a sophisticated version of a business glossary. So once we have sorted this out, we need to ask: what type of codes are we working with?
The gentleman just beside me talked about two data sources using different codes for the same thing. So then we have to standardize those codes. And lastly, we have to structure our data. Data needs to go into a structured database. It should be defined, and that’s nothing new I’m saying: defined by dimensions, by attributes, by its role. So “time” means temporal. You can’t just write “time” and expect an LLM to understand what time means; you have to say that time means temporal. And once you have these ready and available, there are two use cases, and this is my last quick point. One is: am I using it for my own use case?
Am I training my own model with it? Then I can put all of this in one file and feed it to my model. But if I’m expected to create an MCP for my database, then I have to create separate files and put them up at a URI or URL where any model can go; the connector can direct the model to that resource, and then things happen. And this is all from my personal experience, and Shalini ji knows about it, from when we developed our own MCP server.
We are loving the amount of reach-out that has happened to use the data sets. You can actually ask a question like: how has the price of moong dal moved over the last year, quarter-wise or month-wise? That capability is there now. And it has happened because the data was always there; we do the calculation of the wholesale price index, the commodity price index. It’s just that now it is AI-ready for people to consume and query, and it is connected to Claude and ChatGPT.
Ashish, I’ll go to you, building on what Rohitji stopped at, which is the use cases, since you come from the solutions side. How do you visualize and imagine solutions and use cases combining, say, administrative data and alternate data? I’m not going into personal data, because there’s a lot of consent involved there, but at least the many secondary sources of data that are available: how do we combine them and make them more powerful?
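Rohit’s checklist, a machine-readable catalog, metadata, a context file, standardized codes, and typed dimensions and attributes, can be sketched as a single catalog entry. The field names below are hypothetical for illustration, not an official MoSPI or SDMX schema; they just show the shape of the layers he describes.

```python
import json

# Illustrative sketch of an AI-ready catalog entry. Field names are
# hypothetical, not an official schema. Each key maps to one layer of the
# checklist: catalog, machine-readable metadata, a context-file pointer,
# and typed dimensions/attributes ("time means temporal").

catalog_entry = {
    "dataset": "wholesale_price_index",
    "description": "Monthly wholesale price index by commodity",
    "format": "json",                       # machine-readable, not a PDF
    "metadata": {
        "frequency": "quarterly",           # the raw value alone is ambiguous...
        "unit": "index (base year = 100)",
    },
    "context": "https://example.gov.in/context/wpi.json",  # ...so point to definitions
    "dimensions": [
        {"name": "time", "role": "temporal"},
        {"name": "commodity", "role": "categorical"},      # standardized codes
    ],
    "attributes": [{"name": "value", "type": "float"}],
}

REQUIRED = ("dataset", "metadata", "context", "dimensions", "attributes")

def is_ai_ready(entry):
    """Minimal readiness check: every layer must be present and not PDF-bound."""
    return all(k in entry for k in REQUIRED) and entry.get("format") != "pdf"

print(json.dumps(catalog_entry, indent=2))
print("AI-ready:", is_ai_ready(catalog_entry))
```

A file like this serves both of Rohit’s use cases: fed directly to your own model, or published at a URL where a connector can point any model to it.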
I think, as you rightly pointed out, I come from the solution perspective, and with agentic AI coming in we now look at every solution in the form of a journey. We are going past the mechanism of a point solution where you ask and it reverts with the answer. Now the use case has to decide, at each part of the journey, what data you need, and that will dictate whether it is additional data sets from outside or a public data set. The only challenge I see here is who is accountable for that data. The policies have to sit in the solution, at the API level, at the policy-engine level, going along with the solution, and they should be enforceable automatically.
If you are thinking that a human being will enforce that policy, it will break. It will break in no time. So what we are trying to do is create those reusable artifacts, as DPIs or DPGs (it will fall into one of those categories), which allow those policies to be set for a data set in an easy, reusable way, so that everybody doesn’t have to recreate such policies from scratch. That’s the way to move forward.
You mentioned your lab. I’m sorry, I put you on the spot there. Tell us more about your lab. What more work are they doing?
So that’s my current job. Previously I was heading a GenAI company, by the way, and I will talk separately later about the PDF challenge, which we thought we had solved; we didn’t fully, but we were on the way. The current lab, which is very exciting, is a collaboration between Microsoft and IIIT Bangalore. A4I stands for the AI Innovation for Inclusion Initiative. The idea is not to run small pilots where we do one little thing here and diagnose; it should be population scale, and we want to launch it as a DPG so that it can be used widely. So we are working in school education, working with teachers to make their lives easier.
We are working on accessibility: how blind children can be taught STEM so that they can hope to become a physicist or a mathematician. Today it’s very difficult even to read a book. And the third area is working with last-mile health workers. Our current solution is a RAG-based AI combination, but we are looking at exactly the problem you mentioned, that it’s either this or that; I think there are plenty of answers in between, and that’s what we are exploring.
Thank you. Thank you so much. Prem, I’ll again build on the concept we were discussing of use cases. I just want you to paint a picture: if you have data in knowledge graphs, like what you mentioned, and Data Commons is present, what more use cases become possible with secondary data? How can India, and not just India but the Global South, benefit from this? Please feel free to draw on the use cases you have built in your sandbox environment.
Yeah, I’ll give two very different examples. These might not be exactly where the sandbox is today, but where it could go tomorrow, right? One is this: at the end of the day, the Ministry of Statistics does a lovely job collecting as much information as it can. The whole ministry does; the government does. But it’s a top-down data collection.
I’m sorry, I’ll just interrupt you. I think Rohitji will say it’s not top -down. It’s actually at the field level, it’s bottom -up.
That’s fair, that’s fair.
He will say that, it’s bottom -up.
That’s fair, that’s fair. You’re correct, it’s bottom-up. That said, we also have alternate data sources. Sometimes they supplement the surveys and further show that, yes, the data collected is correct. At times they disagree, and those disagreements are also interesting to understand, to the point of asking where the survey question is flawed, or where civil society is seeing something, or has visibility into something, that we don’t have access to. The more of these data sets come together, the more points of friction appear, and again, this is where the human intelligence comes in. Show me the points of friction. I have a haystack full of needles: which needles do I pay attention to? So this is one example if I’m at the government or Ministry of Statistics level.
Now let’s go to the completely opposite end. I’m a small business owner. I’m setting up a physical shop. Where should I set it up? Right? Where I set it up depends on mobility traffic, depends on the demographics and affordability in that space, depends on all types of things. Right? It’s a large data question. But that MSME owner is often ill -equipped to answer any of those questions, is often taking a shot in the dark. And that shot in the dark is a costly shot in the dark if they’re wrong. Right? Because they are taking the full risk of that decision. Now with the data commons that we’re building, the question becomes can we reduce that risk for that individual?
Can we help them model, understand, and de-risk the decision they’re making, based on the audience they want, the footfalls they want, the location they’re choosing? That’s a very specific example. But these are two very opposite examples of how bringing all of this data together, which we often think of as aligned towards international organizations or a government ministry, is actually usable on the ground by an individual too.
Tell us a bit more: if someone wants to put up a Data Commons instance, how can they get started?
It’s actually quite simple. It’s easy enough that I can do it myself, which means you can. datacommons.org is an open-source platform, and we have a 20-minute guide to get started. You can set the whole thing up on your computer, take your CSV data set, and bring it in. And the thing is, once you bring one data set in, it overlays with all the data sets already in Data Commons. This creates a sort of network effect in the data, right? So if I am a chain store in India trying to figure out that next store location, and I bring in all my per-store sales revenue data once, then suddenly I can compare it to, and overlay it with, the 50,000 data sets that are already in Data Commons.
Before, if I wanted to do this as a chain store in India, I would normally have my people come up with maybe 10 or 12 different hypotheses, because then I have to get those 10 or 12 different data sets and write as many different data transforms so they’re all in the same format. That prevents us from having the level of creativity we want, where we can look across the entire landscape of the problem. And so this is sort of one of the things.
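The network effect Prem describes boils down to a shared key: once a private dataset uses the same entity identifiers as the public data, one join replaces a dozen bespoke transforms. The sketch below is a plain-Python illustration of that idea, not the actual Data Commons API, and the place names and figures are invented.

```python
# Illustrative sketch: overlaying a private dataset on public statistics via
# shared entity identifiers. All names and numbers below are invented.

public_demographics = {   # e.g. already in a Data Commons-style knowledge graph
    "city/Bengaluru": {"population": 13_600_000, "median_income": 520_000},
    "city/Jaipur":    {"population": 4_100_000,  "median_income": 310_000},
}

my_store_revenue = {      # a private CSV, keyed to the SAME place identifiers
    "city/Bengaluru": 9_200_000,
    "city/Jaipur":    3_100_000,
}

def overlay(private, public):
    """Join private metrics onto public statistics by shared entity ID."""
    return {
        place: {**public.get(place, {}), "revenue": rev}
        for place, rev in private.items()
    }

merged = overlay(my_store_revenue, public_demographics)
for place, row in merged.items():
    # Once joined, derived metrics fall out of a single expression.
    print(place, "revenue per resident:", round(row["revenue"] / row["population"], 3))
```

The design point is that the cost of agreeing on identifiers is paid once, after which every additional public dataset keyed the same way becomes usable for free.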
Right answer. And it was a matter of trust for the NSO also: people were getting different answers for the data created by the NSO. That made us look toward an MCP server. A, it is open, so it makes our data interoperable for almost all AI systems; I am not saying all of them. Otherwise, what would happen? Every LLM has its own standard of API, so you would have to create those APIs first and then somehow get the LLM to approach each API. With this connector, it’s like the USB-C socket for a phone charger, if I may use the parallel: you can plug any USB-C cable in and use it for anything.
That’s what MCP is. The data comes, and the LLM comes and plugs into MCP, and it allows any LLM to connect. What you have to do now is connect that small tool with your LLM. That’s a one-minute job, and it’s available on our website: go to www.mospi.gov.in, and in the offerings section everything is available. You can do it in one minute, maybe two at the most. Anyone can. But there is still one challenge, which I must tell you: we somehow need to ensure that this becomes a default tool, so the user does not have to add it. Somebody forgets to add it, and the same situation starts happening again.
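The “USB-C socket” analogy, one uniform connector that any LLM can plug into instead of a bespoke API per model, can be illustrated with a plain-Python tool registry. This is a conceptual sketch of the pattern only, not the actual Model Context Protocol SDK, and the server name, tool, and price figure are invented.

```python
# Conceptual sketch of the MCP idea: a data provider exposes self-describing
# tools through one uniform interface that any model client can discover and
# call. Plain Python illustrating the pattern, not the real MCP SDK.

class ToolServer:
    """A registry of callable tools with self-describing docstrings."""
    def __init__(self, name):
        self.name = name
        self.tools = {}

    def tool(self, fn):
        """Register a function; its docstring doubles as the description."""
        self.tools[fn.__name__] = fn
        return fn

    def list_tools(self):
        """What a client sees when it 'plugs in': names and descriptions."""
        return {n: f.__doc__ for n, f in self.tools.items()}

    def call(self, tool_name, **kwargs):
        """Uniform invocation path, regardless of which LLM is calling."""
        return self.tools[tool_name](**kwargs)

server = ToolServer("national-statistics")

@server.tool
def get_price_index(commodity: str, year: int) -> float:
    """Return the wholesale price index for a commodity and year (toy data)."""
    toy = {("moong dal", 2024): 182.4}   # invented figure for illustration
    return toy.get((commodity, year), float("nan"))

# Any client discovers and calls tools the same way:
print(server.list_tools())
print(server.call("get_price_index", commodity="moong dal", year=2024))
```

The payoff Rohit describes follows from this shape: the provider writes the connector once, and Claude, ChatGPT, or any other client plugs in without a model-specific API.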
So right now people have to add it to their tools. But the biggest advantage I see is that people don’t have to come out of their workflow. If I have paid for an expensive Claude Pro subscription, I don’t have to leave it and go to a portal to get the data analysis; I can keep using the intelligence of Claude or ChatGPT (I don’t have a preference) with the verified data, as he talked about, the verified data of MoSPI. And the use cases on the web are now innumerable; people have just lapped it up. My favorite involves a Tamil song which talks about a lot of grains.
One of the messages I got, and I’ll share the link (it’s on Twitter, I mean X now), is that somebody created a CPI for all the grains mentioned in that song. CPI is the consumer price index, which basically tracks inflation. They took the grains out of the song, wheat and so on, and created a CPI index for them, naming it something like the “P index” after the song’s name. I’m not very conversant in Tamil, pardon me, but I’ll share that link. That’s my favorite use case. What I mean to say is that people can use the data the way they like it. That’s the bottom line, and that’s the NSO’s idea.
That is the most interesting use case I would have seen, and I really want to see it. Yeah, I’ll share it; do have a look at it. So one more thing I want to tell the audience, building on the use case Rohitji mentioned, where someone can just pick up the data: we have created a concept called the data boarding pass. This is like a boarding pass for AI-ready India.
This is a physical copy, but the concept is that once your data is ready, it passes a set of checklists. Then, as a B2B player, whether you are a policymaker, a researcher, or a market player wanting to build on top of it, you can take this data boarding pass and get onboarded for data usage, so that you can pick the data and start using it in your applications. So with a data boarding pass, say at a district level (and I’m just painting a scenario), you have a Data Commons where knowledge graph and data have all been combined together and created with the right context and everything.
And some organization now wants to know about, say, automobiles. An MSME manufacturer wants to access it and give information to dealers as to where scooters are being sold, where motorcycles are being sold, and what the income of that region has been over a period of time. That can be possible now, right? The data boarding pass enables it, makes it possible. And if you want to see how this exactly works, visit our foundation’s booth in Hall 3 on the first floor. Do visit; my team will be there to show you the actual generation of the data boarding pass. I think we have covered a lot. We have less time, but I want to take a couple of questions from the audience.
So feel free to ask. We have four minutes, so we can take two or three questions from the audience. I saw that hand first, sorry, and then I saw you, next to you. Yeah, please go ahead. Can someone give him a mic, please? Otherwise I’ll hand over mine.
Thank you very much. I wanted to ask about the business models of these platforms, because it is obviously extremely important to have high-quality data, but high-quality data is also expensive to collect and to maintain over time. So have you worked on how these kinds of platforms can be sustained over time? Does it have to be publicly paid, or whatever other models you may have? And the question is for everybody, I think.
Go ahead, then I’ll also add.
So I just have a quick clarification on that. The National Statistics Office of India is fully funded by the Government of India; as we all know, national statistics offices all over the world are publicly funded, through public money. So it’s our job to create data and make it available to the public. At the same time, just one quick disclaimer: open data is not free data. Somebody has paid for it. So depending on the use, we provide the data. If the use is research and things like that (I’m not getting into the details), then it’s free.
But if the use is commercial, then of course there is a system. There is a policy for it, and people have to pay accordingly.
Yeah. So I’ll also answer, because we have done a good amount of work on this. I would encourage you to see a paper I’ve put up on our People Plus AI website, which talks about the G.I.V.E. model for data. G is guaranteed trust, and we talked about it. I is incentive: why should I bring the data, and what will I get from it? V is value: if the data has no value, nobody is interested. And E is exchangeability: can I share the data? I’ll focus on the I, the incentive. There has to be an incentive for someone to bring the data and an incentive for someone to use the data, and that value will be monetized. That is the data economy. If you ask me, this data economy is actually already running without a formal mechanism: there is a good amount of money in selling data, buying data, lead generation; a huge number of things are happening. This formalizes it. But what the price will be has to stabilize; that has to happen at the regional level with the private sector. We have been working in that direction, so that the incentive model is clear, but the actual price is a discovery mechanism.
It’s very interesting to hear all this; that’s amazing. One key scenario we see every day, and it troubles us a little, is a road getting made and then dug up again after a few days. It may not feel good, but that’s how it is, because somewhere there is a disconnect in the data, or somewhere in the policy-making decision. Do we have some way to apply these kinds of pieces there, say in the tender ecosystem, so we don’t have a road made and then dug up for a pipeline after a very short window?
Yeah, maybe I’ll answer it. See, India has put the whole digital public infrastructure in place. That is the DPI thinking: UPI, Aadhaar, DigiLocker, DigiYatra. They were digital rails which were put together, and this data infrastructure that we talked about today is going to be that kind of rails. Is it going to be dug up? Are there going to be holes in it? Maybe; no promises. But I think it’s a journey, and if we don’t start it now, it’s going to hit us later on. So no promises, but yes. Rohit, do you have anything to add?
I just wanted to add that we need to keep working on these data sharing platforms and all the philosophies we just talked about, like accessibility, sharing, analysis, use of AI, and things will improve slowly but steadily, I’m very sure about it.
Time is up, and the next session is going to start. So thank you so much for listening in to the AI-ready data session, and please visit the booth to see it actually in action. Thank you.
Shalini Kapoor
Speech speed
128 words per minute
Speech length
2572 words
Speech time
1200 seconds
Data Fragmentation and the Need for AI‑Ready Information
Explanation
Shalini highlights that data is trapped in fragmented silos and often remains only digitised, creating a lack of trust. She stresses the need to make data interoperable, safe and trusted so it can be linked and used for AI applications.
Evidence
“Deep work on working on fragmented data silos.” [5]. “It can be bridged but we have to think about how to make data interoperable useful and AI ready.” [1]. “so that in a safe, trusted, and these two are very important, safe and trusted manner, the data can be linked, made useful and then made available.” [3]. “So there is a fear, there’s lack of trust today, and that data… data, it stays where it is, like digitized.” [7].
Major discussion point
Data Fragmentation and the Need for AI‑Ready Information
Topics
Data governance | Artificial intelligence
Benchmarking AI Outputs for Trust and Stability
Explanation
She points out that the same question asked to different LLMs or multiple times to the same model yields divergent answers, underscoring the need for a benchmark to ensure consistent and trustworthy AI responses.
Evidence
“That’s what we are working at also because this is a benchmark which is needed really on the ground, right?” [79]. “we are actually working to see that the same question if you ask, multiple times across LLMs, and also to one LLM many times by different farmers, both options, you get different answers.” [84]. “people are getting different answers for the data which is created by NSO.” [83].
Major discussion point
Trust, Stability, and Benchmarking of AI Outputs
Topics
Artificial intelligence | Monitoring and measurement
G.I.V.E. Framework for Data Economy Incentives
Explanation
Shalini introduces the G.I.V.E. (Guarantee, Incentive, Value, Exchangeability) model, arguing that both data providers and users need incentives and a clear monetisation path for a sustainable data economy.
Evidence
“right which is can i share the data so i’ll focus on the i the incentive there has to be an incentive for someone to bring the data and there has to be an incentive for someone to use the data and that value will be monetized that is the data economy…” [118]. “G is guaranteed trust.” [90]. “I is incentive.” [127]. “The V is the value.” [129]. “E is exchangeability.” [125].
Major discussion point
Business Models, Incentives, and the Data Economy
Topics
The digital economy | Financial mechanisms
Data Boarding Pass Concept for Rapid On‑boarding
Explanation
She proposes a “data boarding pass” that lets B2B actors quickly select and onboard AI‑ready datasets, enabling faster deployment of applications across sectors.
Evidence
“You can take this, you know, this concept of data boarding pass and get onboarded onto the date or for the data usage so that you can pick the data and then start using it in your applications.” [141]. “So the data boarding pass enables it, makes it possible.” [142]. “this is a data boarding pass so this is a data boarding pass This is like for AI ready India.” [143].
Major discussion point
Practical Use Cases and Impact Scenarios
Topics
Data governance | Artificial intelligence
Rohit Bardawaj
Speech speed
185 words per minute
Speech length
2308 words
Speech time
746 seconds
Lack of Uniform Definition for AI‑Readiness
Explanation
Rohit questions whether the ecosystem has agreed on a common definition of AI‑ready data, emphasizing that without a shared definition the effort remains fragmented.
Evidence
“do you have uniform definition of what is AI readiness at this point in time do we have and I’ll not say that it’s not there in the ecosystem it’s there in the ecosystem that but do we have an agreement about it” [16]. “It should be defined and that’s not new I’m talking about.” [24].
Major discussion point
Data Fragmentation and the Need for AI‑Ready Information
Topics
Artificial intelligence | Data governance
Proposed Shared Framework and Standards for AI‑Ready Data
Explanation
He advocates for a shared framework with core and aspirational layers, coupled with cataloguing, machine‑readable JSON metadata, and a business glossary to standardise data for AI consumption.
Evidence
“let us create this framework let us have a core ai readiness part and an aspirational ai readiness part and work on” [20]. “first idea is to create a framework agreed framework, say people not only me, it’s not about my way or highway, me all of us work together create that framework” [39]. “we need a cataloging of it … you should have a catalog … not pdf … machine readable json file” [11]. “We need a business glossary.” [56]. “So we need to standardize that codes.” [33].
Major discussion point
Frameworks and Standards for AI‑Ready Data
Topics
Data governance | Artificial intelligence
Governance Challenge in Integrating Alternative Data
Explanation
Rohit stresses that bringing alternative or secondary data into AI pipelines is primarily a governance issue, requiring a federated model, a data steward, and policy enforcement at the API level.
Evidence
“We need a federated model.” [41]. “It’s a governance issue which we need to understand first.” [92]. “someone needs to play the role of data steward” [11]. “How to make alternative data ready for AI.” [8].
Major discussion point
Governance vs. Technology for Integrating Alternative Data
Topics
Data governance | The enabling environment for digital development
Open Data Is Not Free – Commercial Use Requires Policies
Explanation
He notes that while public data is funded by the state, commercial usage must be governed by policies and possibly fees, countering the notion that open data is automatically free.
Evidence
“At the same time, just one quick disclaimer on that, that open data is not free data.” [116]. “if the resource, you know, the use is commercial, then, of course, there is a system.” [119].
Major discussion point
Business Models, Incentives, and the Data Economy
Topics
The digital economy | Financial mechanisms
Prem Ramaswami
Speech speed
188 words per minute
Speech length
2119 words
Speech time
672 seconds
AI‑Ready Data Must Be Machine‑Readable with Metadata and Context
Explanation
Prem describes the technical requirements for AI‑ready data: structured, machine‑readable formats, accompanying metadata, and a context file that identifies the source.
Evidence
“If we can get our data in that machine‑readable format, which means… structured, which means machine‑readable metadata also, and a format where that format specification is not stuck behind a 500‑page PDF” [25]. “Second point, you should have metadata for it.” [26]. “So machine should have a context file where the source is written.” [27]. “Third is, you should have a context file.” [28].
Major discussion point
AI‑Ready Data Must Include Machine‑Readable Metadata, Context Files, and Structured Formats
Topics
Data governance | Artificial intelligence
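The three requirements quoted above (structured machine-readable format, metadata, and a context file naming the source) can be sketched as a single catalog entry. This is an illustrative sketch only; the field names (`dimensions`, `attributes`, `context`) are assumptions for the example, not a published standard.

```python
import json

# Hypothetical catalog entry illustrating the three requirements:
# (1) structured, machine-readable format, (2) metadata with defined
# dimensions and attributes, (3) a context file where the source is written.
catalog_entry = {
    "dataset": "district_health_indicators",
    "format": "csv",  # structured data, not a PDF
    "metadata": {
        "dimensions": ["district", "year"],
        "attributes": ["unit", "collection_method"],
        "measures": ["infant_mortality_rate"],
    },
    "context": {
        "source": "National Statistical Office",  # provenance, stated explicitly
        "license": "open, non-commercial",
        "last_updated": "2025-12-31",
    },
}

# Emit the specification as machine-readable JSON rather than burying it
# behind a 500-page PDF, as the panel put it.
print(json.dumps(catalog_entry, indent=2))
```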
Data Commons as an Open‑Source Federated Knowledge‑Graph Stack
Explanation
He explains that Data Commons aggregates global datasets into a common knowledge graph, offering an open-source stack that can be deployed by organizations such as the UN for standardized data ingestion.
Evidence
“we try to bring multiple data sets globally together in a common knowledge graph and then put an AI search engine on top of it” [45]. “once you bring one data set in, it overlays with all the data sets already in Data Commons.” [48]. “open sourcing that stack is allowed, for example, the United Nations Statistical Department to use data commons as their back end.” [50].
Major discussion point
Data Commons Provides an Open‑Source, Federated Knowledge‑Graph Stack to Standardize Data Ingestion
Topics
Data governance | Artificial intelligence | Information and communication technologies for development
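The overlay idea described above can be sketched in a few lines. This is not the real Data Commons API; it is a toy illustration, under the assumption that federated sources agree only on shared entity identifiers (the IDs below mimic Data Commons style but are used purely for illustration).

```python
from collections import defaultdict

# Toy knowledge graph: entity ID -> {statistical variable: (value, source)}.
# Each source keeps governing its own data; only the shared-ID mapping is common.
graph = defaultdict(dict)

def ingest(rows, source):
    """Map one dataset onto the shared entity IDs, recording provenance."""
    for entity, variable, value in rows:
        graph[entity][variable] = (value, source)

# Two independently governed datasets land on the same entity ID, so bringing
# one dataset in "overlays" it with everything already in the graph.
ingest([("country/IND", "Count_Person", 1_400_000_000)], source="census")
ingest([("country/IND", "LifeExpectancy", 70.8)], source="health_ministry")

# A query over the shared ID now sees both contributions side by side.
print(graph["country/IND"])
```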
Practical Use Case: De‑risking MSME Location Decisions
Explanation
Prem illustrates how a retailer can upload its sales data to Data Commons, overlay it with 50,000 public datasets, and derive insights for store placement, demonstrating tangible impact for MSMEs.
Evidence
“if I bring in all my per store sales revenue data once, then suddenly I can compare that to the 50,000 data sets and overlay them that are already in Data Commons.” [131]. “MSME manufacturer wants to access it and give information to dealers as to where scooters are being sold…” [132].
Major discussion point
Data Commons Can Help MSMEs De‑risk Location Decisions
Topics
Social and economic development | Artificial intelligence
Grounding LLM Output in a Factual Knowledge Graph
Explanation
He proposes using the knowledge graph as a factual backbone so that large language models can fill gaps while staying anchored to verified data.
Evidence
“if I can ground it in those facts, can I then utilize the intelligence of the large model to then help me produce some knowledge from those facts or fill in the gaps in those facts?” [73]. “They supplement and they further show, yes, the data collected is correct.” [74].
Major discussion point
Ground LLM Output in a Factual Knowledge Graph to Supplement Missing or Outdated Facts
Topics
Artificial intelligence | Data governance
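The grounding pattern described above can be sketched as follows: retrieve verified facts from the knowledge graph first, then hand them to the model as context, so it fills gaps instead of inventing figures. The fact store and prompt wording are illustrative assumptions; the LLM call itself is omitted.

```python
# Minimal grounding sketch: facts come from a verified store, and the prompt
# instructs the model to stay anchored to them.
FACTS = {
    ("country/IND", "population_2021"): "1.39 billion (census-derived estimate)",
}

def grounded_prompt(question, entity):
    """Build a prompt whose factual backbone is the knowledge graph."""
    facts = [f"{var}: {val}" for (ent, var), val in FACTS.items() if ent == entity]
    return (
        "Answer using ONLY the verified facts below; say 'unknown' otherwise.\n"
        "Facts:\n" + "\n".join(facts) +
        f"\nQuestion: {question}"
    )

prompt = grounded_prompt("What is India's population?", "country/IND")
print(prompt)
```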
Ashish Srivastava
Speech speed
151 words per minute
Speech length
1240 words
Speech time
491 seconds
Three Core Requirements: Interoperability, Contextualization, Verifiability
Explanation
Ashish argues that AI‑ready data must be interoperable, carry contextual meaning, and be verifiable to support reliable decision‑making.
Evidence
“it has to be interoperable it has to be contextual and it should be verifiable” [10]. “Not verifiable data.” [15].
Major discussion point
Interoperability, Contextualization, and Verifiability Are Three Core Requirements
Topics
Data governance | Artificial intelligence
Domain‑Specific Vocabulary Failures and Glossary Solution
Explanation
He notes that LLMs stumble on domain‑specific terms and suggests combining a glossary (or knowledge graph) with the model to improve translation and understanding.
Evidence
“The moment they hit any domain‑specific vocabulary, that’s when they start failing.” [64]. “came up with a solution of using a glossary combined with the LLM so that it does a decent job in terms of overall translation.” [66].
Major discussion point
LLMs Falter on Domain‑Specific Vocabularies; Glossaries/Knowledge Graphs Can Bridge the Gap
Topics
Artificial intelligence | Capacity development
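The glossary-plus-LLM combination described above can be sketched simply: before sending a domain-specific sentence to a translation model, attach the glossary entries it mentions, so technical terms are not mistranslated. The glossary entries and prompt wording below are illustrative assumptions, not the panelists' actual implementation.

```python
# Illustrative domain glossary; real systems would load this from a curated
# terminology base or knowledge graph.
GLOSSARY = {
    "kisan credit card": "a short-term agricultural credit scheme (keep as proper noun)",
    "gram panchayat": "village-level local government body",
}

def with_glossary(text):
    """Prepend matching glossary entries so the LLM sees the domain context."""
    hits = {term: gloss for term, gloss in GLOSSARY.items() if term in text.lower()}
    notes = "\n".join(f"- '{term}': {gloss}" for term, gloss in hits.items())
    return f"Translate to Hindi. Domain glossary:\n{notes}\nText: {text}"

prompt = with_glossary("How do I apply for a Kisan Credit Card?")
print(prompt)
```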
Guardrails, Human‑in‑the‑Loop, and Risk‑Assessment for Reliable AI
Explanation
Ashish stresses that because AI models are probabilistic, deploying guardrails, human oversight, and risk‑assessment mechanisms is essential for trustworthy outcomes.
Evidence
“guardrails human in the loop risk assessment these are the tools which are available today …” [95]. “If we immediately go to put guardrails, prevent access, things like that, we’re preventing a large part of society.” [96].
Major discussion point
Guardrails, Human‑in‑the‑Loop, and Risk‑Assessment Mechanisms Are Essential for Reliable Deployment
Topics
Artificial intelligence | Building confidence and security in the use of ICTs
Agentic AI Solutions Should Be Journey‑Oriented
Explanation
He frames AI solutions as journeys, where each stage (data preparation, model use, outcome) must be managed, and accountability for data quality rests with the solution layer.
Evidence
“I come from the solution perspective and a solution now with agentic AI coming in we look at every solution in form of a journey.” [134]. “You manage a journey.” [98].
Major discussion point
Agentic AI Solutions Should Be Journey‑Oriented; Accountability for Data Quality Resides with the Solution Layer
Topics
Artificial intelligence | Capacity development
Speaker 1
Speech speed
136 words per minute
Speech length
23 words
Speech time
10 seconds
Request for Simple Step‑by‑Step Guide to Set Up Data Commons
Explanation
The participant asks for a low‑barrier, practical guide that would enable organizations to quickly deploy a Data Commons instance for downstream AI applications.
Evidence
“Tell us a bit more about, like if suppose someone wants to put up a Data Commons instance, how can they get started?” [49].
Major discussion point
Accessibility and Adoption of Data Commons
Topics
Capacity development | Information and communication technologies for development
Audience
Speech speed
172 words per minute
Speech length
200 words
Speech time
69 seconds
Sustainability and Data Gaps – Real‑World Project Delays
Explanation
The audience raises concerns about the need for a sustainable data‑economy model and cites fragmented data causing delays in infrastructure projects such as road‑construction pipelines.
Evidence
“Does it have to be, I don’t know, publicly paid or whatever models you may have?” [117]. “we see a road making getting made and stuck after few days … disconnection in the data …” [151].
Major discussion point
Audience Concerns on Sustainability and Data Gaps
Topics
Social and economic development | Data governance
Agreements
Agreement points
AI should be viewed as a tool to supplement human capabilities rather than replace them
Speakers
– Prem Ramaswami
– Ashish Srivastava
Arguments
Large language models are tools to supplement human knowledge, not replace it, and should upskill average users
LLMs comprise only 10-15% of a solution, with the remaining 85-90% being guardrails, human-in-the-loop oversight, and risk assessment
Summary
Both speakers emphasize that AI systems are tools that enhance human capabilities rather than complete solutions, requiring significant human oversight and complementary systems
Topics
Artificial intelligence | Building confidence and security in the use of ICTs
Data must be machine-readable and properly structured with metadata for AI readiness

Speakers
– Rohit Bardawaj
– Prem Ramaswami
Arguments
AI-ready data requires cataloging, machine-readable metadata, context files, business glossaries, standardized codes, and structured databases
Open-source approach with federated data governance allows local control while enabling interoperability
Summary
Both speakers agree that AI-ready data requires proper structuring, machine-readable formats, and comprehensive metadata, though they approach implementation differently
Topics
Data governance | Artificial intelligence | Information and communication technologies for development
Domain-specific context and glossaries are essential for AI systems to work effectively
Speakers
– Ashish Srivastava
– Prem Ramaswami
– Shalini Kapoor
Arguments
Domain-specific vocabularies require glossaries combined with LLMs for proper translation and context
Knowledge graphs provide factual basis to ground LLMs and fill information gaps
Data boarding pass concept enables B2B players to access AI-ready data with proper onboarding
Summary
All three speakers recognize that AI systems need domain-specific knowledge and contextual information to function properly, whether through glossaries, knowledge graphs, or structured onboarding processes
Topics
Artificial intelligence | Data governance | Information and communication technologies for development
Federated data governance model is preferable to centralized control
Speakers
– Rohit Bardawaj
– Prem Ramaswami
Arguments
Need for data stewardship and orchestration of data ecosystems at national level
Open-source approach with federated data governance allows local control while enabling interoperability
Summary
Both speakers advocate for federated models where data remains with local organizations while enabling interoperability, rather than centralized data control
Topics
Data governance | The enabling environment for digital development
AI consistency and reliability issues require careful attention and solutions
Speakers
– Rohit Bardawaj
– Shalini Kapoor
Arguments
Same prompts to AI with same datasets can produce different analyses, requiring trustworthy approaches
Stability and consistency of AI answers requires benchmarking across different LLMs and usage scenarios
Summary
Both speakers acknowledge the problem of AI inconsistency and are working on solutions to measure and improve reliability of AI systems
Topics
Artificial intelligence | Building confidence and security in the use of ICTs | Monitoring and measurement
Similar viewpoints
Both speakers view AI as inherently imperfect systems that require human oversight and should be treated as tools rather than complete solutions
Speakers
– Ashish Srivastava
– Prem Ramaswami
Arguments
AI models are probabilistic and will never be perfectly consistent, requiring external guardrails and human oversight
Large language models are tools to supplement human knowledge, not replace it, and should upskill average users
Topics
Artificial intelligence | Building confidence and security in the use of ICTs
Both speakers identify data fragmentation and silos as major barriers preventing effective access to information and services
Speakers
– Shalini Kapoor
– Ashish Srivastava
Arguments
Fragmented data across different departments creates orchestration burden for solution builders
Information divide prevents entrepreneurs from accessing relevant government schemes and subsidies
Topics
Data governance | Social and economic development | Closing all digital divides
Both speakers emphasize that data challenges are primarily governance issues requiring systematic policy and organizational solutions rather than just technical fixes
Speakers
– Rohit Bardawaj
– Ashish Srivastava
Arguments
Making data AI-ready is fundamentally a governance issue rather than just a technological challenge
Policies must be enforceable automatically at API and policy engine levels, not dependent on human enforcement
Topics
Data governance | The enabling environment for digital development
Unexpected consensus
AI limitations and the need for human oversight
Speakers
– Prem Ramaswami
– Ashish Srivastava
– Rohit Bardawaj
Arguments
Large language models are tools to supplement human knowledge, not replace it, and should upskill average users
LLMs comprise only 10-15% of a solution, with the remaining 85-90% being guardrails, human-in-the-loop oversight, and risk assessment
Same prompts to AI with same datasets can produce different analyses, requiring trustworthy approaches
Explanation
Despite coming from different sectors (a technology platform, industry practice, and government), all speakers showed remarkable consensus on AI's limitations and the critical need for human oversight, which is unexpected given the current AI hype cycle
Topics
Artificial intelligence | Building confidence and security in the use of ICTs
Open data requires sustainable business models and isn’t truly free
Speakers
– Rohit Bardawaj
– Shalini Kapoor
– Audience
Arguments
Open data is not free data – someone pays for collection and maintenance, with different pricing for research vs commercial use
Give data model requires guaranteed trust, incentives, value, and exchangeability for sustainable data sharing
Business models for data platforms need to address the high costs of collecting and maintaining quality data
Explanation
There was unexpected consensus across government, industry, and audience that ‘free’ data isn’t actually free and requires sustainable business models, challenging common assumptions about open data
Topics
Financial mechanisms | Data governance | The enabling environment for digital development
Overall assessment
Summary
The speakers showed strong consensus on key technical and governance aspects of AI-ready data, including the need for proper structuring, metadata, domain context, federated governance, and treating AI as a tool requiring human oversight. There was also agreement on addressing data silos and the need for sustainable business models.
Consensus level
High level of consensus across different sectors (government, industry, and technology platforms) on fundamental principles, which suggests these are well-established best practices. The implications are positive for developing coherent policies and standards for AI-ready data infrastructure, as stakeholders are aligned on core requirements and challenges.
Differences
Different viewpoints
Whether making data AI-ready is primarily a governance or technology issue
Speakers
– Rohit Bardawaj
– Audience
Arguments
Making data AI-ready is fundamentally a governance issue rather than just a technological challenge
Practical guidance needed for organizations wanting to implement Data Commons instances
Summary
Bardawaj conducted an audience poll and argued that preparing data for AI is fundamentally about governance structures and coordination rather than purely technical solutions, while audience members seemed more focused on technical implementation aspects
Topics
Data governance | The enabling environment for digital development
Extent of caution needed when implementing AI solutions
Speakers
– Rohit Bardawaj
– Prem Ramaswami
Arguments
Same prompts to AI with same datasets can produce different analyses, requiring trustworthy approaches
Large language models are tools to supplement human knowledge, not replace it, and should upskill average users
Summary
Bardawaj emphasized the need for caution due to AI inconsistency and untested capabilities, while Ramaswami advocated for embracing AI as a tool that democratizes access, warning against being overly restrictive with guardrails
Topics
Artificial intelligence | Building confidence and security in the use of ICTs
Role and proportion of AI in overall solutions
Speakers
– Ashish Srivastava
– Prem Ramaswami
Arguments
LLMs comprise only 10-15% of a solution, with the remaining 85-90% being guardrails, human-in-the-loop oversight, and risk assessment
Large language models are tools to supplement human knowledge, not replace it, and should upskill average users
Summary
Srivastava argued for a limited role for AI itself (10-15% of a solution), with heavy emphasis on external controls, while Ramaswami promoted AI as an empowering tool that should be more accessible to average users
Topics
Artificial intelligence | Building confidence and security in the use of ICTs
Unexpected differences
Data collection methodology characterization
Speakers
– Shalini Kapoor
– Prem Ramaswami
– Rohit Bardawaj
Arguments
Information divide prevents entrepreneurs from accessing relevant government schemes and subsidies
Data Commons can help small business owners make location decisions by reducing risk through data modeling
Need for data stewardship and orchestration of data ecosystems at national level
Explanation
An unexpected disagreement emerged when Ramaswami characterized government data collection as 'top-down'; Kapoor immediately corrected him, with Bardawaj's implicit support, noting that it is actually 'bottom-up' from the field level. This revealed different perspectives on how national statistical systems operate.
Topics
Data governance | Monitoring and measurement
Verification vs declared data reliability
Speakers
– Ashish Srivastava
– Rohit Bardawaj
Arguments
Data must be interoperable, contextual, and verifiable/governable to solve key problems
MCP server enables interoperability across AI systems without users leaving their workflow
Explanation
Srivastava raised concerns about survey data being ‘declared’ rather than ‘verified’ (like self-reported health conditions), questioning the reliability of much public data used for planning. This created tension with Bardawaj’s promotion of NSO data accessibility, as it implicitly questioned the quality of official statistical data
Topics
Data governance | Monitoring and measurement | Building confidence and security in the use of ICTs
Overall assessment
Summary
The discussion revealed moderate disagreements primarily around the balance between AI adoption and caution, with speakers agreeing on goals but differing on implementation approaches. Key tensions emerged between promoting AI accessibility versus ensuring reliability and trust.
Disagreement level
Moderate disagreement level that reflects healthy debate about implementation strategies rather than fundamental opposition to AI-ready data initiatives. The disagreements suggest need for balanced approaches that combine innovation with appropriate safeguards, and highlight the complexity of creating sustainable data governance frameworks that serve multiple stakeholders.
Partial agreements
Partial agreements
All speakers agreed on the need for standardized frameworks and interoperability for AI-ready data, but disagreed on implementation approaches – Bardawaj focused on institutional frameworks and governance, Ramaswami emphasized open-source federated models, and Srivastava stressed technical requirements for verification and context
Speakers
– Rohit Bardawaj
– Prem Ramaswami
– Ashish Srivastava
Arguments
Need for uniform definition and agreed framework for AI readiness with core and aspirational components
Open-source approach with federated data governance allows local control while enabling interoperability
Data must be interoperable, contextual, and verifiable/governable to solve key problems
Topics
Data governance | Artificial intelligence | The enabling environment for digital development
Both agreed that combining structured knowledge with LLMs is superior to using LLMs alone, but Ramaswami focused on knowledge graphs for factual grounding while Srivastava emphasized domain-specific glossaries for contextual accuracy
Speakers
– Prem Ramaswami
– Ashish Srivastava
Arguments
Combination of knowledge graphs with large language models provides better success for data access
Domain-specific vocabularies require glossaries combined with LLMs for proper translation and context
Topics
Artificial intelligence | Knowledge Graphs and Contextualization
Both agreed that sustainable data sharing requires proper economic models and incentive structures, but disagreed on mechanisms – Kapoor proposed market-driven price discovery while Bardawaj advocated for policy-based pricing differentiation between research and commercial use
Speakers
– Shalini Kapoor
– Rohit Bardawaj
Arguments
Give data model requires guaranteed trust, incentives, value, and exchangeability for sustainable data sharing
Open data is not free data – someone pays for collection and maintenance, with different pricing for research vs commercial use
Topics
Data governance | Financial mechanisms | The digital economy
Similar viewpoints
Both speakers view AI as inherently imperfect systems that require human oversight and should be treated as tools rather than complete solutions
Speakers
– Ashish Srivastava
– Prem Ramaswami
Arguments
AI models are probabilistic and will never be perfectly consistent, requiring external guardrails and human oversight
Large language models are tools to supplement human knowledge, not replace it, and should upskill average users
Topics
Artificial intelligence | Building confidence and security in the use of ICTs
Both speakers identify data fragmentation and silos as major barriers preventing effective access to information and services
Speakers
– Shalini Kapoor
– Ashish Srivastava
Arguments
Fragmented data across different departments creates orchestration burden for solution builders
Information divide prevents entrepreneurs from accessing relevant government schemes and subsidies
Topics
Data governance | Social and economic development | Closing all digital divides
Both speakers emphasize that data challenges are primarily governance issues requiring systematic policy and organizational solutions rather than just technical fixes
Speakers
– Rohit Bardawaj
– Ashish Srivastava
Arguments
Making data AI-ready is fundamentally a governance issue rather than just a technological challenge
Policies must be enforceable automatically at API and policy engine levels, not dependent on human enforcement
Topics
Data governance | The enabling environment for digital development
Takeaways
Key takeaways
AI-ready data requires a comprehensive framework including cataloging, machine-readable metadata, context files, business glossaries, standardized codes, and structured databases
Data silos and fragmentation across organizations create significant barriers to AI implementation, requiring governance solutions rather than just technological fixes
AI should be viewed as a tool to supplement human intelligence (10-15% of solutions) rather than a complete solution, requiring guardrails and human oversight
Knowledge graphs combined with large language models provide more reliable and contextual data access than LLMs alone
Open-source, federated data governance models enable local control while maintaining interoperability across systems
A sustainable data economy requires clear incentive models with guaranteed trust, value creation, and exchangeability mechanisms
Practical implementations like MCP servers and Data Commons demonstrate viable pathways for making data AI-ready and accessible
Resolutions and action items
Create a shared framework for AI readiness with core and aspirational components that can be adopted by industry
Develop standardized cataloging systems with machine-readable metadata and context files
Implement the ‘data boarding pass’ concept for B2B onboarding to AI-ready data systems
Build benchmarking systems to measure consistency and stability of AI responses across different LLMs
Establish automatic policy enforcement at API and policy engine levels rather than relying on human enforcement
Visit demonstration booths to see practical implementations of AI-ready data systems in action
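The action item on automatic policy enforcement at the API level can be sketched as a small policy engine: every data request passes through it before any data is returned, so governance does not depend on a human remembering to check. The datasets, rule fields, and purposes below are illustrative assumptions, not an actual policy schema from the session.

```python
# Toy policy engine enforcing data-governance rules at the API boundary.
POLICIES = {
    "health_survey": {"allowed_purposes": {"research"}, "pii": True},
    "road_projects": {"allowed_purposes": {"research", "commercial"}, "pii": False},
}

def authorize(dataset, purpose, pii_cleared=False):
    """Return (allowed, reason); called automatically on every API request."""
    rule = POLICIES.get(dataset)
    if rule is None:
        return False, "unknown dataset"
    if purpose not in rule["allowed_purposes"]:
        return False, f"purpose '{purpose}' not permitted"
    if rule["pii"] and not pii_cleared:
        return False, "PII clearance required"
    return True, "ok"

print(authorize("health_survey", "commercial"))  # denied by the engine, no human needed
print(authorize("road_projects", "commercial"))  # allowed under the commercial rule
```

This also mirrors the tiered research-vs-commercial pricing idea: the same engine can route commercial requests to a paid tier instead of rejecting them outright.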
Unresolved issues
How to achieve widespread adoption of the proposed AI readiness framework across different organizations and sectors
Determining optimal pricing mechanisms for the data economy while balancing accessibility and sustainability
Resolving the technical challenge of ensuring consistent AI responses across different models and usage scenarios
Addressing the fundamental tension between data sovereignty/local control and the need for data sharing and interoperability
Managing the verification and quality control of declared vs. verified data in public datasets
Scaling solutions from pilot projects to population-scale implementations
Handling domain-specific vocabularies and dialects that current LLMs struggle with
Suggested compromises
Hybrid RAG (Retrieval-Augmented Generation) architecture that combines knowledge graphs with LLMs rather than choosing either approach exclusively
Federated data governance model that allows organizations to maintain local control while enabling interoperability
Tiered pricing model for data access – free for research use, paid for commercial applications
Phased implementation approach with foundational framework first, then aspirational features
Balance between preventing AI access through excessive guardrails while ensuring safety and accuracy
Combination of automated policy enforcement with human-in-the-loop oversight for critical decisions
Thought provoking comments
Do you have uniform definition of what is AI readiness at this point in time? People are not aware what it takes to make data AI ready… So the first idea is to create a framework agreed framework, say people not only me, it’s not about my way or highway, me all of us work together create that framework.
Speaker
Rohit Bardawaj
Reason
This comment was foundational because it challenged the entire premise of the discussion by questioning whether participants even had a shared understanding of ‘AI readiness.’ It highlighted a critical gap – that before solving technical problems, there needs to be conceptual alignment.
Impact
This shifted the discussion from technical solutions to fundamental definitions and frameworks. It established the need for collaborative standard-setting and influenced subsequent speakers to ground their contributions in clearer definitions and shared understanding.
If you give same prompt to AI with the same data set, it gives you two types of analysis… we should not be really gung-ho about things, which is still untested.
Speaker
Rohit Bardawaj
Reason
This comment introduced a sobering reality check about AI reliability and consistency – a critical issue for data-driven decision making. It challenged the optimistic tone about AI capabilities with concrete evidence of instability.
Impact
This comment prompted Shalini to reveal they were actively working on benchmarking this exact problem, leading to a more nuanced discussion about AI limitations and the need for stability measures. It grounded the conversation in practical challenges rather than theoretical possibilities.
LLMs or AI models are not the solution. They are only one of the inputs to the solution. And they comprise 10%, 15% of what you’re trying to do. It is what is the rest of 85%… it’s a probabilistic model… it cannot ever become as perfect that every time consistent.
Speaker
Ashish Srivastava
Reason
This was a paradigm-shifting comment that reframed AI from being the centerpiece to being just one component in a larger solution architecture. The mathematical grounding about probabilistic models provided scientific backing to the limitations discussion.
Impact
This fundamentally changed how the panel discussed AI implementation, moving from ‘how to make AI work’ to ‘how to build systems where AI is one reliable component.’ It led to deeper discussions about guardrails, human-in-the-loop systems, and risk assessment frameworks.
The world is a multi-dimensional problem… our brains are not inherently multi-dimensional… machines are really good at this but… we have to approach AI as a tool we can use. Not as the answer, but as a tool we can use to derive the answer.
Speaker
Prem Ramaswami
Reason
This comment provided a philosophical framework for understanding the human-AI relationship by clearly articulating the cognitive limitations of humans versus the computational strengths of machines, while maintaining human agency in the process.
Impact
This elevated the discussion from technical implementation to strategic thinking about human-AI collaboration. It influenced how other panelists framed their subsequent comments about AI applications and helped establish a more balanced perspective on AI capabilities versus human judgment.
When we talk of public data, a lot of it is declared data and not verified… especially when a lot of planning depends on surveys… what is the verification no doctor has actually verified that and you are going to make a decision based on that.
Speaker
Ashish Srivastava
Reason
This comment exposed a fundamental flaw in data quality that undermines the entire AI-ready data premise – that much of the data being prepared for AI consumption is inherently unreliable at its source.
Impact
This introduced a new dimension to the data readiness discussion – not just technical formatting but data integrity and verification. It led to discussions about the need for verifiable and governable data systems, adding complexity to the technical solutions being proposed.
Open data is not free data. So somebody has paid for it… if the use is research… then it’s free. But if the resource… the use is commercial, then… people have to pay accordingly.
Speaker
Rohit Bardawaj
Reason
This comment introduced the economic reality of data infrastructure, challenging assumptions about ‘free’ public data and highlighting the sustainability challenges of data commons initiatives.
Impact
This prompted a broader discussion about business models and incentive structures for data sharing, leading Shalini to elaborate on the ‘GIVE’ model and data economy concepts. It shifted the conversation from purely technical to economic and policy considerations.
Overall assessment
These key comments fundamentally shaped the discussion by introducing multiple layers of complexity that moved the conversation beyond technical implementation to address foundational issues. Rohit’s opening challenge about definitions set a tone of critical examination rather than assumption-based discussion. The reliability and limitation comments by both Rohit and Ashish created a more realistic framework for AI implementation, while Prem’s philosophical framing provided a balanced human-AI collaboration model. Ashish’s data verification concerns added a quality dimension that hadn’t been adequately addressed, and the economic reality check introduced sustainability considerations. Together, these comments transformed what could have been a purely technical discussion into a comprehensive examination of the social, economic, philosophical, and practical challenges of creating AI-ready data infrastructure. The discussion evolved from ‘how to make data AI-ready’ to ‘what does AI-ready really mean, what are its limitations, and how do we build sustainable, reliable systems around it.’
Follow-up questions
How do we create a uniform definition and agreed framework for AI readiness of data?
Speaker
Rohit Bardawaj
Explanation
There is currently no uniform definition of what constitutes AI-ready data, and establishing this foundation is crucial before any meaningful progress can be made in making data AI-ready across organizations and institutions.
How can we ensure the same question asked multiple times to LLMs produces consistent answers?
Speaker
Shalini Kapoor and Rohit Bardawaj
Explanation
Both speakers noted that asking the same question to an LLM multiple times or across different LLMs produces different answers, which is a critical reliability issue that needs to be addressed through benchmarking and standardization.
How can we make data interoperable across different government departments and agencies?
Speaker
Ashish Srivastava
Explanation
The fragmentation of data across different departments (like women and child development vs. health and family welfare) creates barriers to integrated decision-making and comprehensive solutions.
How can we verify declared data versus actual verified data in surveys and public datasets?
Speaker
Ashish Srivastava
Explanation
Much of the public data used for planning is based on self-declared survey responses rather than verified information, which can lead to inaccurate decision-making and policy formulation.
How can we create automatic policy enforcement at the API level for data governance?
Speaker
Ashish Srivastava
Explanation
Manual enforcement of data policies is prone to failure, so there’s a need to develop automated systems that can enforce data governance policies at the technical level.
How can we make MCP (Model Context Protocol) servers a default tool rather than requiring manual addition by users?
Speaker
Rohit Bardawaj
Explanation
Currently users must manually add MCP tools to their workflow, which creates friction and potential for users to forget, defeating the purpose of seamless data access.
What sustainable business models can support high-quality data platforms over time?
Speaker
Audience member
Explanation
High-quality data is expensive to collect and maintain, so understanding viable business models for sustaining these platforms is crucial for long-term success.
How can data infrastructure prevent coordination failures in public works projects?
Speaker
Audience member
Explanation
The example of roads being dug up shortly after construction due to lack of coordination suggests a need for better data sharing and planning systems in government infrastructure projects.
How can we develop better translation capabilities for domain-specific vocabulary in regional languages?
Speaker
Ashish Srivastava
Explanation
While LLMs are improving at general translation, they still fail with domain-specific terms, which is critical for making AI accessible in local contexts and specialized fields.
How can we create effective guardrails and risk assessment mechanisms for AI systems while maintaining accessibility?
Speaker
Ashish Srivastava and Prem Ramaswami
Explanation
There’s a tension between making AI accessible to average users and implementing necessary safety measures, requiring research into balanced approaches.
Disclaimer: This is not an official session record. DiploAI generates these resources from audiovisual recordings, and they are presented as-is, including potential errors. Due to logistical challenges, such as discrepancies in audio/video or transcripts, names may be misspelled. We strive for accuracy to the best of our ability.