Safe and Responsible AI at Scale: Practical Pathways
20 Feb 2026 16:00h - 17:00h
Summary
Opening the panel, Shalini Kapoor highlighted that enterprises and governments hold vast amounts of information in fragmented PDFs and digitised documents, creating an “information divide” that limits AI’s ability to provide accurate answers [1-7][8-15][16-18]. She illustrated this with the example of a Nagpur entrepreneur unable to locate a biotechnology subsidy because the relevant government notification remained hidden in a siloed document, which LLMs could not retrieve [7-13][14-15].
Rohit Bhardwaj argued that before data can be considered AI-ready, the ecosystem must agree on a clear definition and a shared framework that includes cataloguing, machine-readable metadata, context files and business glossaries [33-46][160-205]. He emphasized that such a framework should be open and federated, avoiding a single data owner and ensuring that a data steward orchestrates the ecosystem [181-184][185-194].
Prem Ramaswami described Google’s Data Commons as an open-source platform that transforms diverse datasets into a machine-readable knowledge graph, enabling an AI search layer that can combine global statistics with local queries [55-63][64-66][69-71]. He noted that the system is designed to be bottom-up, allowing users to overlay their own CSV data onto existing public datasets, thereby reducing risk for small businesses making location decisions [277-283][298-302].
Ashish Srivastava added that real-world solutions suffer from data fragmentation, requiring interoperability, contextualisation through glossaries, and verification of declared data to be truly AI-ready [92-102][103-108][124-130]. He advocated for reusable policy artifacts (DPIs/DPGs) that can be automatically enforced at the API level, preventing reliance on manual human enforcement [228-236].
The participants agreed that LLM outputs can be unstable, prompting the need for benchmarks that test consistency across models and repeated queries [80-84][85-88]. A brief debate emerged over whether making alternative data AI-ready is primarily a governance issue or a technical one, with Rohit ultimately framing it as a governance challenge that requires standards and stewardship [162-170][176-180].
Shalini introduced the “data boarding pass” concept, a checklist-based certification that would allow organisations to certify data as AI-ready and facilitate secure, on-demand access [353-360][361-363]. She also referenced the GIVE (“give data, give model”) framework, which ties guarantees of trust, incentives, value and exchangeability together to sustain a formal data economy [390-398][399-401].
The panel concluded that while building AI-ready data infrastructures is a long-term journey, collaborative standards, open tools and incentive mechanisms are essential to unlock the massive potential of data for both India and the Global South [408-410].
Keypoints
Major discussion points
– The fundamental problem of fragmented, non-AI-ready data - Enterprises and governments hold massive information in PDFs, legacy systems, and siloed databases that lack trust, safety, and interoperability, preventing LLMs from delivering accurate answers. Examples include an entrepreneur in Nagpur unable to find a biotechnology subsidy because the notification is stuck in a document [7-15] and the massive compliance-query load of 3,000 entities handling 5 million new queries per year [23-27].
– Need for a shared, institutional framework to make data “AI-ready” - Panelists stress that institutions (e.g., MoSPI/NSO) must define standards, create federated governance, and provide catalogues, metadata, context files, and business glossaries so data can be safely reused. Rohit proposes a consensus framework and a “core + aspirational” AI-readiness model [33-46]; later he outlines concrete steps: machine-readable JSON catalogues, metadata, context files, and knowledge-graph glossaries [160-224].
– Practical use-cases that illustrate the value of AI-ready data - When data is structured and linked, it can power diverse applications: government-level statistical analysis, MSME location-risk modelling, agricultural decision support, education-domain translation via glossaries, and health-worker tools. Prem describes how Data Commons can de-risk a shop-owner’s location choice by overlaying private sales data with roughly 50,000 public datasets [298-302]; Ashish highlights a journey-centric solution that enforces data policies automatically [228-235].
– Trust, consistency, and benchmarking challenges - LLMs can return different answers for the same query, raising concerns about reliability. Rohit cites a study where identical prompts produced divergent analyses [75-82]; Shalini notes ongoing work on a benchmark to measure answer stability across LLMs and users [84-88]; Ashish stresses the need for guardrails, human-in-the-loop risk assessment, and verification of public data [125-130][126-130].
– Building a sustainable data economy with incentives - The panel proposes mechanisms such as a “data boarding pass” checklist, the G-I-V-E model (Guarantee, Incentive, Value, Exchangeability), and differentiated licensing (free for research, paid for commercial use) to motivate data contribution and ensure long-term funding. Shalini outlines the boarding-pass concept and incentive framework [353-361][391-399]; Rohit clarifies public funding and commercial licensing for NSO data [380-388].
Overall purpose / goal
The discussion aimed to diagnose why large-scale data in India remains “AI-unready,” to propose institutional and technical standards that make data safe, trusted, and interoperable, and to illustrate how such standards can unlock high-impact applications for government, MSMEs, and the broader public sector while laying the groundwork for a formal data economy.
Tone of the discussion
– The conversation opens with a concerned, problem-identifying tone, highlighting data silos and trust gaps.
– It shifts to a collaborative, solution-focused tone as participants outline frameworks, open-source tools, and federated governance.
– Mid-session the tone becomes cautiously critical, emphasizing inconsistencies in LLM outputs and the need for benchmarks and guardrails.
– Toward the end it turns optimistic and promotional, showcasing concrete use-cases, the “data boarding pass,” and calls to action for audience engagement.
Overall, the tone evolves from problem-statement to constructive planning, tempered by realism about technical limits, and concludes with an encouraging call for adoption and partnership.
Speakers
– Ashish Srivastava
– Area of Expertise: AI, data interoperability, contextualization, verification, agentic AI, education and health solutions.
– Role/Title: Practitioner; leads the AI Innovation for Inclusion Initiative (A4I) Lab – a collaboration between Microsoft and IIIT Bangalore; former head of a Gen AI company. [S1]
– Prem Ramaswami
– Area of Expertise: Data Commons, knowledge graphs, AI-ready data, open-source data platforms, AI-driven search.
– Role/Title: Google – Lead for the Data Commons project (open-source stack, knowledge-graph integration). [S2]
– Shalini Kapoor
– Area of Expertise: AI-ready data governance, data economy, policy, trusted and safe AI deployment.
– Role/Title: Chief Strategist, XSTEP Foundation. [S4]
– Rohit Bhardwaj
– Area of Expertise: AI-readiness frameworks, data standards, governance, metadata and cataloguing.
– Role/Title: Representative of MoSPI/NSO (the national statistical agency that calculates India’s GDP, with source data down to village/taluka level). [transcript]
– Speaker 1
– Area of Expertise: (not specified)
– Role/Title: Moderator/host (unspecified). [S7]
– Audience
– Area of Expertise: Varied (participants asking questions on data platforms, business models, etc.).
– Role/Title: Audience members / questioners. [S10][S11][S12]
Additional speakers:
– (None identified beyond the list above)
The panel opened with Shalini Kapoor (Shalini) describing a fundamental bottleneck: enterprises and governments hold vast quantities of information in fragmented PDFs, legacy systems and isolated silos. Because artificial intelligence, and especially large language models (LLMs), “thrives on data” [2] and much of this data is “digitised but stays where it is” [6], AI cannot retrieve the answers users need. She illustrated the problem with a concrete case: an entrepreneur in Nagpur looking for a biotechnology-plant subsidy is unable to locate the relevant government notification because it is hidden in a siloed document, and her queries to LLMs and conventional search tools return nothing [7-13][14-15]. The “information divide” is compounded by a lack of trust in sharing data with AI systems [5-6].
The scale of the challenge was underscored by an example of an organisation that serves 3,000 entities and must handle five million new compliance queries each year [23-27]. Such a volume of “new compliances” generated by multiple government bodies creates a massive problem that can only be bridged if the data is made interoperable, useful and AI-ready [28-29].
Rohit Bhardwaj (Rohit) then shifted the discussion to the need for a shared institutional definition of AI-readiness. He asked whether the ecosystem already has a “uniform definition” and argued that a consensus framework, comprising a “core + aspirational” model, is essential [33-46]. According to Rohit, AI-ready data must be accompanied by a machine-readable catalogue, rich metadata, a context file and a business glossary; without these artefacts the data cannot be safely and reliably consumed by AI [160-205][184-205]. He further stressed that any framework should be open, federated and avoid a single data owner, with a designated data steward orchestrating the ecosystem [181-184][177-184]. Rohit also described the MCP server: a lightweight connector that lets any LLM plug into a catalogued dataset via a standard URI, analogous to a USB-C socket, enabling seamless integration without leaving the user’s workflow [221-240].
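To make the checklist concrete, the sketch below shows what such artefacts might look like for a single dataset: a machine-readable catalogue entry carrying metadata, a pointer to a context file, and a small business glossary. The field names and the dataset itself are illustrative assumptions, not an official NSO/MoSPI schema.

```python
import json

# Hypothetical catalogue entry for one dataset; field names are illustrative only.
catalogue_entry = {
    "dataset_id": "cpi_rural_monthly",
    "title": "Consumer Price Index (Rural), monthly",
    "metadata": {
        "frequency": "monthly",                        # raw code; its meaning is resolved via the context file
        "dimensions": ["state", "commodity_group"],
        "attributes": {"time": {"role": "temporal"}},  # declare the role explicitly rather than assume it
        "unit": "index (base year = 100)",
        "licence": "open; commercial use per policy",
    },
    # Context file: where a machine looks up what codes like "frequency" mean.
    "context_file": "https://example.org/context/cpi_rural_monthly.json",
    # Business glossary: domain terms an LLM should not have to guess.
    "glossary": {
        "CPI": "Consumer Price Index, a measure of retail price inflation",
        "base year": "Reference year whose index value is fixed at 100",
    },
}

# Publish as machine-readable JSON rather than a PDF.
print(json.dumps(catalogue_entry, indent=2))
```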
Prem Ramaswami (Prem) presented Google’s Data Commons as a concrete, open-source realisation of that vision. Data Commons ingests diverse public datasets, converts them into a structured, machine-readable knowledge graph, and layers an AI search engine on top, thereby improving the chance that an LLM can answer a query correctly [55-58][59-60]. The platform is deliberately “federated”: each organisation retains local governance of its data while still contributing to a common graph [61-64]. Prem highlighted a bottom-up use case: a small retailer can upload its own CSV of store-level sales, which then automatically overlays with roughly 50,000 public datasets already in Data Commons, allowing the retailer to model location risk and de-risk decisions that would otherwise be “a costly shot in the dark” [277-283][298-302]. He also noted that AI can be statistically safer than human-only decisions, citing road-traffic-death statistics [144-147].
Ashish Srivastava (Ashish) added a practitioner’s perspective on why data must be more than just structured. He described how fragmented health and education datasets impede integrated decision-making, and argued that AI-ready data must be interoperable, contextualised (through domain-specific glossaries), and verifiable because many public surveys are merely “declared data” without independent validation [92-102][103-108][124-130]. In his own work, Ashish combines glossaries with LLMs to improve translation of domain-specific terminology [102-106][112-118].
The panel then turned to the reliability of AI outputs. Rohit recounted a recent paper by two undergraduates that showed identical prompts fed to the same LLM on the same dataset produced two different analyses, underscoring the need for benchmarks [75-82]. Shalini confirmed that her team is developing a benchmark to test answer stability across multiple LLMs and repeated queries, noting that “the same question … asked multiple times … can give different answers” [84-88]. Ashish reinforced this concern, stating that LLMs should be treated as a small component (10-15% of a solution) and that robust guardrails, human-in-the-loop risk assessment and verification are essential to maintain trust [125-130].
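A minimal sketch of the kind of stability benchmark described here: ask the same question several times of several models and measure how often the answers agree. The `ask_llm` function is a placeholder assumption; any concrete model client would slot in there.

```python
from collections import Counter

def ask_llm(model: str, question: str) -> str:
    """Placeholder for a real model call; returns the model's answer as text."""
    raise NotImplementedError("plug in a concrete LLM client here")

def stability_score(models: list[str], question: str, trials: int = 5) -> dict[str, float]:
    """Fraction of trials per model that return the modal (most common) answer.

    1.0 means the model answered identically every time; lower values signal
    the answer drift the panel is concerned about.
    """
    scores = {}
    for model in models:
        answers = [ask_llm(model, question).strip().lower() for _ in range(trials)]
        most_common_count = Counter(answers).most_common(1)[0][1]
        scores[model] = most_common_count / trials
    return scores

# Example (hypothetical models and question):
# stability_score(["model-a", "model-b"], "What subsidy applies to a biotech plant in Nagpur?")
```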
When asked whether making alternative, secondary data AI-ready is a technical or governance problem, Rohit conducted an audience poll and concluded that it is primarily a governance issue that requires standards, a federated stewardship model and clear policy before any technical solution can succeed [162-170][176-180]. He reiterated the need for a data steward-potentially the National Statistics Office (NSO)-to catalogue datasets in machine-readable JSON, attach metadata and context files, and standardise codes and dimensions [181-221][184-221].
Shalini highlighted the tension between Retrieval-Augmented Generation (RAG) architectures and pure LLM approaches, emphasizing that data sovereignty and the need to keep sensitive data under local control prevent a single-model solution [310-322].
She introduced the “data boarding pass”, a checklist-based certification that signals a dataset has met AI-readiness criteria (catalogue, metadata, context, glossary). Once certified, the dataset can be instantly onboarded by B2B users, policymakers or researchers [353-363]. Shalini also presented the GIVE framework (Guarantee, Incentive, Value, Exchangeability) as a model for a sustainable data economy, arguing that incentives are needed for data owners to contribute and that value can be monetised while ensuring exchangeability [380-389][391-399]. Rohit clarified that the NSO is publicly funded, so research-use data is free, but commercial use is subject to a policy-driven pricing structure [380-388].
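As a rough illustration of the “boarding pass” idea, the checklist below verifies that a dataset carries the AI-readiness artefacts named on the panel before it is cleared for onboarding. The required fields are an assumption based on the criteria mentioned here, not a published certification standard.

```python
# Hypothetical AI-readiness checklist for a "data boarding pass".
REQUIRED_ARTEFACTS = ["catalogue", "metadata", "context_file", "glossary"]

def issue_boarding_pass(dataset: dict) -> dict:
    """Return a pass/fail result listing any missing AI-readiness artefacts."""
    missing = [field for field in REQUIRED_ARTEFACTS if not dataset.get(field)]
    return {
        "dataset_id": dataset.get("dataset_id", "<unknown>"),
        "ai_ready": not missing,
        "missing_artefacts": missing,
    }

# Example usage with a dataset that has no context file yet (illustrative values).
print(issue_boarding_pass({
    "dataset_id": "district_vehicle_sales",
    "catalogue": True,
    "metadata": {"frequency": "monthly"},
    "glossary": {"two-wheeler": "scooters and motorcycles"},
}))
# -> {'dataset_id': 'district_vehicle_sales', 'ai_ready': False, 'missing_artefacts': ['context_file']}
```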
An audience member raised concerns about the business model for maintaining high-quality data platforms. The response highlighted that public funding covers research use, while commercial licences generate revenue, and that the GIVE model provides a “formalised” mechanism for pricing and incentives [372-376][380-388][391-399]. Shalini further noted that without a clear incentive structure, “the data economy is actually running without a formal mechanism” [390-399].
The discussion also revealed points of agreement and divergence. Both Shalini and Rohit agreed that statistical data collection is bottom-up, i.e., gathered at the field level rather than imposed centrally [269-276]. Prem argued that, despite imperfections, AI can be statistically safer than human-only decisions [144-147]; Ashish warned that AI should remain a minor (10-15%) component of any solution, requiring extensive human oversight [125-130]. Technically, Rohit’s checklist-centric approach (catalogues, metadata, context files) differed from Prem’s emphasis on a knowledge-graph-centric, federated stack [184-221][55-64].
In conclusion, the panel converged on five pillars for AI-ready data: (1) a common, detailed definition that includes cleaning, linking, safety, trust, machine-readable catalogues, metadata and context files; (2) a governance-first, federated stewardship model to avoid single-point ownership; (3) the necessity of benchmarks and human-in-the-loop guardrails to ensure trustworthy AI outputs; (4) the importance of domain-specific glossaries or knowledge graphs for contextualisation; and (5) a sustainable data-economy model that aligns incentives, value and exchangeability. Action items include drafting the AI-readiness framework slide deck (Rohit), publishing machine-readable catalogues and glossaries (Rohit), extending Data Commons with contextualisation features (Prem), formalising the data-steward role and commercial licensing policy (NSO/Rohit), developing the answer-stability benchmark (Shalini), and promoting the data-boarding-pass and GIVE mechanisms to catalyse a formal data market (Shalini). The discussion closed with an invitation to visit the exhibition booth for a live demonstration and a reminder that building AI-ready data infrastructures is a long-term journey that must begin now to avoid future “holes in the rails” [408-410][401-406].
…deep work on fragmented data silos. As you all know, AI thrives on data. And today, most of the LLMs, what they have done is, they’ve definitely scraped the internet and they’re doing really well. But the value of the work, or what answer an LLM would give, is based on what it can fetch from the actual data, which means in enterprises and organizations, there’s a wealth of information. There’s a wealth of information stuck in PDFs, stuck in documents, which people have a fear of not giving to AI. So there is a fear, there’s a lack of trust today, and that data stays where it is, just digitized. So, for example, there could be an entrepreneur, say in Nagpur, wanting to know about the scheme to apply for the biotechnology plant that she wants to put up in Nagpur.
Now, if you see, the MSME industry has a scheme for her, for women, for biotechnology. And, you know, it’s a very good subsidy that’s available. But where is it stuck? It’s stuck in a government notification which came out, which she’s not aware of. And what she is doing is she’s actually going to LLMs and asking that question, and she’s not getting it. She’s also searching it in various places. She doesn’t get it. So that’s the divide, the information divide, which is existing. And the information which is there, stuck in documents or even in digitized form, has to be AI ready, so that in a safe, trusted manner, and these two are very important, safe and trusted, the data can be linked, made useful and then made available.
Now, this is a long journey. It’s not an easy journey, because the data journey is about how you clean the data, you make it ready, you link it, you make it relevant, you make it useful, and then present it in a manner so that there is choice, and you want to have a choice of various elements. I mean, we live in the age of choice, right? We don’t want to be locked into anything particular. So that’s the data problem that we have in front of us. The opportunity is humongous. I’ll give you an example: I’m talking to an organization which serves 3,000 entities, and those 3,000 entities actually manage 5 million new compliances in a year.
They have those kinds of queries, 5 million queries on new compliances. Forget existing compliances, because there are new compliances which get generated by the government, by various bodies, and then they have to search. So the problem is humongous, and it can be bridged. It can be bridged, but we have to think about how to make data interoperable, useful and AI ready. So with that background, I’d like to get into our panel and talk to some of the experts that we have today. My first question is to Rohit ji, who is from MoSPI. So India generates a vast amount of statistical and administrative data. MoSPI actually, for all of you, calculates the GDP for India. They have the source of all the data at village and taluka level. So the data is there, but as you think about making data AI ready, what do you think is the responsibility of an institution, and yours is an institution, to make the data trusted, safe and available to all?
Thank you, Shalini ji. Good morning, everyone. So, trusted, safe and AI ready for everyone. I would like all of you to take a step back on this and just let us understand: do we have a uniform definition of what AI readiness is at this point in time? Do we have one? And I’ll not say that it’s not there in the ecosystem, it’s there in the ecosystem, but do we have an agreement about it? So there are two issues we need to understand when we talk about AI readiness of data. One is that, so let me just go back to today’s conversation I had with one of my colleagues over a WhatsApp group, you know, we are all very active there.
So one of my papers has just been accepted in one of the largest conferences, and it’s about AI readiness of data, and he asked me what’s so great about it. So I asked why, what is not so great about it? So he said, I put Bangla into ChatGPT and it completely understands, so what’s new you are doing? So the point I’m trying to make is people are not aware what it takes to make data AI ready. We all understand, and then he told me that no, but it’s not understanding, and he talked about some dialect of this country, and we have a huge number of dialects, and Shalini ji, he asked me: how do I train ChatGPT on this dialect?
I said, it’s not my job, it’s Sam Altman’s job. So the issue here is that we don’t know. And that is the biggest responsibility of our institutions, like MoSPI, to make people aware what AI readiness is all about. And then, AI readiness means if I start, you know, talking about there should be a context file, there should be semanticity, there should be metadata, but for many of us, sorry about that, it would not make sense. So the first idea is to create a framework, an agreed framework, so that people, not only me, it’s not about my way or the highway, all of us work together, create that framework, put it up for people to know.
The first thing I would do, and I plan to do it literally, is try to create a slide deck saying what AI can see and what a human can see. So if my folder has 10 versions of budget 1, 2, 3, 4, 5, 6, and I ask a question from that folder, some answer will come from budget one, some answer will come from budget two, because unlike a human, where I am focused on this question, AI is designed to scan the entire thing available. So it’s a big difference between human and AI: I can be focused; AI, once I give a thing to it, will just scan everything it has in its domain. So I would say, and not taking much of your time, the starting point should be that let us create this framework, let us have a shared understanding, let us have a core AI readiness part and an aspirational AI readiness part, and work on it.
Yeah, I think that’s very relevant, because you cannot leapfrog into everything. I mean, you can have the aspirational part, but the foundation is very, very important, and everybody joining that foundation exercise is really important. I’ll go to you, Prem, and talk about Data Commons. Data Commons aims to make public data more accessible and usable. You’re from Google, and you have put all this in open source. You’ve been working on US Census data being available. Tell us some more about your experiments and how Data Commons is ready or prepared to work on this challenge.
Thank you for having me here on this panel today. I think one of the areas I’ll start with is the importance of coming to that understanding on AI-ready data, but understanding that the field itself is moving quite quickly at the same time. So whatever agreements we come to today, in six months it feels like we’re dealing with a brand new technological landscape that we’re staring down. What Data Commons tried to do was say: if we can get our data in that machine-readable format, which means structured, which means machine-readable metadata also, and a format where that format specification is not stuck behind a 500-page PDF, right? Can we make that in a way that the machine can understand it, interpret it, and then use it?
Our theory behind this is that the idea of a knowledge graph from that data, combined with the large language model, gives you a much better chance of success to answer your question. So at Data Commons, what we try to do is we try to bring multiple data sets globally together in a common knowledge graph and then put an AI search engine on top of it so that you can quickly access that data. You can play with this yourself at datacommons.org. But what we did is we open-sourced the entire stack, because this idea that that data is centralized with one source is also the dangerous part, and it shouldn’t be, right? The data should be federated.
It should be located at every organization and governed locally by the organizations that are using it. And so one of the things we’ve done by open-sourcing that stack is allowed, for example, the United Nations, the United Nations Statistical Department, to use Data Commons as their back end. And so, you know, UN SDGs, WHO data, ILO data, and so on and so forth, are all stored in this common interoperable database now, where instead of a data analyst spending 80% of their time renaming column headers, they can actually focus on the data analysis so that we can get the impact and the outcomes we want to see. Hope that helped answer the question.
Yes, yes, no, absolutely. I’ll poke you a little bit more to understand: on Data Commons, what’s the vision you have?
So a very simple vision, right, which is make data-aware decision making the easy answer to take. Today, right now, the majority of the world is flying blind. Whether you’re one of those 74 million MSMEs in India, you can’t afford a bevy of computer scientists and data scientists that you can hire; you pay a tax to play with any data. If you’re a policymaker thinking about climate change, poverty, education, health, these are holistic problems. It’s no longer: I can go to one ministry, pull one spreadsheet and solve poverty. I need to endemically understand how does education, how do health outcomes, how do income and economy, how do all of these affect poverty locally, right? And that’s the problem we have today, that the world is a multi-dimensional problem. The other problem is our brains are not inherently multi-dimensional. Our brains are great in three dimensions. You add a fourth dimension, which is time, and we’re okay, right? Like look at climate change: you add time and it’s greater than our lifetime, we can’t think about it, which is why we’re not solving it, right? But the majority of problems are 50, 60 dimensional problems. Machines are really good at this, by the way.
And humans are good at using tools that are good at doing things we’re not. And this is where we have to approach AI as a tool we can use. Not as the answer, but as a tool we can use to derive the answer, to supplement our brains in the areas we’re
I’ll poke you a little bit more, but later on.
Shalini ji, I just want to take a second stab on that, just a quick interjection. I’m a statistician, so I’ll be very happy if some of my work can be done by AI, you know, all those large language models. I just read a paper today in the morning. It’s been written by two undergraduates from a Canadian university. And they said, and they proved it, that if you give the same prompt to AI with the same data set, it gives you two types of analysis. So this is something I just wanted to flag, that we should not be really gung-ho about things which are still untested. But yes, I would be the first to adopt AI and use it for my work, but it needs to be, as you rightly put it, trustworthy.
Yeah, I’ll just comment on this, the stability of an answer, that’s what you’re talking about. We are actually working to create a benchmark on this, because the same thing we are doing. Like Amul AI was launched today in the morning by the Prime Minister, and the same thing applies to Bharat Vistar, and we are actually seeing that if you ask the same question multiple times across LLMs, and also to one LLM many times by different farmers, both ways, you get different answers. And can we make that a benchmark? That’s what we are working on, because this is a benchmark which is needed really on the ground, right? So that’s the part I wanted to comment on.
I’ll go to Mr. Ashish. You’re from the industry, and you work with IIIT Bangalore. Tell us more about the research in the data area, plus how institutions can help build it all together.
Right. So I think my perspective is more as a practitioner, because for the last almost three decades I’ve been a solution builder. So I have seen data not from the data side, but from the solution side, trying to exploit it, trying to use it for the solutions. And I’ll come to the institution part of it. But, you know, when I look at the data and the challenges which are associated with it, then for the last 10, 12 years I’ve been doing AI and digital for social problems, like women and child health; I worked on that for almost a decade. Now, one of the things that I realized is that the world is fast moving to where you don’t manage a transaction.
You manage a journey. Okay, and that is the agentic AI and all those things that we are talking about. Now, when I was working a few years back on the women and child data, I realized how fragmented it is. The two main data sets: if you look at a child’s health, his anthropometric data, his nutrition data is with Women and Child Development through their Anganwadi program; if you look at the birth data, the immunization data and a lot of other data, it is with the Health and Family Welfare department. And if you have to have integrated decision making on what needs to be done for that child, then you have to look at both the data sets, but that burden of orchestration falls onto the person who is making the solution; the data does not by itself flow through the workflow. And that is one of the biggest problems we have to solve: we look at data sets in isolation, but we don’t look at how data flows through the process. The second thing is contextualization. We have all read the book, at least some of us, that raw data is an oxymoron; data always resides in a particular context and with some standardization associated with it, so that you can make some sense out of it.
Now with education, when we were working recently, we realized that LLMs are becoming increasingly good in translation, at least with the main languages, not with all the dialects. The moment they hit any domain-specific vocabulary, that’s when they start failing. Even a class 6 physics question, all these frontier models are not able to properly translate. So we came up with a solution of using a glossary combined with the LLM so that it does a decent job in terms of overall translation, and the contextualization is transparent to the user. And the third thing which I faced a lot is that when we talk of public data, a lot of it is declared data and not verified.
Not verifiable data. Especially when a lot of planning depends on surveys, and a lot of survey data is actually declared data: whether you have hypertension or not, yes, no; whether you have this problem, yes, no. What is the verification? No doctor has actually verified that, and you are going to make a decision based on that. So in my opinion, AI ready data has to solve these three big problems: it has to be interoperable, it has to be contextual, and, the third point I was making, it should be verifiable, and governable as an extension of that.
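One way to read the glossary-plus-LLM approach Ashish describes is as prompt grounding: look up the domain terms that appear in the sentence and hand their approved translations to the model alongside the text. The sketch below is an assumption about how such a pipeline could be wired; the `call_llm` helper and the glossary entries are illustrative, not the A4I Lab implementation.

```python
# A small, illustrative domain glossary (source term -> approved Marathi translation).
PHYSICS_GLOSSARY = {
    "refraction": "अपवर्तन",
    "velocity": "वेग",
}

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError("plug in a concrete LLM client here")

def translate_with_glossary(text: str, glossary: dict[str, str], target_language: str) -> str:
    """Inject matching glossary entries into the prompt so domain terms are not mistranslated."""
    matched = {term: tr for term, tr in glossary.items() if term.lower() in text.lower()}
    glossary_lines = "\n".join(f"- {term} -> {tr}" for term, tr in matched.items())
    prompt = (
        f"Translate the following text into {target_language}.\n"
        f"Use exactly these approved translations for domain terms:\n{glossary_lines}\n\n"
        f"Text: {text}"
    )
    return call_llm(prompt)

# Example: translate_with_glossary("Light bends due to refraction.", PHYSICS_GLOSSARY, "Marathi")
```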
Very relevant, I think you have posed the right challenge. So Prem, I am going to come to you. Let’s just pick one of them, which is contextualization, because I am increasingly seeing that domain information is needed and people are creating these glossaries. Like even in agri, when we had to roll out Mahavista, we actually created a glossary of 5,000 terms, which is in Marathi, because it has to be in Marathi, and those terms are being used. And I know we did some experiments and we have created a sandbox environment; you have done it for India. So why don’t you explain how contextualization and domain can be added to Google Data Commons and how it can be helpful?
I think this idea of contextualization and localization is very important. At the end of the day, these are large language models, language being the key word there; they’re not data models. And so, to what Mr. Bhardwaj said earlier, what you want to be able to do is use them to write code to manipulate data, because code is language, but you don’t necessarily want them to be producing data on their own. And one of the problems you have today is also that those large language models are essentially created largely off the web, which has its own biases inherent in it, both language and locality-wise. And then on top of that, the example you used of the full folder of all the budgets, right?
The example I like to use for this is actually, if you ask a large language model about a celebrity that recently had a breakup, they’ll tell you they’re together, because it doesn’t know what just happened over the last month, right? It’s very sad. And so this is where you can use, though, the combination of, you know, you called it a glossary, I always call it a knowledge graph. What is that factual basis of information that I can put together? Now, it’s always going to be a subset of the whole, right? I might be able to cover maybe 0.1% of the world’s information with a knowledge graph. But if I can ground it in those facts, can I then utilize the intelligence of the large model to then help me produce some knowledge from those facts or fill in the gaps in those facts?
And so this, I think, is an opportunity that we actually have in the technology to move it forward. This is one of the areas that we’re actively working on as a team. But again, to do that, you first need that glossary of facts, right? This is where having that knowledge graph of statistical data, even if imperfect at this moment, because it is survey collected. It is dependent on the quality of the question asked, the error bar shown, the quality of that metadata, so on and so forth. But it is a starting point from which you can get more information and use that intelligence to potentially even find those outliers or areas that don’t match what you might be hearing on the ground.
So that’s the opportunity I think that we have.
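Prem’s point about grounding an LLM in a curated graph of facts can be pictured very simply: retrieve the statements relevant to a query from a small fact store and ask the model to answer only from them. The triple store and the `call_llm` helper below are illustrative assumptions, not the Data Commons implementation.

```python
# A toy "knowledge graph" as subject-predicate-object triples (illustrative facts only).
FACTS = [
    ("Nagpur", "located_in_state", "Maharashtra"),
    ("Maharashtra", "capital", "Mumbai"),
]

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError("plug in a concrete LLM client here")

def grounded_answer(question: str) -> str:
    """Answer from retrieved facts only, so the model fills gaps rather than inventing data."""
    relevant = [f"{s} {p} {o}" for s, p, o in FACTS if s.lower() in question.lower()]
    prompt = (
        "Answer the question using only the facts below; say 'unknown' if they are insufficient.\n"
        + "\n".join(relevant)
        + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```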
Because I absolutely agree with you, but I will say it in more direct terms, because sometimes we feel that LLMs, or in a previous version the AI models, are the solution. They are not the solution. They are only one of the inputs to the solution, and they comprise 10%, 15% of what you’re trying to do. It is what the rest of the 85% is doing. Yes, the LLM will give different answers; how are you compensating, with guardrails, human in the loop, risk assessment? These are the tools which are available today. Because at the end of it, it’s a probabilistic model, okay, come what may. And I was talking to a mathematician from MIT and he explained why it will never become perfect; that fact is grounded in mathematics. It cannot ever become as perfect, as consistent every time, as we want it to be, because then you are taking the main source of its creativity away from it. So what you have to focus on is outside, not inside. That’s all I ever wanted to say.
I agree with you completely, and I started by saying it’s a tool, right? And we use tools to supplement ourselves, not to replace ourselves, to supplement our knowledge, not to replace our knowledge. So I do agree with you, it’s a tool, but we have to be careful about throwing the baby out with the bathwater here, in the sense that that tool now makes things available to the average person. It upskills the average person in a way that they couldn’t themselves before.
So if we immediately go to put guardrails, prevent access, things like that, we’re preventing a large part of society. And I’ll say as somebody who worked on Google Search for many years, there were many arguments in Google Search that we, for example, shouldn’t put health information on search. Because the average person isn’t smart enough to be able to deduce information about their own health from Google. But the average person can’t afford a doctor also, right? There are endemic problems in society that prevent you from doing that. So does the answer to that question suffer, or does the answer to that question do less harm and give people a pathway that they can learn from? And so that’s an important question to ask ourselves here as we think about AI, which is, yes, it is imperfect at this moment.
Can we understand? Can we educate? Can we work inside the system that exists? We can’t ignore it either. We can’t say it made one mistake, therefore I will not use it. And I will also call out that the imperfection of us as humans is also very much there, right? So there are many times we look at these systems and, you know, we look at a Waymo autonomous vehicle and we say, look, it had six accidents last year. There are 30,000 deaths from car accidents in the U.S. a year, right? And so statistically speaking, this is still much safer, right? And so these are the sorts of examples that we have to look at, understand where to apply it, how to apply it, and what the overall societal good is from using it.
Yeah. No, thanks. I think it’s a very relevant discussion that we are having, and there’s always a fight between should we have a RAG architecture, or should we just give it all to the LLM to do, because it has more capacity and more, you know, GPU. But either-or is not possible. And there is so much about sovereignty: maybe you want to keep the data, and data sovereignty comes in a lot. And this has been a discussion in the last two days, in most of the panels that I have been in, that you want to keep your data.
Countries want to keep the data with themselves, and they actually don’t want to train, because with the choice of LLMs, you want a lot of choice and you want to use it here, there, everywhere. So I’ll come back to you, Rohit ji. We talked about administrative data and you talked about a framework. So my question is: alternate data, secondary data beyond administrative data, how can that also be brought in? And the framework which you talked about, that there should be a foundational framework, if that framework is adopted by industry, one, is it possible? And two, what kind of data economy can it start?
So this is early morning. Let me take an audience poll on it. How many of you think that what Shalini asked is a governance issue? I mean, just raise your hand if you feel it’s a governance issue. Anyone who feels it’s a governance issue? How many of you feel it’s a technological issue? What she asked: how to make alternative data ready for AI, that’s what the question was. So how many of you feel it’s a technology issue? There are no prizes for it, there’s no punishment for it, so feel free to raise your hand the way you think. It’s a technology issue, so, okay. So I am with that gentleman. I feel it’s a governance issue.
And I’ll also work on it. So what are we talking about? We are talking about data generated from different sources, be it alternative data sources, be it administrative data sources. My co-panelist just talked about getting data from different sources not aligned to each other. So it’s a governance issue which we need to understand first. And of course I completely agree with Shalini when she said that we need a federated model, perhaps Prem said that. We need a federated model. There cannot be one sole owner for the data of this country, or for that matter for any country. Somebody needs to play the role of data steward, somebody needs to orchestrate this data ecosystem, and perhaps, being from the NSO, I have my own biases, I’ll say the NSO can do it, but of course that’s something for the people to decide. Now let’s understand this. What do we need when we need AI ready data? We need, first, a cataloguing of it. I’m just going to take one minute on it. You should have everything catalogued: any industry, any government organization, this is my data set, these are the indicators, these are the definitions, and so on and so forth, I’m not getting that deep into it. You need a catalogue of your data, and if that’s not there… The second thing is that the catalogue should not be a PDF; that catalogue should be, as she was saying, machine readable, a JSON file probably. There are many other ways, but let’s talk about a JSON file.
Second point: you should have metadata for it. If you don’t have metadata for it, I mean, the other day I was on another panel with Prem, I said the thing which irritates me the most is lack of metadata. I don’t know, I’ve been driving blind. I don’t know what the word frequency means; it may mean hundreds of things. So you should have metadata, and again not in PDF. So whatever I’m talking about, I mean JSON or XML, there are so many ways, but machine readable, let’s put it that way. Third, you should have a context file. So now the machine has read it, but it wants to know: where do I find the meaning of frequency?
So the machine should have a context file where the source is written: you go there and see, and you will find the meaning of frequency. So the metadata will not have the meaning of frequency; it will only write frequency means quarterly. The machine now needs to understand what that frequency means. So that’s what she was talking about, and Prem again was talking about. That brings us to: we need to have a business glossary. We need to have a business glossary. He also talked about a knowledge graph, which is, I mean, just a sophisticated version of a business glossary. That we need to have. So once we have sorted this out, we need to work out what type of codes we are working with.
So the gentleman just beside me talked about two data sources using different codes for the same thing. So then we have to standardize those codes. And then lastly, we have to structure our data. Data needs to go in a structured database. It should be defined, and that’s nothing new I’m talking about. It should be defined by dimensions, it should be defined by attributes, it should be defined by its role. So time means temporal: you can’t write time and expect an LLM to understand what time means, you have to say time means temporal. And once you have these ready and available, so there are two use cases, and just a last quick point. One is: am I using it for my own use case?
Am I training my own model for it? Then I can put all these in one file and feed it to my model. But if I’m expected to create an MCP for my database, then I have to create separate files, put them up at a URI or URL where any model can go; the connector can direct the model to that place, that resource, and then things happen. And this is all from my personal experience, and Shalini ji knows about it, when we developed our own MCP server.
People are loving it; the amount of reach-out which has happened to use the data sets. You can actually find out, you can ask a question of how the price of moong dal has moved over the last whole year, or quarter wise, or month wise. That capability is there now, and it has happened because the data was always there; they do the calculation of the wholesale price index, the commodity price index, so the data was there. It’s just that now it is AI ready for people to consume, take and then ask, and it is connected to Claude and ChatGPT. Ashish, I’ll go to you, building on what Rohit ji stopped at, which is the use cases, and you come from the solution part of it. How do you visualize and imagine solutions and use cases combining, say, administrative data and alternate data? I’m not going into personal data, because there’s a lot of consent there, but at least a lot of secondary sources of data are available. How do we combine them and make it more powerful?
I think, as you rightly pointed out, I come from the solution perspective, and now with agentic AI coming in, we look at every solution in the form of a journey. We are going past the mechanism of a point solution, where you ask and it reverts back with the answer, and now the use case has to decide at which part of the journey what data you need, and that will dictate whether it is additional data sets which are outside, or a public data set. The only challenge which I see here is who is accountable for that data, and how the policies around it are carried in the solution at the API level, at the policy engine level, which actually go along with the solution, and it should be enforceable automatically.
If you are thinking that a human being will actually enforce that policy, it will break. It will break in no time. So that is what we are trying to do: to create those reusable artifacts as DPIs or DPGs, it will fall into one of those categories, which allow those policies to be set for a data set in an easy, reusable way, so that everybody doesn’t have to recreate those kinds of policies from scratch, and then that’s the way to move forward.
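A hedged sketch of what “policies enforced automatically at the API level” might look like: every data access passes through a policy check that travels with the dataset, so no human has to remember to apply it. The policy fields, the dataset and the decorator below are assumptions for illustration, not a specific DPI/DPG specification.

```python
from functools import wraps

# A reusable policy artefact that travels with the dataset (illustrative values).
DATASET_POLICIES = {
    "child_nutrition": {"allowed_purposes": {"research", "programme_monitoring"}},
}

class PolicyViolation(Exception):
    pass

def enforce_policy(dataset_id: str):
    """Decorator that checks the declared purpose against the dataset's policy before any data is served."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, purpose: str, **kwargs):
            policy = DATASET_POLICIES.get(dataset_id, {})
            if purpose not in policy.get("allowed_purposes", set()):
                raise PolicyViolation(f"purpose '{purpose}' not permitted for {dataset_id}")
            return func(*args, purpose=purpose, **kwargs)
        return wrapper
    return decorator

@enforce_policy("child_nutrition")
def get_records(district: str, purpose: str) -> list[dict]:
    """Stand-in for the real data API; returns aggregated records only."""
    return [{"district": district, "stunting_rate": "aggregated value"}]

# get_records("Nagpur", purpose="research")    # allowed
# get_records("Nagpur", purpose="marketing")   # raises PolicyViolation
```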
You mentioned your lab. I’m sorry, I just sprang that on you. Tell us more about your lab. What more work are they doing?
So that’s my current job. Previously I was heading a Gen AI company, by the way, and I will talk separately later about the PDF challenge, which we thought we had solved; we didn’t fully, but we were on the way. But the current lab, which is very exciting, is a collaboration between Microsoft and IIIT Bangalore. A4I stands for AI Innovation for Inclusion Initiative. That means we create at large scale. The idea here is not to run pilots, that we do this small thing here and diagnose, not that. It should be population scale, and we want to launch it as a DPG so that it can be largely adopted. So we are working on education, the school education area.
We are working with teachers in terms of making their life easy. We are working on accessibility: how blind children can actually be taught STEM so that they can hope to become a physicist or a mathematician; today it’s very difficult to even read a book. And the third one we are doing is working with the last mile health workers. Our current solution is a RAG-based AI combination, but we are looking at exactly that problem that you mentioned, that either it is this or that. I think there are plenty of answers which are in between, and that is what we are exploring.
Thank you. Thank you so much. Prem, I’ll again build on the concept that we were discussing on the use cases. I just want you to paint a picture: if you have data in knowledge graphs, like what you mentioned, if the data is there and Data Commons is present, I just want you to visualize what more use cases can be possible with secondary data. How can India benefit, and not just India, how can the Global South benefit from this? And please feel free to paint the use cases which you have built in the sandbox environment that you have. You can just take those examples.
Yeah, I’ll give two very different examples here. These might not be exactly where the sandbox is today, but where it could go tomorrow, right? One is: at the end of the day, the Ministry of Statistics does a lovely job collecting as much information as they can, the whole ministry does, the government does. It’s a top-down data collection.
I’m sorry, I’ll just interrupt you. I think Rohit ji will say it’s not top-down. It’s actually at the field level, it’s bottom-up.
That’s fair, that’s fair.
He will say that, it’s bottom-up.
That’s fair, that’s fair. You’re correct, it’s bottom-up. That said, we have alternate data sources also that are there. Sometimes they supplement and further show that, yes, the data collected is correct. At times they disagree, and those disagreements are also interesting to understand, to the point of: where is the survey question flawed, or where is civil society seeing something, or having visibility into something, that we don’t have access to? And so the more of these data sets come together, these points of friction, again, this is where the human intelligence comes in: show me the points of friction. I have a haystack full of needles; which needles do I pay attention to, right? So this is one example if I’m at the government or the Ministry of Statistics level.
Now let’s go to the completely opposite end. I’m a small business owner. I’m setting up a physical shop. Where should I set it up? Right? Where I set it up depends on mobility traffic, depends on the demographics and affordability in that space, depends on all types of things, right? It’s a large data question. But that MSME owner is often ill-equipped to answer any of those questions, and is often taking a shot in the dark. And that shot in the dark is a costly shot in the dark if they’re wrong, right? Because they are taking the full risk of that decision. Now with the Data Commons that we’re building, the question becomes: can we reduce that risk for that individual?
Can we help them model, understand, de-risk the decision they’re making, based on the audience they want, based on the footfalls they want, based on the location that they’re choosing? That’s a very specific example now. But these are two very opposite examples of how bringing all of this data together, which we often think about as more aligned towards, you know, the international organizations or the government minister, is actually usable on the ground by an individual too.
Tell us a bit more: suppose someone wants to put up a Data Commons instance, how can they get started?
It’s actually quite simple. It’s easy enough that I can do it myself, which means you can. datacommons.org is an open source platform. We have a 20-minute guide to get started. You can set the whole thing up on your computer, have your CSV data set, and bring it in. And the thing is, once you bring one data set in, it overlays with all the data sets already in Data Commons. This creates sort of a network effect between the two sets of data, right? So if I am a chain store in India trying to figure out that next store location, and I bring in all my per-store sales revenue data once, then suddenly I can compare that to, and overlay it with, the 50,000 data sets that are already in Data Commons.
Before, if I wanted to do this as a chain store in India, I would normally have my people come up with maybe 10, 12 different hypotheses, because then I have to get those 10, 12 different data sets and I have to perform 13 different data transforms, right, so they’re all in the same format. That prevents us from having the level of creativity we want, where we can look across the entire landscape of the problem set. And so this is sort of one of the things.
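The overlay Prem describes can be pictured as a join on shared location keys: a retailer’s per-store sales land next to public indicators for the same places. The pandas sketch below only illustrates that idea with made-up columns and placeholder values; the real Data Commons import flow and dataset names differ.

```python
import pandas as pd

# Retailer's own data: one row per store location (illustrative values).
store_sales = pd.DataFrame({
    "district": ["Nagpur", "Pune", "Nashik"],
    "monthly_sales_inr": [1_200_000, 2_100_000, 900_000],
})

# Stand-in for public indicators already available for the same districts (placeholder numbers).
public_indicators = pd.DataFrame({
    "district": ["Nagpur", "Pune", "Nashik"],
    "population": [2_500_000, 3_100_000, 1_500_000],
    "avg_household_income_inr": [420_000, 610_000, 380_000],
})

# Overlay: once the location key lines up, private and public data sit side by side,
# and simple derived measures (e.g. sales per capita) become possible for location-risk modelling.
overlay = store_sales.merge(public_indicators, on="district")
overlay["sales_per_capita"] = overlay["monthly_sales_inr"] / overlay["population"]
print(overlay)
```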
Right answer. And it was a matter of trust for the NSO also, that, you know, people were getting different answers for the data which is created by the NSO. That made us look toward the MCP server. A, it is open, so it makes our data interoperable for almost all the AI systems; I am not saying all the AI systems. Otherwise what would happen? Be aware that every LLM has its own standards of API, so you create those APIs first and then you somehow manage the LLM to approach that API. With this connector, it’s like the USB-C socket for the phone charger, if I may use the parallel, where you can just plug in any USB-C socket and use it for anything.
That’s what the MCP is. So the data comes and the LLM comes and plugs into MCP, and it allows any LLM to come. But what you have to do now is connect that small tool with that LLM. That’s a one-minute job and it’s available on our website. You go there, www.mospi.gov.in, and in the offering section everything is available. You can do it in one minute, maybe two minutes at the most, anyone. But still there is one challenge, which I must tell you: we somehow need to ensure that this becomes a default tool, so the user does not have to add it. Somebody forgets it, and then the same situation starts happening again.
So right now people have to add it to their tool. But the biggest advantage I see is that people don’t have to come out of their workflow. So if I have taken a very costly Claude Pro, then I don’t have to come out of it and go to my portal to get the data analysis; I can keep using the intelligence of Claude or ChatGPT, I don’t have a preference for that, with the verified data, as he talked about, the verified data of MoSPI. And the use cases are innumerable on the web now; I mean, people have just lapped it up. My favorite is that there is a Tamil song which talks about a lot of grains.
So one of the messages I got, and I’ll share the link also, it’s on Twitter, I mean X now, is that somebody created a CPI for all the grains which were talked about in that song. CPI is the consumer price index, which basically talks about inflation. They just took the grains out of the song, you know, wheat and so on, and created a CPI index for them, and they have named it like the P index or something, which is like the song’s name; I’m not very conversant in Tamil, pardon me for that, but I’ll share that link. So that’s my favorite use case. What I mean to say is that people can use the data the way they like it; that’s the bottom line, and that’s what the NSO’s idea is.
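For readers who want to picture the connector Rohit describes, below is a minimal sketch assuming the Python MCP SDK’s FastMCP interface; the tool name, the dataset and the lookup logic are placeholders, not MoSPI’s actual server.

```python
# A minimal MCP server sketch (assumes the `mcp` Python SDK is installed).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("price-index-demo")

# Placeholder values standing in for a catalogued, machine-readable price-index series.
WPI_SERIES = {"moong dal": {"2024-Q1": 182.4, "2024-Q2": 185.1}}

@mcp.tool()
def get_price_index(commodity: str, period: str) -> float:
    """Return the wholesale price index for a commodity and period from the catalogued series."""
    return WPI_SERIES[commodity.lower()][period]

if __name__ == "__main__":
    # Any MCP-capable client (Claude, ChatGPT, etc.) can plug into this tool
    # without the user leaving their own workflow.
    mcp.run()
```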
That is the most interesting use case I would have seen, and I really want to see it. Yeah, I’ll have a look at it. So one more thing which I want to tell the audience: like the use case Rohit ji mentioned, where someone can just pick the data, we have created a concept called the data boarding pass. So this is a data boarding pass, for AI ready India.
This is a physical copy, but actually the concept is that once your data is ready and it passes a set of checklists, then as a B2B player, you could be a policymaker, you could be a researcher, you could be a market player wanting to build on top of it, you can take this data boarding pass and get onboarded for data usage, so that you can pick the data and then start using it in your applications. So, say at a district level, and I’m just painting a scenario, you have a data commons where the knowledge graph and the data have all been combined together, created together, with the right context and everything.
And now some organization, say an automobile MSME manufacturer, wants to access it and give information to dealers as to where scooters are being sold, where motorcycles are being sold, and what the income of that region has been over a period of time. That can be possible now, right? So the data boarding pass enables it, makes it possible. And if you want to physically see how this exactly works, visit our booth, the XSTEP Foundation booth, in Hall 3 on the first floor. Do visit it, and my team will be there to show you the actual generation of the data boarding pass. I think we have covered a lot of things. We have less time, but I want to take a couple of questions from the audience.
So feel free to ask. We have four minutes, so we can have two or three questions from the audience. I saw that hand first, sorry, and then I saw you, so you’re next. Yeah, please go ahead. Can someone give him a mic, please? Otherwise, I’ll hand over mine.
Thank you very much. I wanted to ask you about the business models of these platforms, because it is obviously extremely important to have high-quality data, but high-quality data is also expensive to collect and to maintain over time. So have you worked, besides this, on how you can maintain these kinds of platforms over time? Does it have to be, I don’t know, publicly paid, or whatever models you may have? And it’s also a question for everybody, I think.
Go ahead, then I’ll also add.
So, just a quick clarification on that. The National Statistics Office of India is fully funded by the Government of India. As we all know, national statistics offices everywhere are publicly funded, through public money. So it’s our job to create data and make it available to the public. At the same time, one quick disclaimer on that: open data is not free data. Somebody has paid for it. So we provide the data depending on the use. If the use is research and things like that, and I’m not getting into the details of it, then it’s free.
But if the use is commercial, then of course there is a system, there is a policy for it, and people have to pay accordingly.
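A rough sketch of how such a use-dependent access policy could be expressed in code follows; the two tiers (free research use, paid commercial use) mirror the description above, while the function name and return shape are assumptions made purely for illustration.

```python
# A minimal sketch of use-dependent access terms, assuming just the two
# declared uses mentioned above: research (free) and commercial (paid).
def access_terms(declared_use: str) -> dict:
    if declared_use == "research":
        return {"fee": 0, "licence": "open-for-research"}
    if declared_use == "commercial":
        # The actual charge would come from the published pricing policy;
        # it is deliberately left symbolic here.
        return {"fee": "per published policy", "licence": "commercial"}
    raise ValueError("declared_use must be 'research' or 'commercial'")

print(access_terms("research"))    # {'fee': 0, 'licence': 'open-for-research'}
print(access_terms("commercial"))  # {'fee': 'per published policy', ...}
```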
Yeah. So I’ll also answer it, because we have done a good amount of work on this. I would encourage you to see a paper that I’ve put up on our People Plus AI website, which talks about the GIVE model for data. G is guaranteed trust, and we talked about it. I is incentive: why should I bring the data, what will I get from it? V is value: if the data has no value, nobody is interested. And E is exchangeability: can I share the data? I’ll focus on the I, the incentive. There has to be an incentive for someone to bring the data, and there has to be an incentive for someone to use the data, and that value will be monetized. That is the data economy. If you ask me, this data economy is already running without a formal mechanism. There is a good amount of money in selling data, buying data, lead generation; a huge amount of activity is happening. This formalizes it. But what will be the price? The data economy has to stabilize, and that has to happen at the regional level with the private sector. So we have been working in that direction so that the incentive model is clear, but the actual price is a discovery mechanism.
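To show the four GIVE dimensions as data rather than prose, here is a minimal sketch; the field names and the simple “ready for exchange” rule are illustrative assumptions, not something taken from the People Plus AI paper itself.

```python
from dataclasses import dataclass

@dataclass
class DataOffer:
    dataset_id: str
    guaranteed_trust: bool   # G: provenance and certification are in place
    incentive: str           # I: what the contributor gets (revenue share, access, ...)
    value_score: float       # V: estimated usefulness to consumers, 0..1
    exchangeable: bool       # E: the data can legally and technically be shared

    def ready_for_exchange(self) -> bool:
        # All four dimensions must hold before price discovery even starts;
        # the price itself is left to the market, as noted in the panel.
        return (self.guaranteed_trust
                and bool(self.incentive)
                and self.value_score > 0
                and self.exchangeable)

offer = DataOffer("district_vehicle_sales", True, "revenue share", 0.8, True)
print(offer.ready_for_exchange())  # True -> the offer enters price discovery
```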
It’s very interesting to hear all this, that’s amazing. One key scenario we see every day, and it troubles us a little, is a road getting built and then stuck after a few days. It might not feel good, but that’s how it is, because somewhere it feels like a disconnect in the data, or somewhere in the policy decision-making. So do we have some way to apply these kinds of pieces in, say, the tender ecosystem, where a road gets made and is then dug up for a pipeline after a very short window?
Yeah, maybe I’ll answer it. See, India has put the whole digital public infrastructure in place. This is the DPI thinking, whether UPI, Aadhaar, DigiLocker or DigiYatra; they were digital rails that were put together. The data infrastructure we talked about today is going to be those rails. Is it going to be dug up? Are there going to be holes in it? Maybe. But I think it’s a journey that, if we don’t do it and don’t start it now, is going to hit us later on. So no promises, but yes. Rohit, do you have anything to add on that?
I just wanted to add that we need to keep working on these data sharing platforms and all the philosophies we just talked about, like accessibility, sharing, analysis, use of AI, and things will improve slowly but steadily, I’m very sure about it.
Time is up, and the next session is going to start. So thank you so much for listening in to the AI-ready data session, and please visit the booth to see it actually in action. Thank you. Bye.
– Claim: “Enterprises and governments hold vast quantities of information in fragmented PDFs, legacy systems and isolated silos.” Support: the knowledge base explicitly notes that valuable information remains trapped in PDFs, documents and isolated systems across enterprises and government organizations, confirming the claim.
– Claim: “The ‘information divide’ prevents entrepreneurs and citizens from accessing relevant data such as government notifications.” Support: the source describes an information divide where entrepreneurs and citizens cannot access relevant data, corroborating the statement.
– Claim: “A lack of trust in sharing data with AI systems compounds the information divide.” Support: the knowledge base highlights the need for a trust infrastructure so users feel comfortable with AI outputs, adding nuance to the claim about trust issues.
The panel shows strong consensus on five pillars: (1) a common, technically detailed definition of AI‑ready data; (2) governance‑first, federated stewardship of data; (3) the necessity of benchmarks and human guardrails for trustworthy AI; (4) the role of domain‑specific glossaries and knowledge graphs; and (5) the need for incentive‑based data economy models.
High consensus across technical, policy and economic dimensions, indicating that future work should prioritize coordinated standards, federated governance structures, and sustainable financing mechanisms to unlock AI‑driven development.
The panel shows moderate disagreement centred on technical implementation choices (catalog vs knowledge‑graph) and the role of AI relative to human decision‑making, while there is broad consensus on the need for governance frameworks, trust, and federated stewardship. The disagreements are substantive but not polarising, indicating that collaborative standard‑setting and pilot projects could reconcile the differing viewpoints.
Moderate – differing technical preferences and philosophical stances on AI’s authority, but shared commitment to governance, trust and open‑source solutions, suggesting that coordinated policy and technical work can bridge gaps.
The discussion evolved from a broad problem statement about fragmented data silos to a multi‑layered roadmap for AI‑ready data. Key turning points were triggered by comments that exposed foundational gaps (Rohit’s call for a shared definition of AI readiness), proposed concrete architectures (Prem’s knowledge‑graph + LLM model), highlighted practical challenges (Ashish’s journey metaphor and glossary solution), and demanded accountability (Rohit’s benchmark concern). These insights prompted participants to converge on a common language—metadata, catalogs, federated governance—and to envision operational tools such as the data boarding pass and the GIVE economic framework. Collectively, the highlighted comments steered the panel from abstract concerns to actionable strategies, balancing technical possibilities with governance, trust, and sustainability.
Disclaimer: This is not an official session record. DiploAI generates these resources from audiovisual recordings, and they are presented as-is, including potential errors. Due to logistical challenges, such as discrepancies in audio/video or transcripts, names may be misspelled. We strive for accuracy to the best of our ability.