Toward Indigenous AI: A Critical Analysis of BharatGen’s Role in Data Sovereignty and Language Equity

This article critically examines BharatGen’s role in advancing Indigenous AI, focusing on data sovereignty and language equity in India.

Praveen Kumar Myakala

Vijayalaxmi Methuku

Srikanth Kamatala

Jun. 09, 25 · Analysis


Abstract

This study critically examines BharatGen, a government-backed initiative to develop India’s foundational multimodal and multilingual Large Language Model (LLM), as a transformative step towards indigenous Artificial Intelligence (AI). In a landscape dominated by global LLMs, concerns over data sovereignty and the underrepresentation of non-English languages have become increasingly salient. The study analyzes BharatGen’s role in addressing these issues by enhancing national control over digital data and promoting language equity across India’s diverse linguistic spectrum. It also explores BharatGen’s strategic significance in reducing dependence on external AI ecosystems, its alignment with India’s national AI policy objectives, and the challenges and opportunities associated with its deployment. Ultimately, it argues that initiatives such as BharatGen are vital not only for technological self-reliance but also for preserving cultural identity and ensuring linguistic inclusivity in the evolving global AI ecosystem.

Keywords

Indigenous AI, BharatGen, Data Sovereignty, Language Equity, Multilingual AI, Indian Language Technologies, Foundational Models, AI Policy in India, Digital Inclusion, Multimodal Language Models, Ethical AI, AI for Low-Resource Languages, AI Localization, Sovereign AI Infrastructure, National AI Strategy

1. Introduction

The rise of large language models (LLMs) has ushered in a new era of artificial intelligence (AI), enabling machines to generate, understand, and interact with human languages across diverse modalities. However, much of this progress has been centralized within Western institutions, and the models have been trained predominantly on English-language data. This has raised critical concerns around linguistic inequity, cultural erasure, and algorithmic coloniality [1–3]. These issues are particularly important for multilingual and culturally diverse nations such as India, where the dominance of English-trained models often marginalizes regional languages and undermines the goal of inclusive digital transformation.

To counter these trends, the concept of sovereign AI has attracted global attention. This approach emphasizes AI systems that are developed, governed, and trained using local infrastructure and data sources [4,5]. India’s response to this paradigm is reflected in the launch of BharatGen, the country’s first indigenously developed, government-supported, multimodal, and multilingual foundational LLM [6,7]. Spearheaded by the Department of Science and Technology under the National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS), BharatGen aligns with India’s broader digital policy goals, as outlined in the National Strategy for Artificial Intelligence [8] and the INDIAai mission [9].

BharatGen is set apart by its foundational focus on language equity and data sovereignty. By supporting 22 scheduled Indian languages and incorporating culturally contextualized datasets through the Bharat Data Sagar initiative, it directly addresses the challenges of underrepresentation faced by low-resource languages in mainstream AI [7,10,11]. This approach not only expands linguistic access to AI-powered services, but also strengthens national control over digital infrastructure and content, reinforcing India’s strategic autonomy in the global AI ecosystem [5,14].

The significance of BharatGen extends beyond its technical achievements. It intersects with broader movements in indigenous and decolonial AI scholarship, which advocate for technologies rooted in local knowledge systems, ethical practices, and community accountability [12,13]. BharatGen’s public–private consortium model, multilingual orientation, and policy-aligned framework represent a promising step towards operationalizing these principles at the national scale.

This paper critically analyzes the role of BharatGen in advancing indigenous AI in India. Specifically, it investigates how the initiative contributes to the twin goals of data sovereignty and language equity, evaluates its design and ecosystem through existing multilingual and multimodal AI frameworks, and situates it within global efforts to promote culturally grounded and inclusive AI. Through this lens, the study aims to offer an evidence-based understanding of how national AI initiatives such as BharatGen can serve both as technological innovations and as instruments of digital self-determination.

2. Related Work

The development of indigenous and multilingual AI systems has received growing attention in response to the dominance of English-language datasets and Western-centric machine-learning paradigms. Scholars have argued that large language models often perpetuate global inequities by marginalizing low-resource languages and reinforcing epistemic and cultural biases [1–3]. These critiques have catalyzed new approaches to AI development that emphasize inclusivity, representation, and linguistic justice.

Sovereign AI, as both a policy framework and technological vision, has emerged to address these issues by advocating for AI systems developed, governed, and trained with local data and infrastructure [4,5]. National initiatives across France, China, and South Korea reflect this shift, and India’s BharatGen exemplifies how sovereign AI principles can be localized and scaled in the Global South [6–9]. Built as a foundational LLM with multimodal capabilities, BharatGen embodies India's strategic objective of aligning its AI ecosystem with the national priorities of inclusion and autonomy.

In the Indian context, research efforts such as IndicNLP [16], IndicTrans2 [15], and datasets curated by AI4Bharat [17] have advanced support for regional languages through high-quality corpora and translation systems. However, these fragmented efforts have historically lacked integration into the national AI framework. BharatGen attempts to bridge this gap by incorporating these assets into a unified model architecture and national infrastructure.

Indigenous and decolonial AI scholars have further contributed to this discourse by emphasizing the importance of AI systems rooted in local knowledge traditions, community protocols, and relational accountability [1–3,12,13]. Their work challenges the assumptions embedded in the dominant AI pipelines and offers ethical and epistemological alternatives. While BharatGen is not explicitly framed within these scholarly traditions, its commitment to linguistic inclusion, cultural grounding, and public sector development reflects several principles aligned with this body of thought [7].

Recent technical studies have also highlighted the persistent challenges in developing LLMs for low-resource languages, including scarce labeled data, a shortage of domain-specific benchmarks, and architectural bias towards high-resource languages [10,11,14]. Initiatives such as Stanford HAI’s policy reports [11] and research on federated learning in multilingual contexts [14] have provided strategic pathways for overcoming these barriers. Meanwhile, language preservation initiatives leveraging generative AI explore both the promise and pitfalls of using such technologies to support linguistic diversity [13].

In summary, BharatGen sits at the intersection of multiple strands of work: sovereign AI policy, multilingual NLP infrastructure, and culturally responsive AI ethics. Its ambition to develop an inclusive, India-centric LLM builds upon prior data and policy initiatives [5,9], and engages with theoretical frameworks that seek to realign AI development with local values and needs.

3. Research Objectives and Questions

The objective of this study is to critically analyze BharatGen’s role in shaping a sovereign and inclusive AI ecosystem in India, with a particular focus on two interrelated domains: data sovereignty and language equity. As India advances its digital public infrastructure and foundational AI capabilities, initiatives such as BharatGen offer a unique opportunity to examine how national-level AI development aligns with the principles of cultural contextualization, linguistic diversity, and ethical autonomy.

This research builds on the understanding that large-scale AI systems are not neutral infrastructures but sociotechnical systems that reflect and reinforce particular power structures [1–3,12]. By evaluating BharatGen through this lens, this study seeks to assess both its technological potential and socio-political implications within the broader framework of indigenous and sovereign AI.

Specifically, this study addresses the following research questions:

RQ1: How does BharatGen contribute to the advancement of data sovereignty in the Indian AI ecosystem?

What mechanisms are in place to ensure national ownership, storage, and governance of training data?
How does BharatGen’s infrastructure compare with global sovereign AI efforts in terms of autonomy and localization?

RQ2: To what extent does BharatGen address the challenges of language equity in large language model development?

How inclusive is the model in terms of linguistic diversity across India’s 22 scheduled languages?
What approaches can be used to overcome the limitations of low-resource languages in the training and fine-tuning stages?

RQ3: What are the broader implications of BharatGen for ethical, inclusive, and culturally grounded AI designs?

How does the initiative reflect or diverge from indigenous and decolonial principles of AI?
What are the risks and opportunities of public-private partnerships in shaping India’s sovereign AI future?

These research questions serve as analytical anchors for the remainder of this study. They guide the evaluation of BharatGen not only as a technological system, but also as a strategic artifact shaped by policy, cultural values, and the evolving landscape of global AI governance.

4. Methodology

This study adopts a qualitative, interpretive methodology grounded in critical AI studies, policy analysis, and socio-technical systems theory. The aim is not to benchmark BharatGen as a technical artifact in isolation but to examine its development and deployment in relation to national policy goals, ethical frameworks, and language inclusivity imperatives.

4.1 Analytical Framework

To evaluate BharatGen’s role in advancing data sovereignty and language equity, this study employs a multidimensional analytical framework with three core lenses:

Sociotechnical Systems Perspective
This lens situates BharatGen as part of a broader system of interactions among technology, institutions, data practices, and societal needs. It emphasizes the influence of national priorities, governance structures, and public-private partnerships on the shaping of AI systems [3,5,9].
Critical and Decolonial AI Frameworks
Drawing from Indigenous and decolonial AI scholarship, this perspective evaluates whether BharatGen embodies principles such as epistemic inclusion, community-aligned design, ethical relationality, and resistance to extractive data practices [1–3,12,13].
Language Technology and Multilingual NLP Evaluation
This dimension assesses the model’s linguistic scope and technical choices, including the use of datasets such as IndicNLP and IndicTrans2 [15–17], its approach to low-resource language modeling [10,11,14], and its alignment with current best practices in multilingual LLM development.

4.2 Data Sources and Material

This research is based on an in-depth review and triangulation of the following sources:

Primary policy and institutional documents: press releases [6], official websites [7,9], strategy papers [8], and publicly available technical documentation from BharatGen and INDIAai.
Academic and industry literature: peer-reviewed papers, arXiv preprints, and research reports covering sovereign AI, multilingual NLP, and Indigenous AI ethics [1–4,10–20].
Comparative initiatives: global sovereign LLM efforts (e.g., BLOOM and WuDao), used for contextual benchmarking.
Analytical coding: key themes and patterns were derived through thematic coding across these documents to map BharatGen’s strategies against the research questions.

4.3 Limitations

While this study provides a critical and conceptual evaluation of BharatGen, it does not include empirical model performance benchmarks (e.g., accuracy, perplexity, or BLEU scores) because of limited public access to evaluation datasets and full model specifications. Instead, it emphasizes structural, linguistic, and governance-level analyses, which complement technical studies.

5. Analysis and Discussion

This section applies the analytical framework from Section 4.1 to critically evaluate BharatGen along two core dimensions: data sovereignty and language equity. A third subsection addresses the broader ethical and strategic implications for India’s AI ecosystem.

5.1 Data Sovereignty and Digital Autonomy (RQ1)

BharatGen’s development under the Department of Science and Technology and its execution via a consortium of Indian academic institutions and startups reflect a deliberate attempt to reclaim ownership of AI infrastructure and data assets [6,7,9]. The Bharat Data Sagar initiative, which aggregates culturally and linguistically diverse datasets across regions, reinforces this intention by emphasizing the domestication of data sources and a reduced dependency on foreign data pipelines [7,10].

From a governance perspective, BharatGen marks a strategic move toward sovereign AI infrastructure, echoing trends in global AI geopolitics [4,5]. Unlike commercial LLMs controlled by private corporations, BharatGen is publicly funded, guided by national policy frameworks, and aligned with the goals of the INDIAai mission and Digital India [8,9]. However, the current lack of transparency around data licensing, model openness, and computing infrastructure ownership raises questions regarding long-term sovereignty and accountability.

Furthermore, while the initiative centralizes control within Indian institutions, questions remain regarding inclusivity in data governance. Who decides what a dataset includes? What mechanisms exist for participatory oversight? These concerns align with critiques in decolonial AI literature, which warn against simply replacing one center of control with another unless epistemic plurality is prioritized [1–3,12].

5.2 Language Equity in AI Development (RQ2)

One of BharatGen’s most distinctive features is its explicit focus on linguistic inclusion. The model supports all 22 scheduled Indian languages and builds upon multilingual resources such as IndicNLP [16], IndicTrans2 [15], and corpora curated by AI4Bharat [17]. This constitutes a substantial improvement over global LLMs, which often neglect Indian languages or tokenize them inefficiently [10,11].
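The tokenization concern can be made concrete with a small illustration. The sketch below is not a measurement of BharatGen or of any particular model's tokenizer; it only compares how many UTF-8 bytes each character of a short greeting needs, which is a rough proxy for the cost that byte-level tokenizers start from before any merges are learned. The sample strings and the function name are illustrative assumptions.

```python
# Illustrative sample strings (assumed for this sketch, not taken from any dataset).
samples = {
    "English": "hello",
    "Hindi": "नमस्ते",
    "Tamil": "வணக்கம்",
}


def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


# Devanagari and Tamil code points each occupy three bytes in UTF-8,
# while ASCII letters occupy one.
ratios = {lang: utf8_bytes_per_char(text) for lang, text in samples.items()}

for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.1f} bytes per code point")
```

A tokenizer trained mostly on English text tends to learn few merges for such scripts, so a threefold gap at the byte level can carry through to longer token sequences for Indian languages. Actual token counts depend on the specific tokenizer and are not shown here.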

Technically, BharatGen’s approach to low-resource language modeling—through instruction tuning, dataset bootstrapping, and regional data partnerships—reflects state-of-the-art practices in equitable NLP [14,15]. However, it also inherits challenges common to this space, including data imbalance, limited benchmark tasks in Indian scripts, and the difficulty of generalizing across dialectal variations [10,11].
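To illustrate the data-imbalance challenge, the sketch below shows exponent-smoothed (often called temperature-based) language sampling, a technique widely used in multilingual model training. The sources reviewed here do not state whether BharatGen uses it; the corpus sizes, the function name, and the exponent value are hypothetical and chosen only for illustration.

```python
def sampling_probs(sizes: dict[str, int], alpha: float) -> dict[str, float]:
    """Compute per-language sampling probabilities p_i proportional to n_i ** alpha.

    alpha = 1.0 reproduces sampling in proportion to corpus size;
    smaller values shift probability mass toward low-resource languages.
    """
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical sentence counts, not real corpus statistics.
corpus_sizes = {"Hindi": 1_000_000, "Tamil": 100_000, "Bodo": 1_000}

proportional = sampling_probs(corpus_sizes, alpha=1.0)
smoothed = sampling_probs(corpus_sizes, alpha=0.3)

for lang in corpus_sizes:
    print(f"{lang}: proportional={proportional[lang]:.4f}, smoothed={smoothed[lang]:.4f}")
```

With these assumed sizes, smoothing raises the share of the smallest corpus and lowers that of the largest. The trade-off is that low-resource text is repeated more often during training, which is one reason the choice of exponent is usually tuned empirically.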

While multilingual inclusion is central to BharatGen’s design, linguistic equity also implies equitable use, access, and benefits. Questions arise about whether speakers of marginalized languages will meaningfully interact with BharatGen-powered systems and whether downstream applications (e.g., health and education) will reflect these linguistic priorities or merely showcase tokenistic coverage [13].

5.3 Cultural Alignment and Ethical AI Futures (RQ3)

BharatGen offers an opportunity to reimagine AI not just as a technological platform, but also as a cultural and ethical artifact. By anchoring its training in Indian languages, idioms, and public sector use cases, the initiative moves beyond the model-centricity of many Western LLMs and leans toward a socio-contextual AI vision [6,7].

This vision resonates with the principles articulated in indigenous and decolonial AI frameworks, including epistemic diversity, relational accountability, and localized knowledge systems [1,2,12,13]. However, BharatGen does not explicitly cite or engage with this body of research. Thus, there is a gap between alignment in practice and an explicit commitment to these values in design and governance.

Moreover, the model is developed through a public-private consortium that includes both state actors and major industry players. While this may accelerate deployment and innovation, it also raises concerns about extractive interests, digital inequity, and the risk of centralization under new techno-commercial regimes [5,19].

Ethically, BharatGen’s success depends not only on its technical performance, but also on its ability to remain transparent, participatory, and just. National AI initiatives must grapple with whether their output serves diverse public interests or reinforces dominant narratives under the guise of inclusivity.

6. Conclusion and Future Work

BharatGen represents a landmark initiative in India's AI journey, signaling a strategic and ethical shift toward sovereign, inclusive, and culturally grounded artificial intelligence. As this analysis has shown, BharatGen is more than a foundational multimodal LLM; it is a national experiment in embedding AI development within the linguistic, social, and policy contexts of a diverse democracy.

This initiative addresses pressing concerns related to data sovereignty by promoting national control over AI infrastructure, datasets, and institutional partnerships. It simultaneously advances language equity by prioritizing the development and deployment of AI technologies across India's 22 scheduled languages, thereby confronting the exclusionary tendencies of mainstream LLM architectures.

Despite its promise, BharatGen is still in its early stages of development. Key challenges remain regarding data transparency, performance benchmarking, inclusive governance, and equitable deployment. While its alignment with sovereign AI strategies and multilingual NLP best practices is evident, the lack of explicit integration of Indigenous AI ethics and decolonial frameworks reflects a missed opportunity for deeper engagement with justice-oriented AI design.

To ensure the success and scalability of BharatGen, future work must focus on the following:

Transparent governance and accountability mechanisms for dataset curation, model training, and public oversight.
Technical benchmarks and open evaluation frameworks to assess language coverage, bias mitigation, and societal impacts.
Community partnerships that involve regional linguistic communities, educators, and civil society in shaping use cases and interfaces.
Cross-national collaborations to position BharatGen as a model for sovereign AI in other multilingual and post-colonial contexts.
Deeper theoretical engagement with indigenous and critical AI studies to embed ethical considerations in technical roadmaps.

BharatGen is a powerful step toward claiming linguistic agency and technological autonomy in the age of global AI. Its evolution will not only shape India’s digital future but also offer insights into how nation-states can construct AI systems that are intelligent, inclusive, representative, and just.

7. Policy Recommendations

Several focused policy actions are recommended to help BharatGen achieve its full potential. First, transparent governance is essential. An independent oversight body with experts in AI, ethics, and linguistics should supervise data practices, model updates, and deployment. Regular public reporting of datasets, training parameters, and evaluation results would improve accountability and public trust.

Second, community participation should be strengthened. Funding should support regional linguists, translators, and local institutions working in under-represented languages. Involving these communities in data collection and validation supports both accuracy and inclusivity in model development.

Third, open access and interoperability should be prioritized. Making parts of BharatGen open source would encourage research and innovation. At the same time, its integration with platforms such as DigiLocker, Bhashini, and UPI requires standardized APIs and multilingual support to ensure usability across services.

Fourth, ethical principles must be embedded in the development process. Institutions working on BharatGen should provide training on responsible AI, fairness, and cultural context. National policies should also recognize the value of indigenous knowledge systems and promote digital rights.

Finally, BharatGen can serve as a model for international collaboration. India should explore partnerships with other multilingual nations to co-develop AI systems that focus on local languages and contexts. Such efforts can support a broader movement toward inclusive sovereign AI across the Global South.

References

[1] Leverhulme Centre for the Future of Intelligence (LCFI). (2021). Decolonising AI Project. University of Cambridge. https://www.lcfi.ac.uk/research/project/decolonising-ai/
[2] Arora, A., Barrett, M., Lee, E., Oborn, E., & Prince, K. (2023). Risk and the future of AI: Algorithmic bias, data colonialism, and marginalization. Information and Organization, 33(3), 100478. https://doi.org/10.1016/j.infoandorg.2023.100478
[3] Mohamed, S., Png, M. T., & Isaac, W. (2020). Decolonial AI: Decolonial theory as sociotechnical foresight in artificial intelligence. Philosophy & Technology, 33, 659–684. https://doi.org/10.1007/s13347-020-00405-8
[4] Digital Realty. (2024). What is sovereign AI and why is it growing in importance? https://www.digitalrealty.com/resources/articles/what-is-sovereign-ai
[5] Observer Research Foundation. (2023). Sovereign Data Strategies: Boosting or Hindering AI Development in India? ORF Analysis. https://www.orfonline.org/expert-speak/sovereign-data-strategies-boosting-or-hindering-ai-development-in-india
[6] Press Information Bureau. (2025, June 2). Union Minister Dr. Jitendra Singh launches 'Bharat Gen' – India’s first indigenously developed AI-based Multimodal LLM for Indian Languages. Government of India. https://pib.gov.in/PressReleseDetail.aspx?PRID=2133312
[7] BharatGen Team. (2025). About BharatGen. BharatGen Official Website. https://bharatgen.tech
[8] NITI Aayog. (2018). National Strategy for Artificial Intelligence #AIForAll. Government of India. [Available via Digital India archives or indiaai.gov.in]
[9] Ministry of Electronics and Information Technology. (2025). INDIAai – AI initiatives and digital governance in India. https://indiaai.gov.in
[10] Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095. https://doi.org/10.48550/arXiv.2004.09095
[11] Stanford HAI. (2024). Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts. https://hai.stanford.edu/policy/mind-the-language-gap-mapping-the-challenges-of-llm-development-in-low-resource-language-contexts
[12] Lewis, J. E., Arista, N., Pechawis, A., & Kite, S. (2020). Indigenous Protocol and Artificial Intelligence Position Paper. Indigenous AI Working Group. https://www.indigenous-ai.net
[13] Koc, V. (2025). Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges. arXiv preprint arXiv:2501.11496. https://doi.org/10.48550/arXiv.2501.11496
[14] Moskvoretskii, V., Tupitsa, N., Biemann, C., Horváth, S., Gorbunov, E., & Nikishina, I. (2024). Low-Resource Machine Translation through the Lens of Personalized Federated Learning. arXiv preprint arXiv:2406.12564. https://doi.org/10.48550/arXiv.2406.12564
[15] Gala, J., Chitale, P. A., AK, R., Gumma, V., Doddapaneni, S., Kumar, A., ... & Kunchukuttan, A. (2023). IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages. arXiv preprint arXiv:2305.16307. https://doi.org/10.48550/arXiv.2305.16307
[16] Kakwani, D., Kunchukuttan, A., Golla, S., NC, G., Bhattacharyya, A., Khapra, M. M., & Kumar, P. (2020, November). IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4948–4961). https://doi.org/10.18653/v1/2020.findings-emnlp.445
[17] Choudhury, M., & Bali, K. (2021). Technologies for Indian Languages: Past, Present, and Future. In AI4Bharat Compendium.
[18] Sirsat, M. S., & Ghosh, S. (2022). India's AI Ecosystem: A Policy Perspective. AI and Ethics, 2(3), 455–467.
[19] Tami, M. A., Elhenawy, M., & Ashqar, H. I. (2025). Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends. arXiv preprint arXiv:2504.16134. https://doi.org/10.48550/arXiv.2504.16134
[20] Manche, R., & Myakala, P. K. (2022). Explaining black-box behavior in large language models. International Journal of Computing and Artificial Intelligence, 3(2). https://doi.org/10.33545/27076571.2022.v3.i2a.126


Opinions expressed by DZone contributors are their own.
