(tl;dr) The Global Evaluation Initiative has released a landmark white paper timed to Glocal Evaluation Week 2026. It gives M&E professionals a structured framework for asking better questions about AI, and the answers may surprise you.
Barely a week passes without a new AI tool, guidance note, policy statement, or webinar landing in the inboxes of M&E professionals around the world. The volume is overwhelming, and yet the question that matters most remains frustratingly underexplored: When, exactly, should you use AI in evaluation work, and how do you know whether you are doing it well?
That question is the organizing problem of “Navigating AI and Digitalization in Monitoring and Evaluation,” a new white paper by Douglas Glandon, published with the support of the Global Evaluation Initiative (GEI) at the World Bank. The paper arrives just as the evaluation community gathers for the Glocal Evaluation Week 2026 under the theme “Evaluation, Evidence, and Trust in the Age of AI”, which makes the timing feel less coincidental and more like a deliberate opening argument in a conversation the profession badly needs to have.
This post works through the paper closely, section by section, and connects its ideas to the broader context of the GEW 2026 program. It is written for practitioners who already know the stakes of getting AI adoption right in international development, and who are looking for more than slogans.
Three Problems With How We Are Talking About AI in M&E
Before proposing a framework, the paper takes the time to diagnose what is missing in the current discourse. It identifies three structural weaknesses, each of which distorts the decisions practitioners and institutions are trying to make.
AI as a monolithic category
The most common starting point for organizations has been to develop policies, training programs, or guidance on “AI” as a single category. This is understandable as an initial response to a rapidly evolving landscape. But it obscures differences that are critically important in practice. Computer vision applied to satellite imagery, a predictive machine learning model used for program targeting, and a large language model (LLM) processing qualitative interview transcripts differ fundamentally in the data they require, the skills needed to use them, the ways they can go wrong, the ethical considerations they raise, and the kinds of evaluative questions they can credibly address. Writing guidance on “the strengths and limitations of AI” without making these distinctions is, the paper notes, roughly as useful as guidance on “the strengths and limitations of research methods” without distinguishing between randomized controlled trials and participatory action research.
The dominance of LLMs in current discussions compounds the problem. Because large language models are the most visible AI technology at the moment, the word “AI” in M&E conversations is often functioning as shorthand for this specific, and in several respects quite distinctive, subset of applications. Organizational policies calibrated around LLMs may do well for one slice of the landscape and poorly for most of the rest.
Tool-first rather than need-first orientation
A large share of AI-related content in M&E starts with a technology and works toward applications. This is useful for practitioners who already know they want to use a particular tool. But it is poorly suited to the more common situation: a practitioner facing a specific evidence need who wants to know whether an AI-enabled approach might help, or an institution wondering what capabilities it actually needs to build. Starting with the tool frames AI as the answer before the question has been fully formulated. In the paper’s words, it “implicitly positions AI as a solution,” which is precisely the wrong starting point for rigorous evaluation thinking. Good evaluation practice has always started from the question rather than the method, but the pace of AI adoption has made the tool-first pull unusually strong.
Fragmented contributions
Existing literature on AI in M&E tends to address one dimension at a time: either the technical applications, or the competency implications, or the ethical dimensions, or the industry dynamics. Each contribution adds value. But in practice, these dimensions converge. A practitioner deciding whether to use NLP for qualitative analysis is simultaneously navigating evidence questions (what can actually be learned?), workflow questions (how does this fit the timeline and process?), and capability questions (does the team have the skills to do this responsibly?). A framework that helps practitioners hold all three dimensions together is more useful than any single-dimension treatment, however sophisticated.
The Three-Lens Framework
The paper’s core contribution is a framework organized around three lenses. Each lens captures a distinct relationship between practitioners and technology, and each starts from where practitioners actually are: what they need to know, what they need to do, and what they need to become. Together, the lenses are designed to support what the paper calls “more informed, context-sensitive decisions about AI in M&E.”
Lens 1: Evidence Needs
“What kinds of evidence do we need, and which types of AI-enabled approaches might help us generate each?”
This lens is epistemic. The practitioner has identified one or more evidence needs and wants to know whether a specific AI-enabled approach might expand what is possible, either by addressing questions that were previously infeasible or by generating evidence that is more valid, more timely, more granular, or more feasible than conventional approaches allow. AI enters the picture only when it offers a credible way to do so. This lens draws on the GEI policy/program cycle, which organizes evidence needs across five stages, from situation analysis through to results assessment.
Lens 2: Workflow
“What are the constraints in our evaluative work processes, and could AI-enabled approaches help address them?”
This lens is operational. The practitioner faces constraints in data access, processing capacity, analytical bandwidth, or communication reach, and wants to know whether AI can address them. The paper distinguishes three meaningfully different types of workflow enhancement. Automation covers well-defined, repetitive tasks with minimal interpretive judgment, such as transcription, format conversion, and data cleaning, where AI frees practitioners to focus on higher-order work. Augmentation covers tasks where human interpretation remains essential, but AI improves consistency, coverage, or speed. Enabling previously infeasible processes is the most consequential category: tasks not practically possible without AI, such as continuous real-time monitoring at population scale, or near-simultaneous production of audience-tailored evaluation outputs in multiple languages from a single evidence base.
This third category is frequently overlooked. Practitioners focused only on improving existing workflows may miss the possibility that AI opens up entirely different ways of generating and using evaluative evidence.
Lens 3: Capability
“What do we need to know and have in place to work responsibly in an AI-influenced environment?”
This lens is developmental. Where the first two lenses ask whether a specific AI application serves a specific purpose, this lens asks what practitioners and institutions need to develop to navigate an AI-influenced professional environment. It encompasses technical skills, critical judgment, ethical reasoning, and institutional frameworks. The lens draws on the GEI Evaluation Competency Framework, which identifies five competency domains: professional, technical, managerial, interpersonal, and contextual. The paper’s key argument here is that AI does not require rewriting this framework from scratch. Most of what makes a good M&E professional remains unchanged. AI requires additional, AI-specific competencies layered onto existing foundations.
Importantly, the paper argues that critical judgment may matter more than technical skill. As AI tools become more accessible, the professional challenge increasingly shifts from “can I use this tool?” to “should I use it, and can I critically evaluate its outputs?”
Lens 1 in Practice: The Spectrum of Evidence Amenability
One of the most useful analytical moves in the paper is the identification of a spectrum from “highly amenable to AI” to “fundamentally requiring human judgment.” Understanding where a specific evidence need falls on this spectrum is a key practical contribution of the evidence lens.
At the amenable end of the spectrum sit approaches that involve large-scale pattern recognition, integration of heterogeneous data sources, or systematic scanning of large evidence bases. Predictive ML models detecting anomalies in program monitoring data that would be invisible to human analysts reviewing the same datasets are a textbook example. NLP techniques accelerating evidence synthesis by processing volumes of literature beyond what a review team could feasibly read are another.
At the other end sit ways of answering that are fundamentally normative or deliberative. Convening stakeholder judgment on priorities, weighing competing values, or interpreting findings in light of context-specific political and cultural considerations involve a different kind of knowing entirely. These are not simply more complex empirical tasks. They require human judgment about what matters and why, which cannot be derived from data alone regardless of how much data is available. AI tools may play supporting roles, for example by compiling and organizing stakeholder input to structure deliberation, but the core epistemic work remains human.
Between these poles lies a substantial middle ground, and thematic analysis of qualitative interview data is the paper’s primary example. LLMs are increasingly capable of processing and categorizing large volumes of text, and under certain conditions may synthesize information more consistently than human coders working under time pressure. But the validity of such analysis depends on whether the categories and interpretations are meaningful in context, something that requires human evaluative judgment that the LLM cannot provide on its own.
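One way to make that human check concrete is to have an evaluator independently code a subsample of excerpts and measure how far the AI-assigned codes agree with theirs. The sketch below is illustrative only: it assumes Cohen's kappa as the agreement statistic (the paper does not prescribe one), and the function name, code labels, and eight-excerpt sample are made up for the example.

```python
from collections import Counter


def cohens_kappa(human_codes, ai_codes):
    """Cohen's kappa for two coders who labelled the same excerpts."""
    if len(human_codes) != len(ai_codes) or not human_codes:
        raise ValueError("need two equal-length, non-empty lists of codes")
    n = len(human_codes)
    # Observed agreement: share of excerpts that received the same code.
    observed = sum(h == a for h, a in zip(human_codes, ai_codes)) / n
    # Chance agreement, from each coder's marginal code frequencies.
    human_freq = Counter(human_codes)
    ai_freq = Counter(ai_codes)
    expected = sum(human_freq[c] * ai_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders applied one identical code throughout
    return (observed - expected) / (1 - expected)


# Made-up codes for eight interview excerpts, for illustration only.
human = ["access", "cost", "trust", "access", "cost", "trust", "access", "cost"]
ai = ["access", "cost", "trust", "cost", "cost", "trust", "access", "access"]
print(round(cohens_kappa(human, ai), 3))  # prints 0.619
```

A high kappa shows only that the AI applies the codebook much as the human does. Whether the categories themselves are meaningful in context remains a judgment the statistic cannot make.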
Lens 2 in Practice: Where the Workflow Lens Gets Complicated
The workflow lens maps AI applications across the full task structure of evaluative work: managing the overall process, framing the evaluation, engaging the team, designing the evaluation, conducting data collection and analysis, and reporting and applying findings. Across all these clusters, the paper offers detailed illustrative mappings, but several considerations deserve particular attention.
First, the paper introduces the concept of human-in-the-loop design as a workflow question, not just a governance principle. Effective AI-augmented practice depends on defining structured points in the workflow where human input shapes AI processing, where AI outputs are reviewed before being acted upon, or where iterative exchange between the two improves the quality of both contributions. Getting these interaction points right is a workflow design challenge first, and a capability challenge second. The two are inseparable.
Second, the paper is unusually direct about the uneven distribution of workflow benefits. AI tools perform less reliably in languages that are poorly represented in their training data. Applications that depend on structured digital data only work where that data infrastructure already exists. And while AI can help under-resourced teams do more with less, those same teams may lack the expertise and institutional safeguards to catch errors in AI-generated outputs. The efficiency gains tend to accrue to those already well-resourced, while the risks concentrate where capacity is thinnest. This equity dimension is not mentioned as a caveat; it is named as a structural feature of the current landscape that requires deliberate counteraction.
Third, the paper notes a category of new dependencies that AI-enabled workflows introduce: dependencies on connectivity, commercial platforms, proprietary data formats, and specialized maintenance skills. These are not self-managing. They require anticipation, risk assessment, and institutional planning of a kind that many M&E teams are not yet doing systematically.
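The first of these considerations, a structured point where AI outputs are reviewed before being acted upon, can be sketched as a simple gate in code. This is a minimal illustration under stated assumptions, not a design taken from the paper: the `Draft` structure, the `review_gate` function, and the stand-in reviewer rule are all hypothetical, and in real use the reviewer would be a person rather than a function.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Draft:
    """An AI-produced output awaiting a human decision (hypothetical structure)."""
    text: str
    status: str = "pending"  # becomes "approved" or "rejected" after review
    reviewer_note: str = ""


def review_gate(
    drafts: List[Draft],
    reviewer: Callable[[Draft], Tuple[bool, str]],
) -> List[Draft]:
    """Record a review decision on every draft and return only the approved ones."""
    for draft in drafts:
        approved, note = reviewer(draft)
        draft.status = "approved" if approved else "rejected"
        draft.reviewer_note = note
    return [d for d in drafts if d.status == "approved"]


def example_reviewer(draft: Draft) -> Tuple[bool, str]:
    # Stand-in for human judgment: reject any statement with no cited source.
    if "[source:" in draft.text:
        return True, "claim traceable to evidence"
    return False, "no source cited; return to analyst"


drafts = [
    Draft("Coverage rose in district A [source: survey wave 2]"),
    Draft("Beneficiaries were broadly satisfied"),
]
usable = review_gate(drafts, example_reviewer)
print([d.text for d in usable])
```

The design choice worth noting is that nothing leaves the gate without a recorded decision and note, so the review step is part of the workflow itself and leaves a trail that can be disclosed.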
Lens 3 in Practice: Sequencing, Readiness, and the Equity Problem
The capability lens is perhaps the most practically consequential, particularly for organizations working in international development contexts where M&E system strengthening is often still underway. The paper introduces the concept of sequencing as a central concern: not every institution is ready to deploy a particular type of AI application, and introducing AI prematurely risks wasting resources, producing poor-quality outputs, and eroding trust in both AI and evaluation as a practice.
This is a message that the international development community needs to hear clearly. There is institutional pressure, often driven by donor enthusiasm and reporting requirements, to demonstrate AI adoption. The capability lens provides the analytical basis for a different kind of argument: that the right answer to “are you using AI in your evaluations?” may well be “not yet, because we are still building the foundational capabilities that would make responsible use possible.” Pilot projects with clearly bounded scope can serve as a legitimate entry point, surfacing specific capability gaps without committing to widespread deployment before the prerequisites are in place.
The paper also addresses the governance duality that arises with accessible tools like LLMs. Because these tools are available to individuals without institutional approval, bottom-up adoption is already happening in most organizations. Governance policies need to manage this duality: enabling responsible individual use while maintaining organizational safeguards around information security, data privacy, and methodological integrity. This is named as among the most pressing institutional needs in the current moment.
The equity argument returns here with additional force. Access to AI tools, training, and infrastructure is unevenly distributed across organizations and countries. Without deliberate attention, AI adoption risks widening existing professional inequalities rather than narrowing them. The paper does not resolve this tension, but it names it in terms strong enough to prevent the reader from treating it as a marginal concern.
Why Integration Is the Whole Point
Each lens is genuinely useful on its own. But the paper’s central claim is that the lenses are interdependent in practice, and that any substantive decision about AI in M&E simultaneously involves evidence, workflow, and capability considerations, whether or not the decision-maker is aware of this. The framework’s value lies in making this interdependence visible and structured, so that practitioners can attend to each dimension deliberately rather than discovering gaps after the fact.
The paper illustrates this with two detailed examples drawn from its Appendix D, which provides full three-lens assessments for four AI application profiles. Both examples are worth dwelling on at length.
LLMs for drafting evaluation reports
Through the workflow lens alone, this application looks highly attractive: significant time savings, widely accessible tools, apparently low capability requirements. Through the evidence needs lens, the picture becomes more complicated. An LLM synthesizing a large body of evaluation findings could, in principle, identify patterns or connections across the evidence base that a time-pressed human author might miss. But LLMs can also flatten nuance, fabricate plausible-sounding content, and strip away the contextual judgment that makes evaluation findings meaningful. The evidence value is described in the paper as “limited but potentially non-trivial.”
The capability lens surfaces what is arguably the most important risk: evaluators need the judgment to distinguish when AI-generated text accurately reflects the underlying evidence and when it subtly distorts it. This is a competency that is difficult to develop and dangerously easy to overestimate. The integrated assessment concludes: exercise caution. Adoption is often driven by time pressure rather than evaluative value, and there is a high risk of undermining credibility without robust human-in-the-loop review processes. Adopt only with strong verification protocols, evaluators capable of detecting subtle distortions, and transparent disclosure of AI’s role.
Nighttime light satellite imagery for electrification programs
Through the evidence needs lens, this application is potentially transformative. Satellite sensors capture nighttime luminosity data that correlates strongly with electricity access and economic activity, enabling measurement at a geographic resolution and temporal frequency that household surveys cannot match. The paper notes that this same data can serve situation analysis, implementation monitoring, and outcome evaluation simultaneously, which is unusual. Through the workflow lens, it may replace or supplement expensive field-based data collection with scalable, consistent, repeated observation across wide geographic areas and extended time periods.
The capability lens then introduces the constraints that make the difference between transformative and misleading. Interpreting this data requires understanding what satellite sensors can and cannot detect: light saturation in bright areas, insensitivity to economic activity that does not produce light, the need to validate against local survey or administrative data. Ground-truthing is a non-trivial undertaking requiring both technical expertise and local data access. Without it, impressive imagery can create false precision. The integrated assessment concludes: strategic investment is warranted. Evidence and workflow value is high, but realizing it requires deliberate capability building in geospatial analysis, validation methods, and interdisciplinary team composition.
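A first step in ground-truthing can be as simple as checking how well district-level luminosity tracks survey-measured electricity access, after setting aside readings at the sensor ceiling. The sketch below is a hedged illustration: the district values are invented, `SATURATION_CEILING` is a hypothetical threshold rather than a property of any particular sensor, and a Pearson correlation is only one of several possible validation checks.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented values for six districts, for illustration only.
luminosity = [2.1, 5.4, 9.8, 14.2, 30.5, 63.0]        # mean nighttime radiance
survey_access = [0.12, 0.35, 0.48, 0.66, 0.91, 0.93]  # share of households with electricity

SATURATION_CEILING = 63.0  # hypothetical sensor ceiling; readings at it are uninformative

# Drop saturated districts before comparing imagery with survey data.
pairs = [(l, s) for l, s in zip(luminosity, survey_access) if l < SATURATION_CEILING]
r = pearson_r([l for l, _ in pairs], [s for _, s in pairs])
print(f"districts compared: {len(pairs)}, r = {r:.2f}")
```

Even a strong correlation on such a check says nothing about activity that produces no light, which is why the interpretation still depends on local knowledge and survey or administrative data.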
“The challenge is not to become technologists, but to remain evaluators. The same capacities that define good evaluation, rigorous thinking, ethical commitment, contextual sensitivity, methodological pluralism, are precisely those needed to engage with AI responsibly.”
What the two examples together demonstrate is that the integrated picture is often not what any single lens would predict. An application that looks routine through one lens may be transformative through another, or may present risks that are invisible until the third lens is applied. This is the analytical case for treating the three lenses as a system rather than a checklist.
The Four AI Application Profiles Compared
Appendix D of the paper provides detailed three-lens assessments for four application profiles. The table below synthesizes the key dimensions and the integrated implication from each assessment.
| Application | Evidence Value | Workflow Impact | Capability Required | Integrated Implication |
|---|---|---|---|---|
| Nighttime light satellite imagery (electrification / economic activity proxy) | High | High | High | Strategic investment with deliberate capability building. Without ground-truthing, creates false precision. |
| ML for heterogeneous treatment effects (causal forests in RCTs) | High | Moderate | High | Strategic investment with technical partnerships. Most teams need specialist collaboration. |
| NLP for qualitative data analysis (interview transcripts / coding) | Moderate | Uncertain | Moderate to High | Invest in capability with clear guardrails. Useful as augmentation; risky as replacement. Requires independent human review of AI-coded subsamples. |
| LLMs for drafting evaluation reports (synthesis and summarization) | Limited | Moderate | Deceptively Low | Exercise caution. Adoption driven by time pressure, not evaluative value. Adopt only with strong verification protocols and transparent disclosure. |
The “deceptively low” capability rating for LLMs is one of the most important observations in the paper. The tools are easy to use, which creates the impression that the competency requirements are low. In practice, the hardest part is judging when AI-generated text accurately reflects the underlying evidence and when it subtly distorts it. This judgment requires deep familiarity with the evidence base, methodological confidence, and the ability to detect what is absent rather than merely what is present. These are high-level evaluative competencies, not basic skills.
Ethics?
One of the paper’s most distinctive structural choices is the treatment of ethics. Rather than assigning ethics its own section, the paper distributes ethical considerations across all three lenses, arguing that ethics manifests differently depending on which aspect of AI use is under consideration. This is a more demanding position than dedicating a chapter to “AI ethics,” because it requires practitioners to identify the specific ethical questions that arise in their particular situation rather than applying generic principles.
Through the evidence needs lens, ethical questions center on validity and fairness. Could biases in training data lead to conclusions that misrepresent particular groups? Are the limitations of AI-generated evidence transparently communicated to users? These are questions about what the evidence can credibly claim, and who it may inadvertently harm through selective representation.
Through the workflow lens, ethical questions center on quality, consent, and the distribution of benefits. Does automation maintain quality standards? Are data subjects informed about AI processing of their personal information? Do efficiency gains accrue primarily to evaluators while risks fall on those being evaluated? This last question deserves particular attention in development contexts, where the power asymmetry between evaluators and program beneficiaries is already significant.
Through the capability lens, ethical questions center on professional responsibility and institutional accountability. Are practitioners deploying tools they cannot critically evaluate? Do institutions have the governance mechanisms to ensure meaningful human oversight? The paper names the risk of practitioners using AI tools beyond their competence level as an ethical issue, not merely a quality issue. This is a meaningful distinction. It places professional responsibility for AI outputs squarely with the practitioner, rather than with the technology.
From Individual Evaluator to National M&E System
A feature of the framework that may not be immediately apparent from the description of the three lenses is its multi-level applicability. Each lens can be applied at the individual practitioner level, the organizational level, and the national M&E system level, and the questions it generates are different at each level while the structure remains the same.
At the individual level, the questions are practical: What evidence does this evaluation need? What process constraints am I facing? Do I have the skills? At the organizational level, the questions become strategic: What evidence does our portfolio need? Where are systemic process constraints? What institutional capabilities do we need to build? At the national system level, the questions become systemic: What policy questions could AI help the system answer? Where are system-level bottlenecks? What system-wide readiness conditions are needed, including whether foundational M&E capacities should be strengthened before AI applications are introduced?
This multi-level applicability is particularly important for organizations working on evaluation capacity development, which typically operate simultaneously at the individual, organizational, and system levels. The framework provides a consistent structure for analysis across all three levels, which makes it more useful for capacity development planning than tools designed primarily for individual practitioners.
The paper explicitly flags the national M&E system level as an area where caution about premature AI adoption is most important. In settings where foundational M&E capacities are still being established, introducing AI applications before the prerequisites are in place risks producing unreliable outputs, wasting scarce resources, and eroding trust in both AI and evaluation as a professional practice. This argument has direct implications for how development partners frame AI-related support to national evaluation systems.
Glocal Evaluation Week 2026: Why This Moment Matters
The Glocal Evaluation Week is not a conference in the conventional sense. It is a globally distributed event in which individuals and organizations self-organize sessions on topics they care about, free of charge and open to everyone. Over 300 events are being held worldwide this year, ranging from brief webinars to full-day workshops, in every format and timezone. The decentralized, community-driven nature of the GEW is itself a statement about the kind of knowledge production the evaluation field values: plural, participatory, and not controlled by any single institution.
The choice of theme for 2026, “Evaluation, Evidence, and Trust in the Age of AI,” reflects a maturation in the field’s engagement with this topic. Two years ago, the dominant conversation was about possibilities and risks in the abstract. The conversation has shifted to more concrete and operational territory: how do you actually maintain the trust that evaluation depends on when the tools and methods are changing faster than the professional norms that govern them?
The GEI paper fits directly into this conversation. Its insistence on need-first rather than tool-first thinking, its attention to the equity dimensions of AI adoption, and its distributed treatment of ethics across all three lenses are all consistent with a profession trying to navigate a genuinely difficult transition without losing what makes evaluation valuable in the first place.
Sessions worth attending
The full program is available at glocalevalweek.org/glocal-events with over 300 sessions to explore. The following are particularly relevant to the themes of this post:
- 3 June: Rethinking Evaluation Competencies for the Transformational Imperative · International Evaluation Academy · Online, free
- 3 June: Doing More with Less while Not Losing Trust: How Should Evaluation Standards Evolve in an AI Augmented World? · EvalforEarth · Online · 15:00 CEST
- 4 June: AI and Evaluation in Complex Contexts: Climate Resilience, Disaster Response and Humanitarian Action · Online, free
- 5 June: Critical Approaches to AI from African and Indigenous Perspectives · Online, free
What This Means for Capacity Development Work
For practitioners working on evaluation capacity development, the three-lens framework is more than a decision-support tool for individual evaluations. It is a diagnostic instrument for understanding what a partner institution actually needs before AI-related support is designed.
The standard framing of capacity development support in this area has been to offer training on AI tools or, at a more sophisticated level, to develop organizational AI policies. The framework suggests a more differentiated approach. Before training on tools, does the institution have the data infrastructure on which any AI application depends? Before developing governance policies, does it have the institutional learning processes needed to apply those policies sensibly? Before investing in advanced analytical capabilities such as ML for heterogeneous treatment effects, are the foundational M&E competencies in place that make it possible to interpret and communicate results responsibly?
The sequencing perspective the paper introduces is particularly important here. Capacity development support that introduces AI applications ahead of the foundational prerequisites does not accelerate transformation. It risks producing unreliable outputs that erode institutional trust in evaluation, which is the opposite of what the field is trying to build. The framework provides a principled basis for designing support that builds in the right sequence, rather than defaulting to the most visible or donor-attractive applications.
There is also a more immediate application for consultants and evaluators working with partner organizations on specific evaluations. The three lenses offer a structured way to frame the conversation about AI: starting not with “should we use AI?” but with “what evidence do we need, what constraints do we face, and do we have what it takes to use AI responsibly for this specific purpose?” These are questions that can be asked in a client workshop, an inception meeting, or a terms of reference review. They generate more useful answers than a generic discussion of AI pros and cons.
What the Paper Leaves Open
The paper is honest about its limitations. It describes the institutional capability section as “outlin[ing] institutional capability dimensions without constituting a full institutional capability framework,” and flags the development of a comprehensive, validated framework for institutional AI readiness in M&E as a substantial undertaking for future work. Similarly, the proposed AI/digitalization readiness module within the GEI’s M&E Systems Analysis (MESA) framework is described as under active exploration but not yet available.
The paper also notes that the capability lens competency mapping has not been extended to other professional roles in M&E systems: policymakers, evaluation commissioners, and senior officials who require AI literacy competencies for oversight, accountability, and informed commissioning. This is an important gap. The decisions that most shape how AI is used in national M&E systems are often made by people who are not evaluators themselves, and who therefore fall outside the current scope of the competency framework.
More broadly, the paper is a starting point rather than a finished product, and it explicitly positions itself this way. The concluding invitation is for the global M&E and evaluation community to “apply, test, challenge, and refine” the framework. This is the right posture. A framework that has been tested against the diversity of contexts, resource levels, and institutional arrangements in which evaluation is practiced internationally will be considerably more robust than one developed primarily within a single institutional context, however well-resourced. The GEW 2026 represents exactly the kind of global, decentralized testing ground where that development can happen.
Reading This Paper at the Right Moment
The GEI white paper arrives at a moment when the evaluation field is at genuine risk of splitting into two camps: enthusiastic adopters who treat AI as a straightforward productivity gain, and skeptics who treat it as a threat to the rigor and integrity that make evaluation worth doing. The three-lens framework is a sophisticated attempt to make this binary obsolete. It does not tell practitioners whether to use AI. It gives them a structured way to ask better questions about when, for what purpose, under what conditions, and with what safeguards.
That is a more useful contribution than either cheerleading or warning. The profession already has enough of both.
The Glocal Evaluation Week 2026 is a week-long global argument about exactly these questions, distributed across dozens of countries and hundreds of practitioners. The paper is the most substantive single contribution to that argument published in time for the week. Reading it before or alongside the GEW sessions will make both more productive.