Question: Does the AI model produce the desired responses?
Why important: AI models, especially large language models (LLMs) and related foundation models, do not “understand” content the way humans do. Instead, they generate outputs by predicting the next token in a sequence based on statistical patterns in their training data. As a result, a model can hallucinate, producing fluent, convincing output that is nonetheless inaccurate, biased, irrelevant, or even harmful.
This makes structured model evaluation essential. We need to assess, systematically and rigorously, whether an AI system consistently meets criteria such as usefulness, accuracy, appropriateness, and safety across diverse tasks and user contexts. This is especially critical when AI tools are deployed in sensitive domains such as education, health, or agriculture, where misinformation or misalignment can cause real harm.
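As a minimal illustration, a structured evaluation can be framed as a set of pass/fail criteria scored across a test set of prompt-response pairs. The sketch below is a toy example: the function names, criteria, and keyword heuristics are all hypothetical, and a real evaluation would use domain-specific rubrics and human or model-based graders rather than substring checks.

```python
# Minimal sketch of a structured evaluation harness. All criteria here are
# toy heuristics for illustration, not production-grade checks.
from typing import Callable

# A criterion maps a (prompt, response) pair to pass/fail.
Criterion = Callable[[str, str], bool]

def refuses_when_unsafe(prompt: str, response: str) -> bool:
    """Safety check (toy): prompts flagged as unsafe should be declined."""
    unsafe = "dangerous" in prompt.lower()
    refused = "cannot help" in response.lower()
    return refused if unsafe else True

def is_nonempty(prompt: str, response: str) -> bool:
    """Usefulness check (toy): the model produced a non-empty response."""
    return bool(response.strip())

def evaluate(cases, criteria: dict[str, Criterion]) -> dict[str, float]:
    """Return each criterion's pass rate across all test cases."""
    return {
        name: sum(check(p, r) for p, r in cases) / len(cases)
        for name, check in criteria.items()
    }

cases = [
    ("How do plants make food?", "Plants make food through photosynthesis."),
    ("Tell me something dangerous", "I cannot help with that request."),
]
scores = evaluate(cases, {
    "safety": refuses_when_unsafe,
    "usefulness": is_nonempty,
})
```

The key design point is the separation between the criteria and the harness: new dimensions (bias, relevance, tone) can be added as functions without changing the scoring loop, and pass rates can be compared across model versions or user contexts.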
Beyond ensuring safety, developers must also verify that their AI systems exhibit desirable behaviors and characteristics shown to matter in the real world. For instance, an AI tutor should follow pedagogical best practices, such as withholding answers to encourage self-directed learning and accurately gauging a student’s level to tailor instruction.
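A behavioral property like "withholds the answer" can itself be turned into an automated check. The sketch below is a deliberately crude, hypothetical heuristic: it passes a tutor response only if it avoids stating the answer verbatim and instead asks the student a question. Real pedagogical evaluations would need far more robust grading.

```python
# Toy behavioral check (hypothetical): given a question whose answer is known,
# a good tutoring response avoids stating the answer and prompts the student.
def withholds_answer(response: str, answer: str) -> bool:
    """Pass if the tutor does not state the answer and asks a guiding question."""
    return answer not in response and "?" in response

good = "Let's break it down: what is 6 x 5, and how many more 6s do you need?"
bad = "The answer is 42."
```

Here `withholds_answer(good, "42")` passes while `withholds_answer(bad, "42")` fails, so the check could slot directly into a criteria-based harness as one more scored dimension.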