Proceedings of the National Academy of Sciences of the United States of America
Incorporating ethics and welfare into randomized experiments

Edited by Parag Pathak, Massachusetts Institute of Technology, Cambridge, MA, and accepted by Editorial Board Member Paul R. Milgrom September 30, 2020 (received for review May 4, 2020)

Author contributions: Y.N. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

Abstract

Randomized controlled trials (RCTs) determine the fate of numerous people, giving rise to a long-standing ethical dilemma. The goal of this paper is to alleviate this dilemma. To do so, this paper proposes and empirically implements an experimental design that improves subjects’ welfare while producing similar experimental information as typical RCTs do.

Randomized controlled trials (RCTs) enroll hundreds of millions of subjects and involve many human lives. To improve subjects’ welfare, I propose a design of RCTs that I call Experiment-as-Market (EXAM). EXAM produces a welfare-maximizing allocation of treatment-assignment probabilities, is almost incentive-compatible for preference elicitation, and unbiasedly estimates any causal effect estimable with standard RCTs. I quantify these properties by applying EXAM to a water-cleaning experiment in Kenya. In this empirical setting, compared to standard RCTs, EXAM improves subjects’ predicted well-being while reaching similar treatment-effect estimates with similar precision.


Today is the golden age of randomized controlled trials (RCTs). RCTs started out as safety and efficacy tests of farming and medical treatments, but have since grown to become the society-wide standard of evidence.

RCTs involve large numbers of participants. Between 2007 and 2017, over 360 million patients and 22 million individuals participated in registered clinical trials and social RCTs, respectively. Moreover, these experiments often randomize high-stakes treatments. For instance, in a glioblastoma therapy trial (1), the 5-y death rate of glioblastoma patients was 97% in the control group, but only 88% in the treatment group. In expectation, therefore, the lives of up to 9% of the study’s 573 participants depended on who received treatments. Social RCTs also often randomize critical treatments such as basic income, high-wage job offers, and HIV testing.

RCTs, thus, influence the fate of many people around the world, raising a widely recognized ethical concern with the randomness of RCT treatment assignment: “How can a physician committed to doing what he thinks is best for each patient tell a woman with breast cancer that he is choosing her treatment by something like a coin toss? How can he give up the option to make changes in treatment according to the patient’s responses?” (ref. 2, p. 1385).

To address this ethical concern, this paper develops an experimental design that optimally incorporates subject welfare. I define welfare by two measures: 1) the predicted effect of each treatment on each subject's outcome; and 2) each subject's preference or willingness to pay (WTP) for each treatment. My experimental design improves welfare compared to RCTs, while also providing unbiased estimates of treatment effects. The proposed design thereby extends prior pioneering designs that incorporate only parts of these welfare measures (3–9). This proposal also complements clinical-trial regulations that safeguard patients from excessive experimentation (10), as well as adaptive experimental designs that aim to estimate treatment effects as precisely as possible (11).

I start by defining experimental designs as procedures that determine each subject’s treatment-assignment probabilities based on data about the two welfare measures. In practice, the experimenter may estimate the welfare measures from prior experimental or observational data, or ask subjects to self-report them (especially WTP).

I propose an experimental design that I call Experiment-as-Market (EXAM). I choose this name because EXAM is an experiment based on an imaginary centralized market and its competitive equilibrium (12, 13). EXAM first endows each subject with a common artificial budget and lets her use the budget to purchase the most preferred (highest WTP) bundle of treatment-assignment probabilities given their prices. The prices are personalized so that each treatment is cheaper for subjects with better predicted effects of the treatment. EXAM computes its treatment-assignment probabilities as what subjects demand at market-clearing prices, where subjects’ aggregate demand for each treatment is balanced with its supply or capacity (assumed to be exogenously given). EXAM, finally, requires every subject to be assigned to every treatment with a positive probability.

This virtual-market construction gives EXAM nice welfare and incentive properties. EXAM is Pareto optimal, in that no other design makes every subject better off in terms of expected predicted effects of and WTP for the assigned treatment. EXAM also allows the experimenter to elicit WTP in an asymptotically incentive-compatible way. That is, when the experimenter asks subjects to self-report their WTP for each treatment to be used by EXAM, every subject’s optimal choice is to report her true WTP, at least for large experiments.

Importantly, EXAM also allows the experimenter to estimate the same treatment effects as standard RCTs do. Intuitively, this is because EXAM is an experiment stratified on observable predicted effects and WTP, in which the experimenter observes each subject’s assignment probabilities (propensity scores). As a result, EXAM’s treatment assignment is random (independent from anything else), conditional on the observables. The conditionally independent treatment assignment allows the experimenter to unbiasedly estimate the average treatment effects (ATEs) conditional on observables. By integrating such conditional effects, EXAM can unbiasedly estimate the (unconditional) ATE and other effects, as is the case with any stratified experiment (14).

Power is also a concern. I characterize the statistical efficiency in EXAM’s ATE estimation. In general, the standard error comparison of EXAM and a typical RCT is ambiguous, as is often the case with comparing RCTs and stratified experiments. This motivates an empirical comparison of the two designs to confirm and quantify the power, unbiasedness, welfare, and incentive properties.

I apply EXAM to data from a water-cleaning experiment in Kenya (15). Compared to RCTs, EXAM turns out to substantially improve participating households’ predicted welfare. Here, welfare is measured by predicted effects of clean water on child diarrhea and revealed WTP for water cleaning. EXAM is also found to almost always incentivize subjects to report their true WTP. Finally, EXAM’s data produce treatment-effect estimates and standard errors similar to those from RCTs. EXAM, therefore, produces information that is as valuable for the outside society as that from RCTs.

Taken together, EXAM sheds light on a way economic thinking can “facilitate the advancement and use of complex adaptive (…) and other novel clinical trial designs,” a performance goal by the US Food and Drug Administration for 2018–2022. Experimental design is a potentially life-saving application of economic market design (16, 17).

EXAM

An experimental design problem consists of:

    Experimental subjects $i = 1, \dots, n$.

    Experimental treatments $t_0, t_1, \dots, t_m$, where $t_0$ is a control.

    Each subject $i$'s preference or WTP $w_{it} \in \mathbb{R}$ for treatment $t$, where $w_{it} \ge w_{it'}$ means subject $i$ weakly prefers treatment $t$ over $t'$.

    Each treatment $t$'s predicted treatment effect $e_{ti} \in \mathbb{R}$ for subject $i$, where $e_{ti} \ge e_{t'i}$ means treatment $t$ is predicted to have a weakly better effect than $t'$ for subject $i$. When multiple outcomes matter, $e_{ti}$ can be set to the predicted effect on a known function of these outcomes.

I assume $e_{ti}$ and $w_{it}$ to be deterministic for simplicity. Without loss of generality, I normalize $e_{ti}$ and $w_{it}$ by assuming $e_{t_0 i} = w_{i t_0} = 0$ for every subject $i$.

An experimental design specifies treatment-assignment probabilities $(p_{it})$, where $p_{it}$ is the probability that subject $i$ is assigned to treatment $t$ under the experimental design. The benchmark design is the standard RCT, which assigns each subject $i$ to each treatment $t$ with the impersonal treatment-assignment probability $p_t^{RCT}$, assumed to be written as $p_t^{RCT} = c_t / n$ for some natural number $c_t < n$. $c_t$ is the quasi-capacity or supply of treatment $t$. The vast majority of clinical trials use RCT.

This paper investigates welfare enhancement with a design that I call EXAM. The steps for implementing EXAM are as follows.

    1) Obtain predicted effects $e_{ti}$ if possible and relevant, as detailed in SI Appendix, section 1.B.

    2) Obtain WTP $w_{it}$ if possible and relevant, as described in SI Appendix, section 1.B.

    3) Apply the following definition to the data from steps 1 and 2, producing EXAM's assignment probabilities $p_{it}^*(\epsilon)$.

Definition 1 (EXAM):

In the experimenter's computer, distribute any common artificial budget $b > 0$ to every subject. Find any price-discriminated competitive equilibrium, i.e., any treatment-assignment probabilities $(p_{it}^*)$ and their prices $\pi_t^e$ with the following properties:

    Effectiveness-discriminated treatment pricing: There exist $\alpha < 0$ and $\beta_t \in \mathbb{R}$ for each treatment $t$ such that the price of a unit of assignment probability to $t$ for subjects with $e_{ti} = e \in \mathbb{R}$ is

    $\pi_t^e = \alpha e + \beta_t.$

    This price is decreasing in the treatment-effect prediction $e$, so that each treatment is assigned with a higher probability to subjects who benefit more from it.

    Subject utility maximization: For each subject $i$,

    $(p_{it}^*)_t \in \arg\max_{p_i \in P} \sum_t p_{it} w_{it} \quad \text{s.t.} \quad \sum_t p_{it} \pi_t^{e_{ti}} \le b,$

    where $p_i \equiv (p_{it})_t$ and $P \equiv \{p_i \in \mathbb{R}^{m+1} \mid \sum_{t=t_0}^{t_m} p_{it} = 1 \text{ and } |p_{it}| \le \bar{p}\}$ for a large enough number $\bar{p}$. $\pi_t^{e_{ti}}$ is the price of a unit of the assignment probability to treatment $t$ for subject $i$. EXAM breaks ties or indifferences so that every subject $i$'s $p_i^*$ solves the above problem with the minimum expenditure $\sum_t p_{it}^* \pi_t^{e_{ti}}$, while $(p_{it}^*)_t = (p_{jt}^*)_t$ for any subjects $i$ and $j$ with $w_i = w_j$ and $e_i = e_j$, where $w_i \equiv (w_{it_1}, \dots, w_{it_m})$ and $e_i \equiv (e_{t_1 i}, \dots, e_{t_m i})$.

    Capacity constraints: $\sum_i p_{it}^* \le c_t$ for every treatment $t = t_1, \dots, t_m$, and $\sum_i p_{it}^* < c_t$ only if $\pi_t^{e_{ti}} \le 0$ for every $i$.

Let $\epsilon$ be a nonnegative number such that the experimenter would like the assignment probabilities to always lie within $[\epsilon, 1 - \epsilon]$. Take any $\epsilon \in [0, \bar{\epsilon}]$ as given, where $\bar{\epsilon} \equiv \min_t p_t^{RCT}$ is the largest possible value of $\epsilon$. I define EXAM's treatment-assignment probabilities as

$p_{it}^*(\epsilon) \equiv (1 - q)\, p_{it}^* + q\, p_t^{RCT},$

where $q \equiv \inf\{q' \in [0, 1] \mid (1 - q')\, p_{it}^* + q'\, p_t^{RCT} \in [\epsilon, 1 - \epsilon] \text{ for all } i \text{ and } t\}$.

A few remarks are in order. First, among the above steps, subjects only need to report their WTP $w_{it}$. The remaining parts are run by the experimenter. Second, it is possible to let different subjects have different budgets. I make $b$ the same for every subject, so that EXAM has the equality property that no subject strictly prefers anybody else's treatment-assignment probabilities over her own. Finally, $\alpha$, $\beta_t$, and the resulting $p_{it}^*(\epsilon)$ are the equilibrium objects to be found by the experimenter so as to satisfy the equilibrium constraints. SI Appendix, section 3.B provides an algorithm that finds equilibrium values of $\alpha$ and $\beta_t$ by adjusting them to reduce excess demand or supply.
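To make this price-adjustment idea concrete, here is a minimal Python sketch of the equilibrium computation together with the $\epsilon$-mixing step from Definition 1. It is an illustration under simplifying assumptions (a fixed $\alpha$, no minimum-expenditure tie-breaking, a naive tatonnement on the $\beta_t$'s, and a budget large enough for feasibility), not the algorithm in SI Appendix, section 3.B; all function names are mine.

```python
import numpy as np
from scipy.optimize import linprog

def subject_demand(w_i, prices_i, budget):
    """Bundle of assignment probabilities maximizing subject i's expected
    WTP, subject to the budget constraint and the probability simplex."""
    m1 = len(w_i)  # number of treatments, including the control t0
    res = linprog(
        c=-w_i,                                  # maximize sum_t p_t * w_it
        A_ub=prices_i[None, :], b_ub=[budget],   # sum_t p_t * pi_t <= b
        A_eq=np.ones((1, m1)), b_eq=[1.0],       # probabilities sum to 1
        bounds=[(0.0, 1.0)] * m1,
    )
    return res.x

def exam_probabilities(w, e, capacity, budget=100.0, alpha=-1.0,
                       step=0.05, iters=500):
    """Tatonnement: raise the price beta_t of over-demanded treatments and
    lower it for under-demanded ones. Column 0 is the control, whose price
    stays at beta_0 = 0 (consistent with the normalization above)."""
    n, m1 = w.shape
    beta = np.zeros(m1)
    for _ in range(iters):
        prices = alpha * e + beta                # pi_t^{e_ti} = alpha*e_ti + beta_t
        p = np.vstack([subject_demand(w[i], prices[i], budget)
                       for i in range(n)])
        beta[1:] += step * (p.sum(axis=0) - capacity)[1:]
    return p

def mix_with_rct(p_star, p_rct, eps):
    """p*(eps): the minimal mixing with RCT probabilities that keeps every
    assignment probability inside [eps, 1 - eps]."""
    r = np.broadcast_to(p_rct, p_star.shape)
    q = 0.0
    low, high = p_star < eps, p_star > 1.0 - eps
    if low.any():
        q = max(q, float(((eps - p_star[low]) / (r[low] - p_star[low])).max()))
    if high.any():
        q = max(q, float(((p_star[high] - (1.0 - eps))
                          / (p_star[high] - r[high])).max()))
    return (1.0 - q) * p_star + q * r
```

Here `w` and `e` are $n \times (m+1)$ arrays with columns ordered $(t_0, t_1, \dots, t_m)$ and $w_{it_0} = e_{t_0 i} = 0$ as in the normalization above.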

Welfare and Incentive.

EXAM is an enrichment of RCT, as formalized below.

Proposition 1 (EXAM Nests RCT).

Suppose that WTP and predicted effects are unknown or irrelevant, so that $w_{it} = w_{jt} > 0$ and $e_{ti} = e_{tj}$ for all subjects $i$ and $j$ and every treatment $t$. Then EXAM reduces to RCT with simple randomization; i.e., for every $\epsilon \in [0, \bar{\epsilon}]$, subject $i$, and treatment $t$, I have

$p_{it}^*(\epsilon) = p_t^{RCT}.$
However, in cases where the experimenter is concerned about WTP or predicted effects, EXAM differs from RCT and is welfare-optimal.

Proposition 2 (Existence and Welfare).

There exists $(p_{it}^*)$ that satisfies the conditions in Definition 1. For any such $(p_{it}^*)$ and any $\epsilon \in [0, \bar{\epsilon}]$, no other experimental design $(p_{it}) \in P^n$ has the following better welfare property: $p_{it} \in [\epsilon, 1 - \epsilon]$ for all subjects $i$ and treatments $t$; $\sum_i p_{it} \le c_t$ for all $t = t_1, \dots, t_m$; and

$\sum_t p_{it} w_{it} \ge \sum_t p_{it}^*(\epsilon)\, w_{it} \quad \text{and} \quad \sum_t p_{it} e_{ti} \ge \sum_t p_{it}^*(\epsilon)\, e_{ti}$

for all $i$, with at least one strict inequality.

Proposition 2 says that no other experimental design ex ante Pareto dominates EXAM in terms of the expected WTP for and predicted effect of assigned treatment (while satisfying the random-assignment and capacity constraints). In contrast, RCT fails to satisfy the welfare property, as it ignores WTP and predicted effects. I empirically quantify the welfare gap between RCTs and EXAM below.

Proposition 2 takes WTP $w_{it}$ as given and assumes that it represents true WTP. In practice, the experimenter often needs to elicit the WTP information $w_{it}$ from subjects, raising an incentive-compatibility concern. Unfortunately, no experimental design satisfies both the welfare property in Proposition 2 and exact incentive compatibility for general problems (12). This compels me to investigate approximate incentive compatibility in large experimental design problems. Only for this section, consider a sequence of experimental design problems $(\{1, \dots, n\}, \{t_0, t_1, \dots, t_m\}, (c_t^n))_{n \in \mathbb{N}}$ indexed by the number of subjects, $n$. Let $\epsilon_n \in [0, \bar{\epsilon}_n)$ (where $\bar{\epsilon}_n$ is $\bar{\epsilon}$ for the $n$-th problem) be the value of the bound parameter $\epsilon$ the experimenter picks for the $n$-th problem in the sequence. The set of treatments $\{t_0, t_1, \dots, t_m\}$ is fixed, but everything else may change as $n$ increases.

To investigate the incentive structure in EXAM, imagine that subjects report their WTP to EXAM, which then uses the reported WTP to compute treatment-assignment probabilities. For the $n$-th problem in the sequence, let $p_i^{*n}(w_i, e_i, w_{-i}, e_{-i}; \epsilon_n)$ be EXAM's treatment-assignment probability vector for subject $i$ when subjects report WTP $(w_i, w_{-i})$ and predicted effects are $(e_i, e_{-i})$, where $w_{-i} \equiv (w_j)_{j \ne i}$ and $e_{-i} \equiv (e_j)_{j \ne i}$. I extend this notation to the case where other subjects' WTP reports and predicted effects are random:

$p_i^{*n}(w_i, e_i, F; \epsilon_n) \equiv \sum_{(w_{-i}, e_{-i}) \in (W \times E)^{n-1}} p_i^{*n}(w_i, e_i, w_{-i}, e_{-i}; \epsilon_n) \times \Pr\{(w_{-i}, e_{-i}) \overset{iid}{\sim} F\}.$

Here, $\Pr\{(w_{-i}, e_{-i}) \overset{iid}{\sim} F\}$ denotes the probability that the vector $(w_{-i}, e_{-i}) \equiv (w_j, e_j)_{j \ne i}$ is realized from $n - 1$ independent and identically distributed (iid) draws $(w_j, e_j)$ from the distribution $F \in \Delta(W \times E)$. $\Delta(W \times E)$ is the set of full-support distributions over the finite WTP space $W$ and the predicted-effect space $E$. The iid assumption is based on the idea that there are many subjects, so they do not distinguish other subjects ex ante.
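One concrete reading of this expectation: $p_i^{*n}(w_i, e_i, F; \epsilon_n)$ can be approximated by Monte Carlo, drawing the other subjects' reports iid from $F$ and averaging subject $i$'s EXAM probabilities across draws. A hypothetical sketch, reusing the illustrative `exam_probabilities` solver from above and an assumed sampler `sample_F` (the $\epsilon$-mixing step is omitted for brevity):

```python
import numpy as np

def interim_probability(w_i, e_i, sample_F, capacity, n, sims=200):
    """Monte Carlo analog of p_i^{*n}(w_i, e_i, F; eps_n): average subject
    i's EXAM assignment probabilities over iid draws of the others' reports."""
    acc = np.zeros(len(w_i))
    for _ in range(sims):
        w_others, e_others = sample_F(n - 1)  # (n-1) x (m+1) arrays, iid from F
        w = np.vstack([w_i, w_others])        # subject i placed in row 0
        e = np.vstack([e_i, e_others])
        acc += exam_probabilities(w, e, capacity)[0]
    return acc / sims
```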

EXAM approximately incentivizes every subject to report her true WTP, at least for large enough experimental design problems.

Proposition 3 (Incentive).

For any sequence of experimental design problems with any $\epsilon_n \in [0, \bar{\epsilon}_n)$, any $F \in \Delta(W \times E)$, and any $\delta > 0$, there exists $n_0$ such that, for any $n \ge n_0$, subject $i$, predicted effect $e_i$, and true and manipulated WTP values $w_i$ and $w_i'$, I have

$\sum_t p_{it}^{*n}(w_i, e_i, F; \epsilon_n)\, w_{it} \ge \sum_t p_{it}^{*n}(w_i', e_i, F; \epsilon_n)\, w_{it} - \delta.$
The experimenter using EXAM can, therefore, ask subjects to report their true WTP without any deception. I also provide empirical support for incentive compatibility below.

Information

Despite its welfare merit, EXAM also lets the experimenter estimate treatment effects as unbiasedly as standard RCTs do. To spell this out, I switch back to any given finite problem with fixed WTP and predicted effects.

Suppose the experimenter is interested in the causal effect of each treatment on an outcome $Y_i$. Let $Y_i(t)$ denote subject $i$'s potential outcome that would be observed if subject $i$ received treatment $t$. Let $D_{it}$ be the indicator that subject $i$ is ex post assigned to treatment $t$. The observed outcome is written as $Y_i = \sum_t D_{it} Y_i(t)$. While $Y_i(t)$ is assumed to be fixed, $D_{it}$ and $Y_i$ are random variables, the distributions of which depend on the experimenter's choice of an experimental design. Let $Y \equiv (Y_i)$, $D_i \equiv (D_{it})_t$, and $D \equiv (D_i)$.

The experimenter would like to learn any parameter of interest $\theta$ of the distribution of the potential outcomes $Y_i(t)$, many of which are unobservable. Formally, $\theta$ is any mapping $\theta: \mathbb{R}^{n \times (m+1)} \to \mathbb{R}$ that maps each possible value of $(Y_i(t))$ into the corresponding value of the parameter. For example, $\theta$ may be the ATE ($ATE_t$) of treatment $t$ over control $t_0$, $\sum_{i=1}^n (Y_i(t) - Y_i(t_0)) / n$. The experimenter estimates $\theta$ with an estimator $\hat{\theta}(Y, D)$, a function only of observed outcomes and treatment assignments. I say parameter $\theta$ is estimable without bias with experimental design $p \equiv (p_{it})$ if there exists a "simple" estimator $\hat{\theta}(Y, D)$ (in the sense defined in SI Appendix, section 1.C) such that $E(\hat{\theta}(Y, D) \mid (p_{it})) = \theta$, where $E(\cdot \mid (p_{it}))$ is the expectation with respect to the distribution of $D_{it}$ induced by experimental design $(p_{it})$.

EXAM turns out to be as informative as RCT in terms of the set of parameters estimable without bias with each experimental design.

Proposition 4 (Estimability without Bias).

Under regularity conditions in SI Appendix, if a parameter $\theta$ is estimable without bias with RCT $p_t^{RCT}$, then $\theta$ is also estimable without bias with EXAM $p_{it}^*(\epsilon)$ for any $\epsilon > 0$.

Many key parameters, such as the ATE and the treatment effect on the treated, are known to be estimable without bias with RCT and a simple estimator. Proposition 4 implies that these parameters are also estimable without bias with EXAM.

Corollary 1.

The ATE and the treatment effect on the treated are estimable without bias with EXAM.

I use the ATE to illustrate the intuition for and implementation of Proposition 4 and Corollary 1. Why is the ATE estimable without bias with EXAM? The reason is that, once constructed, EXAM is a particular experiment stratified on observable WTP and predicted effects. EXAM, therefore, produces treatment assignment that is independent from (unconfounded by) potential outcomes conditional on predicted effects and WTP, which are observable to the experimenter:

$(Y_i(t))_t \perp D_i \mid (e_i, w_i).$ [1]

Conditional independence (1) implies that the same conditional independence holds conditional on the assignment probability $p_i^*(\epsilon) \equiv (p_{it}^*(\epsilon))_t$ (ref. 14, section 12.3), which is, again, known to the econometrician:

$(Y_i(t))_t \perp D_i \mid p_i^*(\epsilon).$ [2]

This conditionally independent treatment assignment allows the experimenter to unbiasedly estimate the conditional ATE of each $t$ over $t_0$ conditional on the observable propensity scores $p_i^*(\epsilon)$,

$\frac{\sum_{i=1}^n 1\{p_i^*(\epsilon) = p\}\,(Y_i(t) - Y_i(t_0))}{\sum_{i=1}^n 1\{p_i^*(\epsilon) = p\}},$

which I denote by $CATE_{pt}$. By summing up conditional effects, the experimenter can back out the (unconditional) ATE, the single most important causal object identified and estimated by RCTs. That is, with weights $\delta_p \equiv \sum_{i=1}^n 1\{p_i^*(\epsilon) = p\} / n$, I use the $CATE_{pt}$'s to get the ATE as follows: $\sum_p \delta_p\, CATE_{pt} = ATE_t$. The above estimability argument motivates a strategy to estimate the ATE with EXAM's data. As a warm-up, focus on $\{i \mid p_i^*(\epsilon) = p\}$, the subpopulation of subjects with propensity vector $p$, and consider this regression on the subpopulation:

$Y_i = \alpha_p + \sum_{t=t_1}^{t_m} \beta_{pt} D_{it} + \epsilon_i.$

By conditional independence (2), the ordinary least-squares (OLS) estimate $\hat{\beta}_{pt}$ from this regression is unbiased for $CATE_{pt}$ for each treatment $t \ne t_0$. I then aggregate the resulting estimates $\hat{\beta}_{pt}$ into $\sum_p \delta_p \hat{\beta}_{pt}$, which I denote by $\hat{\beta}_t^*$. This estimator $\hat{\beta}_t^*$ unbiasedly estimates the ATE, with its variance available in analytical form, as shown in SI Appendix.
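The within-stratum regression is fully saturated in treatment dummies, so its OLS coefficients are within-stratum differences in means, and the aggregation step is a $\delta_p$-weighted average. A minimal sketch for a single treatment, with illustrative names (`strata` is a per-subject stratum label, e.g., the rounded EXAM propensity score):

```python
import numpy as np

def stratified_ate(y, d, strata):
    """beta_hat_t^*: delta_p-weighted average of within-stratum
    difference-in-means estimates of CATE_pt (binary-treatment case)."""
    est = 0.0
    for s in np.unique(strata):
        mask = strata == s
        treated, control = y[mask & (d == 1)], y[mask & (d == 0)]
        if treated.size == 0 or control.size == 0:
            continue  # no contrast in this stratum; eps > 0 makes this unlikely
        est += mask.mean() * (treated.mean() - control.mean())  # delta_p * CATE_p
    return est
```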

Alternatively, empirical researchers may prefer a single regression:

$Y_i = a + \sum_{t=t_1}^{t_m} b_t D_{it} + \sum_{t=t_1}^{t_m} c_t\, p_{it}^*(\epsilon) + e_i,$ [3]

producing an alternative estimator $\hat{b}_t^*$. As verified in SI Appendix, $\hat{b}_t^*$ is an unbiased estimator of a differently weighted treatment effect:

$E(\hat{b}_t^* \mid p^*(\epsilon)) = \frac{\sum_p \lambda_{pt}\, CATE_{pt}}{\sum_p \lambda_{pt}} \quad \text{with weights} \quad \lambda_{pt} \equiv \delta_p\, p_t (1 - p_t).$ [4]

Estimators like $\hat{b}_t^*$ and $\hat{\beta}_t^*$ allow the experimenter to unbiasedly estimate key causal effects with EXAM.
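For the binary-treatment case, Eq. 3 reduces to an OLS regression of the outcome on a constant, the treatment dummy, and the propensity score. A minimal numpy sketch (a real analysis would use a regression package to obtain robust standard errors):

```python
import numpy as np

def single_regression_ate(y, d, pscore):
    """b_hat^*: OLS coefficient on D_i, controlling for p*_it(eps) as in Eq. 3."""
    X = np.column_stack([np.ones_like(y), d, pscore])  # [const, D_i, p*_i]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]  # coefficient on the treatment dummy
```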

Empirical Application

My empirical test bed for EXAM is an application to a spring-protection experiment in Kenya. Waterborne diseases, especially diarrhea, remain the second leading cause of death among children, accounting for about 17% of deaths of children under age five (about 1.5 million deaths each year). The only quantitative United Nations Millennium Development Goal is stated in terms of "the proportion of the population without sustainable access to safe drinking water and basic sanitation"; protected springs are one such safe water source. Yet there is controversy about the health impacts of source-water quality. Experts argue that improving source-water quality may have only limited effects since, for example, water is likely to be recontaminated in transport and storage.

This controversy motivated researchers to analyze randomized spring protection in Kenya (15). This experiment randomly selected springs to receive protection from the universe of 200 unprotected springs. The experimenter selected and followed a representative sample of about 1,500 households that regularly used some of the 200 springs before the experiment; these households are the experimental subjects. The researchers found that diarrhea among children in treatment households fell by about a quarter of the baseline level. I call this real experiment the “original experiment” and distinguish it from EXAM and RCT as formal concepts in my model.

I combine the original experimental data with my methodological framework to empirically evaluate EXAM. In the language and notation of my model, the experimental subjects are the households in the original experiment's sample. The protection of the spring each household uses at baseline is a single treatment $t_1$, while no protection is the control $t_0$. Each household $i$'s WTP for better water access $t_1$ is denoted by $w_{it_1}$, which I estimate in SI Appendix, section 3.A and Table S2. I also estimate the heterogeneous treatment effect $e_{t_1 i}$ of spring protection $t_1$ on household $i$'s child diarrhea outcome (SI Appendix, Table S1 and Fig. S1).

Given the estimates, imagine somebody is planning a new experiment to further investigate the same spring-protection treatment. What experimental design should she use? Specifically, which is better, RCT or EXAM? My approach is to use the estimated WTP $\hat{w}_{it_1}$ and predicted effects $\hat{e}_{t_1 i}$ to simulate EXAM and compare it with RCT in terms of welfare, information, and incentive properties. I fix the set of subjects and treatments as in the original experiment. That is, there are 1,540 households as subjects, each to be assigned either to the water-source-protection treatment $t_1$ or the control $t_0$. I set the treatment capacity $c_{t_1}$ to the number of households assigned to treatment $t_1$ in the original experiment, set the bound parameter $\epsilon$ to 0.2, and fix the predicted effects $e_{t_1 i}$ at their point estimates $\hat{e}_{t_1 i}$.

I start by evaluating EXAM's welfare performance. I use EXAM's treatment-assignment probabilities $p_{it_1}^*(\epsilon)$ to calculate two welfare measures for each household $i$:

$w_i^* \equiv \sum_t p_{it}^*(\epsilon)\, w_{it} \quad \text{and} \quad e_i^* \equiv \sum_t p_{it}^*(\epsilon)\, e_{ti}.$

$w_i^*$ and $e_i^*$ are empirical analogs of the two welfare measures in my theoretical welfare analysis (Proposition 2).
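Given EXAM's probability matrix, both welfare measures are one-line computations; a sketch using the arrays from the earlier code (`p_exam` denotes the $n \times (m+1)$ matrix of $p_{it}^*(\epsilon)$):

```python
# w, e: n x (m+1) arrays of WTP and predicted effects from the earlier sketch.
w_star = (p_exam * w).sum(axis=1)  # expected WTP of assigned treatment, per subject
e_star = (p_exam * e).sum(axis=1)  # expected predicted effect, per subject
```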

I find that EXAM improves on RCT in terms of the welfare measures $w_i^*$ and $e_i^*$, a result reported in Fig. 1. The mean average WTP $w_i^*$ for assigned treatments is about 89% higher under EXAM than under RCT, an improvement equal to about 37% of the average WTP for the treatment. Similarly, EXAM improves the mean of $e_i^*$ by about 0.8 percentage points of additional predicted reduction in child diarrhea (equivalently, a 42% improvement relative to RCT's level). This predicted-effect benefit amounts to about 17% of the ATE of spring protection found by the original experiment.

Fig. 1.

EXAM vs. RCT: Welfare. This figure shows the distribution of average subject welfare over 1,000 bootstrap simulations under each experimental design. A shows the average WTP for assigned treatments $w_i^*$, where WTP is measured by the equivalent number of workdays, as described in SI Appendix, section 3.A. B shows the average predicted effects of assigned treatments $e_i^*$. A dotted line indicates the distribution of each welfare measure for RCT, while a solid line indicates that for EXAM. Each vertical line represents the mean. Kolmogorov–Smirnov tests find the EXAM and RCT distributions to be significantly different for both $w_i^*$ and $e_i^*$.

Data from EXAM also allow me to reach essentially the same conclusions about treatment effects as RCT. To see this, I augment the above counterfactual simulation with ATE estimation as follows. I first simulate $w_{it_1}$ and run EXAM to get treatment-assignment probabilities $p_{it}^*(\epsilon)$. I use $p_{it}^*(\epsilon)$ to draw a final deterministic treatment assignment, denoted by a binary indicator $D_i$ for whether $i$ is ex post assigned to $t_1$. I then simulate the counterfactual or predicted outcome $Y_i$ under $D_i$ by simulating the treatment-effect model estimated in SI Appendix, section 3.A. Finally, I use the simulated $Y_i$ and $D_i$ to estimate treatment effects with $\hat{b}^*$ from this OLS regression:

$Y_i = a + b D_i + c\, p_{it_1}^*(\epsilon) + e_i,$ [5]

where I control for the propensity score $p_{it_1}^*(\epsilon)$ to make the treatment assignment $D_i$ conditionally random. This regression is a stripped-down version of the regression strategy [3]. I also implement the other propensity-score-weighting estimator $\hat{\beta}^*$. The procedure for RCT is analogous, except that the treatment-assignment probability is fixed at $p_t^{RCT}$.

Program evaluation with EXAM is as unbiased and precise as that with RCT. Fig. 2A and SI Appendix, Fig. S2 plot the distribution of the resulting treatment-effect estimates $\hat{b}^*$ and $\hat{\beta}^*$ over simulations. In line with Propositions 4 and 5 in SI Appendix, the means of $\hat{b}^*$ and $\hat{\beta}^*$ for EXAM are indistinguishable from those under RCT. Both designs successfully recover ref. 15's ATE estimate (a 4.5-percentage-point reduction in diarrhea; replicated in SI Appendix, Table S1).

Fig. 2.

EXAM vs. RCT: Treatment-effect estimates. This figure compares EXAM's and RCT's causal-inference performance by showing the distribution of ATE estimates under each design. A shows the distribution of treatment-effect estimates $\hat{b}^*$, and B shows robust $P$ values for $\hat{b}^*$. In A, blue bins indicate ATE estimates for RCT, while transparent bins with red outlines indicate those for EXAM; the solid vertical line indicates the mean for EXAM, while the dashed vertical line indicates that for RCT. In B, the mean is represented by a solid line and the median by a dashed line. The $P$ values are based on robust standard errors. Blue bins indicate $P$ values for RCT, while transparent bins with red outlines indicate those for EXAM; the solid vertical line indicates the median for EXAM, while the dashed vertical line indicates that for RCT.

Perhaps more importantly, the distributions of $\hat{b}^*$ and $\hat{\beta}^*$ for EXAM have SDs similar to those for RCT. This means that the two experimental designs produce similar exact, finite-sample standard errors for the estimates $\hat{b}^*$ and $\hat{\beta}^*$. Variations on this observation appear in Fig. 2B, which shows the distribution of robust $P$ values for the estimates $\hat{b}^*$. SI Appendix, Fig. S3 additionally shows $P$ values based on exact, nonrobust, and finite-population causal standard errors, where the exact standard error means the SD of the distribution of $\hat{b}^*$ in Fig. 2. RCT produces slightly smaller $P$ values than EXAM, but the median $P$ value is about 0.03 for RCT and about 0.04 for EXAM. Both EXAM and RCT, therefore, detect a significant ATE in a majority of cases. Overall, EXAM appears to succeed in its informational mission of eliminating selection bias and recovering the ATE precisely enough.

EXAM's WTP benefits can be regarded as welfare-relevant only if EXAM gives subjects incentives to reveal their true WTP. I conclude my empirical analysis with an investigation of the incentive compatibility of EXAM. I repeat the following procedure many times. As before, I simulate $w_{it_1}$ and run EXAM to get treatment-assignment probabilities $p_{it}^*(\epsilon)$. I then randomly pick one subject $j$ as a WTP manipulator and one potential WTP manipulation $w_{jt_1}'$ by $j$. I choose the manipulator $j$ uniformly at random and draw the manipulation $w_{jt_1}'$ from $N(w_{jt_1}, 100)$, where $w_{jt_1}$ is $j$'s true WTP. (SI Appendix reports similar results under different scenarios, covering both overreporting and underreporting of different magnitudes.) I run EXAM on the simulated data but with the WTP manipulation $w_{jt_1}'$ to get treatment-assignment probabilities $p_{it}'(\epsilon)$. I finally compute the gain from the manipulation $w_{jt_1}'$:

$\Delta w \equiv \sum_t p_{jt}'(\epsilon)\, w_{jt} - \sum_t p_{jt}^*(\epsilon)\, w_{jt}.$

EXAM is found to give subjects little incentive for WTP misreporting, empirically supporting Proposition 3. Fig. 3 and SI Appendix, Fig. S4 show this by drawing the distribution of $\Delta w$ over simulations and households. Across all scenarios, the WTP gain $\Delta w$ from misreporting is mostly negative and well below zero on average. SI Appendix, Table S3 shows that even the most profitable manipulations in Fig. 3 lead to normalized gains $\Delta w / w_{jt_1}$ smaller than 0.021. This result suggests that there are unlikely to be manipulations producing large gains. Overall, in this empirical setting, EXAM provides subjects with stronger average incentives for truthful WTP reporting than RCT does (under RCT, assignment probabilities do not depend on reports, so subjects are indifferent among all possible WTP reports). EXAM may, therefore, be better at eliciting reliable WTP data.
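A single draw of this manipulation check might look as follows, reusing the illustrative solver and the `n`, `w`, `e`, and `capacity` arrays from the earlier sketches (the noise scale and names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
j = int(rng.integers(n))                 # uniformly random manipulator
w_fake = w.copy()
w_fake[j, 1] += rng.normal(0.0, 10.0)    # misreport ~ N(true WTP, variance 100);
                                         # column 1 is treatment t1, column 0 control

p_true = exam_probabilities(w, e, capacity)       # honest reports
p_fake = exam_probabilities(w_fake, e, capacity)  # with j's misreport
delta_w = (p_fake[j] - p_true[j]) @ w[j]          # gain evaluated at j's TRUE WTP
```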

Fig. 3.

EXAM vs. RCT: Incentive (WTP manipulation $\sim$ true WTP $+ N(0, 100)$). This figure shows the histogram of true WTP gains from potential WTP misreporting to EXAM, quantifying the incentive compatibility of EXAM. Each solid vertical line represents the mean WTP gain from potential WTP misreporting to EXAM. The dashed vertical line is for RCT, where the true WTP gain from any WTP misreport is zero.

If the experimenter were interested only in the most precise estimation of treatment effects, a possible experimental design would be a stratified RCT that stratifies on the same covariates as EXAM. Another possible alternative is the adaptive experimental design of Hahn et al. (11).

Such designs may dominate EXAM in terms of statistical efficiency in ATE estimation, though they are inferior to EXAM in terms of welfare and incentive properties. In this sense, there is a tradeoff between information and welfare/incentives.

Conclusion

Motivated by the high-stakes nature of RCTs, I propose a data-driven experimental design dubbed EXAM. EXAM is a particular stratified experiment derived from a hybrid experimental-design-as-market-design problem: maximize participants' welfare subject to the constraint that the experimenter must obtain as much information, and as good incentives, as standard RCTs provide (Propositions 2–4). These properties are verified and quantified in an empirical application in which I simulate my design on a water-source-protection experiment. The body of evidence suggests that EXAM improves subject well-being at little informational or incentive cost.

The author declares no competing interest.
This article is a PNAS Direct Submission. P.P. is a Guest Editor invited by the Editorial Board.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2008740118/-/DCSupplemental.

Data Availability.

Stata and CSV data files have been deposited in GitHub (https://github.com/aneesha94/EXaM-Public-folder).

References

1. R. Stupp et al., Effects of radiotherapy with concomitant and adjuvant temozolomide versus radiotherapy alone on survival in glioblastoma in a randomised phase III study: 5-year analysis of the EORTC-NCIC trial. Lancet Oncol. 10, 459–466 (2009).

2. M. Angell, Patients' preferences in randomized clinical trials. N. Engl. J. Med. 310, 1385–1387 (1984).

3. M. Zelen, A new design for randomized clinical trials. N. Engl. J. Med. 300, 1242–1245 (1979).

4. J. Angrist, G. Imbens, Sources of identifying information in evaluation models. https://www.nber.org/system/files/working_papers/t0117/t0117.pdf.

5. S. Chassang, G. Padró i Miquel, E. Snowberg, Selective trials: A principal-agent approach to randomized controlled experiments. Am. Econ. Rev. 102, 1279–1309 (2012).

6. M. Zelen, Play the winner rule and the controlled clinical trial. J. Am. Stat. Assoc. 64, 131–146 (1969).

7. L. Wei, S. Durham, The randomized play-the-winner rule in medical trials. J. Am. Stat. Assoc. 73, 840–843 (1978).

8. F. Hu, W. F. Rosenberger, Optimality, variability, power: Evaluating response-adaptive randomization procedures for treatment comparisons. J. Am. Stat. Assoc. 98, 671–678 (2003).

9. M. Kasy, A. Sautmann, Adaptive treatment assignment in experiments for policy choice. https://maxkasy.github.io/home/files/papers/adaptiveexperimentspolicy.pdf.

10. L. M. Friedman, C. Furberg, D. L. DeMets, D. M. Reboussin, C. B. Granger, Fundamentals of Clinical Trials (Springer, Cham, Switzerland, 1998), vol. 3.

11. J. Hahn, K. Hirano, D. Karlan, Adaptive experimental design using the propensity score. J. Bus. Econ. Stat. 29, 96–108 (2011).

12. A. Hylland, R. J. Zeckhauser, The efficient allocation of individuals to positions. J. Polit. Econ. 87, 293–314 (1979).

13. E. Budish, Y. K. Che, F. Kojima, P. Milgrom, Designing random allocation mechanisms: Theory and applications. Am. Econ. Rev. 103, 585–623 (2013).

14. G. W. Imbens, D. B. Rubin, Causal Inference in Statistics, Social, and Biomedical Sciences (Cambridge University Press, Cambridge, UK, 2015).

15. M. Kremer, J. Leino, E. Miguel, A. P. Zwane, Spring cleaning: Rural water impacts, valuation, and property rights institutions. Q. J. Econ. 126, 145–205 (2011).

16. P. R. Milgrom, Putting Auction Theory to Work (Cambridge University Press, Cambridge, UK, 2004).

17. A. E. Roth, Who Gets What and Why: The New Economics of Matchmaking and Market Design (Houghton Mifflin Harcourt, Boston, MA, 2015).