Standards of Evidence
Evidence: Something that furnishes proof or tends to furnish proof (Webster)
Confusion surrounds the term Evidence-Based Program because program registries define it in multiple ways and many people misunderstand what is minimally required to provide sufficient proof that a program is effective. Blueprints for Healthy Youth Development has created an Evidence Continuum graphic to clarify the level of evidence required for a program to be called an Evidence-Based Program.
Evidence in support of the effectiveness of a program, practice, or policy falls on a continuum ranging from very low to very high levels of confidence. The more rigorous the research design of evaluations and the greater the number of positive evaluations, the greater confidence users can have that the intervention will reach its goal of helping youth.
Evidence with the lowest level of confidence is "opinion informed." This includes information such as anecdotes, testimonials, and personal experiences obtained from a few individuals. A satisfaction survey is only a step above, as it still gathers opinions about a program, even if from a larger sample. This type of evidence, while useful in the early stages of developing a program, fails to examine targeted youth outcomes in a systematic way. It provides no real "proof" of effectiveness and ranks "very low" on the confidence continuum.
Research-informed studies rely on more than testimonial or professional insight by gathering data on youth outcomes from surveys, agency records, or other sources. They provide some evidence of effectiveness, but the level of confidence is "low." The basic problem is that they do not isolate the impact of the program from other possible influences on targeted youth outcomes.
Correlational studies can reveal whether a relationship exists between a program and a desired outcome (i.e., a positive relationship, a negative relationship, or no relationship). However, demonstrating that a relationship exists does not prove that one variable caused the other. For example, a correlational study might show that being in a community-based treatment program was associated with lower recidivism rates than being in a state correctional program, but this finding does not prove that the lower recidivism rate was due to being in the community program. The judge may be sentencing more serious offenders to the correctional program, or those in the community program may have come from slightly better homes, schools, or more positive peer groups, any of which could explain the difference in recidivism. Correlational studies cannot show that the program actually caused the difference in observed outcomes.
Other research-informed studies provide evidence of effectiveness by collecting survey data from program participants at posttest only or pretest and posttest. Because a control or comparison group is lacking, it is not clear that the program caused posttest outcomes or changes from pretest to posttest. Changes may well have occurred among similar subjects not going through the program. In other words, we cannot attribute the outcomes to the program, as the outcomes may have been produced by other factors.
Thus, research-informed studies lack an appropriate comparison group and evidence of a causal effect. These studies provide some preliminary support for a program that can help justify more rigorous experimental evaluation, but they rate low on the confidence continuum.
Experimental and Experimentally Proven (Evidence-Based Programs)
At the higher end of the continuum are "experimental" and "experimentally proven" studies. These comprise what is commonly referred to as "evidence-based programs (EBPs)." Virtually all web-based registries of EBPs require experimental evidence for certification as an EBP. All experimental studies use designs that involve comparison or control groups. If participants receiving the program have better outcomes than those in the comparison or control groups (i.e., those not receiving the program), the program is likely having the intended effect (i.e., is the cause of this effect). However, the levels of confidence and evidence of effectiveness attributed to experimental studies can vary from moderate to very high.
At the moderate range of confidence are a set of designs commonly called quasi-experimental designs (QEDs). The three identified on the graphic are the most frequently utilized designs of this type. These designs all lack the element of random assignment that characterizes randomized controlled trials (RCTs), and with it the certainty that the intervention and control groups are identical at the start of the study. Comparison groups may be matched on measured characteristics at the start of an evaluation study but may nonetheless differ on unmeasured characteristics. For example, schools adopting a program may be matched to other schools on achievement test scores or on gender, race, and socioeconomic composition. However, the schools receiving the intervention may have more motivated staff and students than the other schools, thus biasing tests of program effectiveness. The better the matching, the more accurate the evaluation of the program will be, but the level of confidence remains moderate because the potential for bias remains.
A higher level of evidence comes from randomized controlled trials (RCTs), where participants are randomly assigned to treatment and control groups. Randomization assures that the treatment and control groups are essentially identical on both measured and unmeasured characteristics at the start of the study. Equalizing pre-existing characteristics through randomization leaves only program participation and the randomization process itself, i.e., chance, as possible causes of differences in outcomes, and chance can be ruled out based on a test of statistical significance.
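The logic of randomization and ruling out chance can be made concrete with a short simulation. The sketch below is purely illustrative and uses hypothetical recidivism rates, not Blueprints data: subjects are randomly assigned to treatment and control groups, a binary outcome is recorded for each, and a permutation test asks how often random relabeling alone would produce a group difference at least as large as the one observed.

```python
import random

random.seed(42)

def simulate_rct(n=400, p_control=0.30, p_treatment=0.20):
    """Randomly assign n subjects, then simulate a binary outcome
    (e.g., reoffending) using hypothetical rates for each group."""
    subjects = list(range(n))
    random.shuffle(subjects)  # randomization equalizes pre-existing traits
    treatment, control = subjects[: n // 2], subjects[n // 2 :]
    t_outcomes = [1 if random.random() < p_treatment else 0 for _ in treatment]
    c_outcomes = [1 if random.random() < p_control else 0 for _ in control]
    return t_outcomes, c_outcomes

def permutation_p_value(t, c, reps=5000):
    """Test of statistical significance: how often does randomly
    relabeling subjects yield a difference as large as the observed one?"""
    observed = abs(sum(t) / len(t) - sum(c) / len(c))
    pooled = t + c
    hits = 0
    for _ in range(reps):
        random.shuffle(pooled)
        new_t, new_c = pooled[: len(t)], pooled[len(t) :]
        diff = abs(sum(new_t) / len(new_t) - sum(new_c) / len(new_c))
        if diff >= observed:
            hits += 1
    return hits / reps

t, c = simulate_rct()
p = permutation_p_value(t, c)
print(f"treatment rate={sum(t)/len(t):.2f}  "
      f"control rate={sum(c)/len(c):.2f}  p={p:.3f}")
```

A small p-value means chance is an unlikely explanation for the observed group difference, leaving program participation as the remaining cause, which is the inferential advantage an RCT provides.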
The highest level of confidence comes from multiple RCTs that show program benefits in different samples of randomly assigned subjects. Consistent findings from multiple RCTs across different sites greatly reduce the likelihood that chance explains the observed differences between intervention and control groups. They also increase the generalizability of the program, that is, the likelihood that the program will be effective in a wide range of intervention sites and with different types of subjects. Only programs with multiple RCTs have the level of confidence necessary to take to scale. Confidence increases further when RCTs are conducted by investigators independent of the developer (and with no financial interest in the program).
Direction of Effects
The direction of effects must also be considered. Studies can demonstrate positive effects, no effects, and harmful effects. Evidence that a program is ineffective or harmful requires the same level of evidence and confidence as the claim that a program is effective.
A Comment about the Quality of Evaluation Studies and Blueprint Program Ratings
The confidence continuum assumes that experimental evaluations are "high quality" studies. Blueprints for Healthy Youth Development has a rigorous peer review process that sets it apart from other program registries. An internal review identifies possible qualifying programs, and a highly qualified and distinguished advisory board reviews and makes final recommendations, certifying only experimentally evaluated programs that have been demonstrated effective in "high-quality" evaluations. Model programs have multiple RCTs or one RCT in combination with a QED; most have multiple RCTs. Promising programs have at least one RCT or two QEDs. Many experimental evaluations of programs fail to meet Blueprints quality standards. Some of the more common problems that prevent Blueprints certification of a program include: (a) failure to demonstrate that the randomization produced intervention and control group equivalence at baseline; (b) failure to show that attrition or loss of subjects does not compromise the randomization; (c) failure to use appropriate statistical measures that, for example, adjust for randomizing subjects at one level (such as school) while conducting the analysis at a different level (such as individual); and (d) failure to follow and analyze all subjects as assigned to their original condition (intent to treat). Not only is the type of design important to Blueprints, but so is the overall quality of the study and its ability to establish a cause-effect relationship.
Created by: Blueprints for Healthy Youth Development in collaboration with the Blueprints Policy Group, Nov. 11, 2015