Robert E. Slavin
Learning how to assess the validity of education research is vital for creating effective, sustained reform.
In every successful, dynamic part of our economy, evidence is the
force that drives change. In medicine, researchers continually develop
medications and procedures, compare them with current drugs and
practices, and if they produce greater benefits, disseminate them
widely. In agriculture, researchers develop and test better seeds,
equipment, and farming methods. In technology, in engineering, in field
after field, progress comes from research and development. Physicians,
farmers, consumers, and government officials base key decisions on the
results of rigorous research.
In education reform, on the other hand, research has played a
relatively minor role. Untested innovations appear, are widely
embraced, and then disappear as their unrealistic claims fail to
materialize. We then replace them with equally untested innovations
diametrically opposed in philosophy, in endless swings of the reform
pendulum. Far more testing goes into our students' hair gel and acne
cream than into most of the curriculums or instructional methods
teachers use. Yet which of these is more important to our students'
future?
Evidence-Based Reform
At long last, education reform may be entering an
era of well-researched programs and practices (Slavin, 2002). The U.S.
government is now interested in the research base for programs that
schools adopt. The Comprehensive School Reform Demonstration
legislation of 1997 gives grants to schools to adopt “proven,
comprehensive” reform designs. Ideally, “proven” means that programs
have been evaluated in “scientifically based research,” which is
defined as “rigorous, systematic, and objective procedures to obtain
valid knowledge” (U.S. Department of Education, 1998). The emphasis is
on evaluations that use experimental or quasi-experimental designs,
preferably with random assignment. The Bush administration's No Child
Left Behind Act mentions “scientifically based research” 110 times in
references to Reading First programs for grades K-3, Early Reading
First for preK, Title I school improvement programs, and many more. In
each case, schools, districts, and states must justify the programs
that they expect to implement under federal funding.
Judging the Validity of Education Research
The new policies that base education funding and
practice on scientifically based, rigorous research have important
consequences for educators. Research matters. Educators have long given
lip service to research as a guide to practice. But increasingly, they
are being asked to justify their choices of programs and practices
using the findings of rigorous, experimental research.
Why is one study valid whereas another is not?
There are many valid forms of research conducted for many reasons, but
for evaluating the achievement outcomes of education programs, judging
research quality is relatively straightforward. Valid research for this
purpose uses meaningful measures of achievement to compare several
schools that used a given program with several carefully matched
control schools that did not. It's that simple.
Control Groups
A hallmark of valid, scientifically based research
on education programs is the use of control groups. In a good study,
researchers compare several schools using a given program with several
schools not using the program but sharing similar demographics and
prior performance, preferably in the same school district. Having at
least five schools in each group is desirable; circumstances unique to
a given school can bias studies with just one or two schools in each
group.
A control group provides an estimate of what
students in the experimental program would have achieved if they had
been left alone. That's why the control schools must be as similar as
possible to the program schools at the outset.
Randomized and Matched Experiments
The most convincing form of a control group
comparison is a randomized experiment in which students, teachers, or
schools are assigned by chance to a group. For example, the principals
and staffs at ten schools might express interest in using a given
program. The schools might be paired up on the basis of similarity; within each pair, a coin flip would then assign one school to the experimental group and the other to the control group.
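As an illustration, here is a minimal sketch of that kind of pairwise random assignment. The school names and the number of pairs are hypothetical, not drawn from any study cited here.

```python
# A minimal sketch of pairwise random assignment with hypothetical schools:
# similar schools are paired, and within each pair a coin flip decides which
# school uses the program and which serves as the control.
import random

pairs = [
    ("School A", "School B"),
    ("School C", "School D"),
    ("School E", "School F"),
    ("School G", "School H"),
    ("School I", "School J"),
]

random.seed(42)  # fixed seed so the illustration is reproducible
program_group, control_group = [], []
for first, second in pairs:
    if random.random() < 0.5:  # the coin flip
        program_group.append(first)
        control_group.append(second)
    else:
        program_group.append(second)
        control_group.append(first)

print("Program schools:", program_group)
print("Control schools:", control_group)
```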
Randomized experiments are very rare in education,
but they can be very influential. Perhaps the best known example in
recent years is the Tennessee class size study (Achilles, Finn, &
Bain, 1997/1998) in which researchers assigned students at random to
small classes (15 students), regular classes (20–25 students), or
regular classes with an aide. The famous Perry Preschool Program
(Berrueta-Clement, Schweinhart, Barnett, Epstein, & Weikart, 1984)
assigned four-year-olds at random to attend an enriched preschool
program or to stay at home. Two recent studies of James Comer's School
Development Project randomly assigned schools to use the School
Development Project or keep using their current program (Cook et al.,
1999; Cook, Murphy, & Hunt, 2000). In each of these studies, random
assignment made it very likely that the experimental and control groups were equivalent at the outset, so any differences at the end could confidently be attributed to the program.
Matched studies are far more common than randomized
ones. In a matched program evaluation, researchers compare students in
a given program with those in a control group that is similar in prior
achievement, poverty level, demographics, and so on. Matched studies
can be valid if the experimental and control groups are very similar.
Often, researchers use statistical methods to “control for” pretest
differences between experimental and control groups. This can work if
the differences are small, but if there are large differences at
pretest, statistical controls or use of test-gain scores (calculated by
subtracting pretest scores from posttest scores) are generally not
adequate.
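For readers who want to see what such adjustments look like in practice, here is a minimal sketch with made-up student scores. It compares the two adjustments mentioned above, a gain-score comparison and a regression that controls for pretest scores (an ANCOVA-style analysis); the sample sizes, score distributions, and three-point program effect are purely illustrative.

```python
# A minimal sketch (illustrative data only) of two common ways to adjust for
# pretest differences in a matched study: gain scores and an ANCOVA-style
# regression of posttest on pretest plus a program indicator.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # students per group (illustrative)

# Hypothetical pretests; the program group starts slightly higher, which is
# exactly the situation that requires statistical adjustment.
pre_program = rng.normal(52, 10, n)
pre_control = rng.normal(50, 10, n)

# Hypothetical posttests: both groups grow, and the program adds 3 points.
post_program = pre_program + rng.normal(8, 6, n) + 3.0
post_control = pre_control + rng.normal(8, 6, n)

# 1) Gain scores: subtract pretest from posttest, then compare group means.
gain_diff = (post_program - pre_program).mean() - (post_control - pre_control).mean()

# 2) ANCOVA-style adjustment: regress posttest on pretest and a program
#    indicator; the indicator's coefficient is the adjusted program effect.
pre = np.concatenate([pre_program, pre_control])
post = np.concatenate([post_program, post_control])
group = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = program students
X = np.column_stack([np.ones_like(pre), pre, group])
coefs, *_ = np.linalg.lstsq(X, post, rcond=None)

print(f"gain-score difference: {gain_diff:.2f}")
print(f"ANCOVA-adjusted program effect: {coefs[2]:.2f}")
```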
The potential problem with even the best matched
studies is the possibility that the schools that chose a given program
have (unmeasured) characteristics that are different from those that
did not choose it. For example, imagine that a researcher asks 10 schools to implement a new program. Five enthusiastically take it on
and five refuse. Using the refusal group as a control group, even if it
is similar in other ways, can introduce something called selection
bias. In this example, selection bias would work in favor of finding a
positive treatment effect because the volunteer schools are more likely
to have enthusiastic, energetic teachers willing to try new methods
than are the control schools. In other cases, however, the most
desperate or dysfunctional schools may have chosen or been assigned to
a given program, giving an advantage to the control schools.
Is Random Assignment Essential?
Random assignment to experimental and control
groups is the gold standard of research. It virtually eliminates
selection bias because students, classes, or schools were assigned to
treatments not by their own choice but by the flip of a coin or another
random process.
Because randomized studies can rule out selection
bias, the U.S. Department of Education and many researchers and
policymakers have recently been arguing for a substantial increase in
the use of randomized designs in evaluations of education programs.
Already, more randomized studies are under way in education than at any
other point in history.
The only problem with random assignment is that it
is very difficult and expensive to do, especially for schoolwide
programs that necessitate random assignment of whole schools. No one
likes to be assigned at random, so such studies often have to provide
substantial incentives to get educators to participate. Still, such
studies are possible; we have such a study under way to evaluate our
Success for All comprehensive reform model, and, as noted earlier,
Comer's School Development Program has been evaluated in two randomized
studies.
At present, with the movement toward greater use of
randomized experiments in education in its infancy, educators
evaluating the research base for various programs must look carefully
at well-matched experiments, valuing those that try to minimize bias by
using closely matched experimental and control groups, having adequate
numbers of schools, avoiding comparing volunteers with nonvolunteers,
and so on.
Statistical and Educational Significance and Sample Size
Reports of education experiments always indicate
whether a statistically significant difference exists between the
achievement of students in the experimental group and those in the
control group, usually controlling for pretests and other factors. A
usual criterion is “p &lt; 0.05,” which means that a difference as large as the one observed would have had less than a 5 percent chance of occurring if the program had no real effect.
A finding that students in a program scored “significantly higher” than those in a control group is important, but it does not tell us how large the difference was. In a large study, even a small difference can be statistically significant. A typical measure of the size of a program effect is “effect size”: the experimental-control difference divided by the control group's standard deviation (a measure of the dispersion of scores). In education experiments, an effect size of +0.20 (20 percent of a standard deviation) is often considered the minimum for educational significance; effect sizes above +0.50 would be considered very strong.
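To make the distinction concrete, here is a minimal sketch using made-up data rather than results from any cited study. With 2,000 students per group, a one-point difference is likely to come out statistically significant even though the effect size is only about +0.10, well below the +0.20 threshold just described.

```python
# A minimal sketch with simulated data: statistical significance (the p value)
# versus effect size (experimental-control difference divided by the control
# group's standard deviation).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
experimental = rng.normal(52, 10, 2000)  # hypothetical posttest scores
control = rng.normal(51, 10, 2000)

t_stat, p_value = stats.ttest_ind(experimental, control)
effect_size = (experimental.mean() - control.mean()) / control.std(ddof=1)

print(f"p value: {p_value:.4f}")        # likely "significant" at p < 0.05
print(f"effect size: {effect_size:+.2f}")  # yet educationally small
```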
But the sheer number of students can be misleading. Often, an experiment will compare one school using
Program X with one matched control school. If 500 students are
in each school, this is a very large experiment. Yet the difference
between the Program X school and the control school could be due to any number of factors that have nothing to do with Program X. Perhaps the Program X
school has a better principal or a cohesive group of teachers or has
been redistricted to include a higher-performing group of students.
Perhaps one of the schools experienced a disaster of some sort—in an
early study of our Success for All program, Hurricane Hugo blew the
roof off of the Success for All school but did not affect the one
control school.
Because of the possibility that something unusual
that applies to an entire school could affect scores for all the
students in that school, statisticians insist on using school means, not individual student scores, in such analyses. In this way,
individual school factors are likely to balance out. Statistical
requirements would force a researcher to have at least 20–25 schools
in each condition. Very few education experiments are this large,
however, so the vast majority of experiments analyze at the student
level.
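Here is a minimal sketch, again with simulated numbers rather than real data, of what school-level analysis looks like: each school contributes one mean score, and the comparison is made between those means. The whole-school factor in the simulation stands in for influences such as an unusually strong principal that move every student's score in a school.

```python
# A minimal sketch with simulated data: student scores are averaged within
# each school, so the school (not the student) becomes the unit of analysis.
import numpy as np

rng = np.random.default_rng(2)

def simulate_school_means(n_schools, program_effect=0.0,
                          school_effect_sd=5.0, students_per_school=100):
    """Return one mean score per school, including a whole-school factor."""
    means = []
    for _ in range(n_schools):
        school_factor = rng.normal(0, school_effect_sd)  # e.g., strong principal
        students = rng.normal(50 + school_factor + program_effect, 10,
                              students_per_school)
        means.append(students.mean())
    return np.array(means)

program_means = simulate_school_means(5, program_effect=2.0)  # hypothetical benefit
control_means = simulate_school_means(5)

print("program school means:", np.round(program_means, 1))
print("control school means:", np.round(control_means, 1))
print("difference between group means:",
      round(float(program_means.mean() - control_means.mean()), 2))
```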
Readers of research must apply a reasonable
approach to this problem. We should view studies that observe a single
school or class for each condition with great caution. However, a study
with as many as five program schools and five control schools probably
has enough schools to ensure that a single unusual school will not skew
the results. Such a study would still use individual scores, not school
means, but it would be far preferable to a comparison between only two
schools.
A single study involving a small number of schools
or classes may not be conclusive in itself, but many such studies,
preferably done by many researchers in a variety of locations, can add
confidence that a program's effects are valid. In fact, experimental
research in education usually develops in this way. Rather than relying on one large, definitive study, we must usually look at many small studies, each of which may be flawed in various (unbiased) ways. But
if these studies tend to find consistent effects, the entire set of
studies may produce a meaningful conclusion.
Research to Avoid
All too often, program developers or advocates cite
evidence that is of little value or that is downright misleading. A
rogue's gallery of such research follows.
Cherry Picking
Frequently, program developers or marketers report
on a single school or a small set of schools that made remarkable gains
in a given year. Open any education magazine and you'll see an ad like
this: “Twelfth Street Elementary went from the 20th percentile to the
60th in only one year!” Such claims have no more validity than
advertisements for weight loss programs that tell the story of one
person who lost 200 pounds (forgetting to mention the hundreds who did
not lose weight on the diet). This kind of “cherry picking” is easy to
do in a program that serves many schools; there are always individual
schools that make large gains in a given year, and the marketer can
pick them after the fact just by looking down a column of numbers to
find a big gainer. (Critics of the program can use the same technique
to find a big loser.) Such reports are pure puffery, not to be confused
with science.
Bottom Fishing
A variant of cherry picking is “bottom fishing,”
using an after-the-fact comparison in which an evaluator compares
schools using a given program with matched “similar schools” known to
have made poor gains in a given year. Researchers can legitimately
compare gains made in program schools and gains made in the entire
district or state because the large comparison group makes “bottom
fishing” impossible. However, readers should interpret with caution
after-the-fact studies purporting to compare groups selected by the
evaluator.
Pre–Post Studies
Another common but misleading design is the
pre–post comparison, lacking a control group. Typically, the designer
cites standardized test data, with the rationale that the expected
year-to-year gain in percentiles, normal curve equivalents, or percent
passing is zero, so any school that gained more than zero has made good
progress.
The problem with this logic is that many states and
districts make substantial gains in a given year, so the program
schools may be doing no better than other schools. In particular,
states usually make rapid gains in the years after they adopt a new
test. At a minimum, studies should compare gains made in program
schools in a given district or state with the gains made in the entire
district or state.
Scientifically Based Versus Rigorously Evaluated
A key issue in the recent No Child Left Behind
legislation is the distinction between programs that are “based on
scientifically based research” and those that have been evaluated in
valid scientific experiments. A program can be “based on scientifically
based research” if it incorporates the findings of rigorous
experimental research. For example, reading programs are eligible for
funding under the federal Reading First initiative if states determine
that they incorporate a focus on five elements of effective reading
instruction: phonemic awareness, phonics, fluency, vocabulary, and
comprehension. The National Reading Panel (1999) identified these
elements as having been established in rigorous research, especially in
randomized experiments. Yet there is a big difference between a program
based on such elements and a program that has itself been
compared with matched or randomly assigned control groups. We can
easily imagine a reading program that would incorporate the five
elements but whose training was so minimal that teachers did not
implement these elements well, or whose materials were so boring that
students were not motivated to study them.
The No Child Left Behind guidance (U.S. Department
of Education, 2002) recognizes this distinction and notes a preference
for programs that have been rigorously evaluated, but also recognizes
that requiring such evaluations would screen out many new reading
programs that have not been out long enough to have been evaluated, and
so allows for their use. This approach may make sense from a pragmatic
or political perspective, but from a research perspective, a program
that is unevaluated is unevaluated, whether or not it is “based on”
scientifically based research. A basis in scientifically based research
makes a program promising, but not proven.
Research Reviews
In order to judge the research base for a given
program, it is not necessary that every teacher, principal, or
superintendent carry out his or her own review of the literature.
Several reviews applying consistent standards have summarized the evidence on various programs.
For comprehensive school reform models, for
example, the American Institutes for Research published a review of 24
programs (Herman, 1999). The Thomas Fordham Foundation (Traub, 1999)
commissioned an evaluation of 10 popular comprehensive school reform
models. And Borman, Hewes, Rachuba, and Brown (2002) carried out a
meta-analysis (or quantitative synthesis) of research on 29
comprehensive school reform models.
Research reviews facilitate the process of
evaluating the evidence behind a broad range of programs, but it's
still a good idea to look for a few published studies on a program to
get a sense of the nature and quality of the evidence supporting a
given model. Also, we should look at multiple reviews because
researchers differ in their review criteria, conclusions, and
recommendations. Adopting a program for a single subject, much less for
an entire school, requires a great deal of time, money, and work—and
can have a profound impact on a school for a long time. Taking time to
look at the research evidence with some care before making such an
important decision is well worth the effort. Accepting the developer's
word for a program's research base is not a responsible strategy.
How Evidence-Based Reform Will Transform Our Schools
The movement to ask schools to adopt programs that
have been rigorously researched could have a profound impact on the
practice of education and on the outcomes of education for students. If
this movement prevails, educators will increasingly be able to choose
from among a variety of models known to be effective if well
implemented, rather than reinventing (or misinventing) the wheel in
every school. There will never be a guarantee that a given program will
work in a given school, just as no physician can guarantee that a given
treatment will work in every case. A focus on rigorously evaluated
programs, however, can at least give school staffs confidence that
their efforts to implement a new program will pay off in higher student
achievement.
In an environment of evidence-based reform,
developers and researchers will continually work to create new models
and improve existing ones. Today's improvements will soon be replaced by something even more effective. Rigorous evaluations will be
common, both to replicate evaluations of various models and to discover
the conditions necessary to make programs work. Reform organizations
will build capacity to serve thousands of schools. Education leaders
will become increasingly sophisticated in judging the adequacy of
research, and, as a result, the quality and usefulness of research will
grow. In programs such as Title I, government support will focus on
helping schools adopt proven programs, and schools making little
progress toward state goals may be required to choose from among a set
of proven programs.
Evidence-based reform could finally bring education
to the point reached early in the 20th century by medicine,
agriculture, and technology, fields in which evidence is the lifeblood
of progress. No Child Left Behind, Reading First, Comprehensive School
Reform, and related initiatives have created the possibility that
evidence-based reform can be sustained and can become fundamental to
the practice of education. Informed education leaders can contribute to
this effort. It is ironic that the field of education has embraced
ideology rather than knowledge in its own reform process.
Evidence-based reform honors the best traditions of our profession and
promises to transform schooling for all students.
References
Achilles, C. M., Finn, J. D., & Bain, H. P. (1997/1998). Using class size to reduce the equity gap. Educational Leadership, 55(4), 40–43.
Berrueta-Clement, J. R., Schweinhart, L. J., Barnett, W. S., Epstein, A. S., & Weikart, D. P. (1984). Changed lives. Ypsilanti, MI: High/Scope.
Borman, G. D., Hewes, G. M., Rachuba, L. T., & Brown, S. (2002). Comprehensive school reform and student achievement: A meta-analysis. Submitted for publication. (Available from the author at gborman@education.wisc.edu)
Cook, T. D., Habib, F., Phillips, M., Settersten, R. A., Shagle, S., & Degirmencioglu, M. (1999). Comer's school development program in Prince George's County, Maryland: A theory-based evaluation. American Educational Research Journal, 36(3), 543–597.
Cook, T., Murphy, R. F., & Hunt, H. D. (2000). Comer's school development program in Chicago: A theory-based evaluation. American Educational Research Journal, 37(2), 543–597.
Herman, R. (1999). An educators' guide to schoolwide reform. Arlington, VA: Educational Research Service.
National Reading Panel. (1999). Teaching children to read. Washington, DC: U.S. Department of Education.
Slavin, R. E. (2002). Evidence-based education policies: Transforming educational practice and research. Educational Researcher, 31(7), 15–21.
Traub, J. (1999). Better by design? A consumer's guide to schoolwide reform. Washington, DC: Thomas Fordham Foundation.
U.S. Department of Education. (1998). Guidance on the comprehensive school reform demonstration program. Washington, DC: Author.
U.S. Department of Education. (2002). Draft guidance on the comprehensive school reform program (June 14, 2002 update). Washington, DC: Author.