Bad advocacy research abounds on school reform

DC school reform was a failure, claims a new report from the Economic Policy Institute (EPI). It's a proven success, others insist. All sides of the school reform debate are guilty of misinterpreting federal test data in ways that serve advocacy goals rather than the search for truth.


The EPI report blasts DC's sweeping 2007 school reforms and similar efforts in Chicago and New York City. One of the report's most striking claims is that school reform in DC actually lowered student test scores and widened achievement gaps. It reaches that conclusion through a flawed analysis of National Assessment of Educational Progress (NAEP) test scores.

They're not the only offenders. In January, the Washington Post editorial board assured readers that despite alleged cheating on the DC CAS, NAEP data demonstrate that school reform has succeeded. A letter to the editor the next day from Alan Ginsburg, director of policy at the Department of Education from 2002 to 2010, argued that NAEP shows the exact opposite.

Beware of arguments that use NAEP to defend or attack policies like charter expansion or teacher layoffs. NAEP simply isn't designed for that purpose. You will not find peer-reviewed research drawing such conclusions from NAEP data, because doing so is a well-known error that has been widely discredited.

I have decided this needs its own term: "misnaepery."

What is wrong with using NAEP data in this way?

NAEP is a test given to a random sample of students in grades 4, 8, and 12 across the country. It's designed to gauge long-term trends in student academic proficiency, not to track how a fixed group of students learns over time.

Each test looks at a different set of students from the one before. Those who take the test one year in grade 4 are usually in grade 5 the next year, where they won't take the test. Those still in grade 4 wouldn't necessarily be in the random sample again anyway.

A test that looks at different groups of students at different points in time (a "trend" or "repeated cross-section" measure) can't clearly tell you whether a school is doing a better job of educating students, because they are different students. Maybe the demographics of the neighborhood or city changed. Maybe some students moved to or from charter schools.

The 8th grade NAEP measures not what a middle or junior high school has done since a previous group of students took the test, but the cumulative effect of everything those students experienced through grade 8. If something changed in the district's kindergartens nine years earlier, that change would show up in the scores of 8th graders depending on whether they entered kindergarten before or after it.

These shifts are called "cohort changes." In short, when you measure a group of students and then a different group of students at another time, the second group could be very different for many reasons. I wrote a more technical paper about this if you want to see a more mathematical analysis of the bias inherent in these types of measures.
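The cohort problem is easy to see in a toy simulation. The sketch below uses entirely made-up numbers (not real NAEP data or scales): two cross-sectional cohorts receive identical instruction, yet the average score rises simply because the second cohort draws more students from an advantaged baseline.

```python
import random

random.seed(0)

def cohort_mean(n, share_advantaged, instruction_effect):
    """Average score for one simulated cross-sectional cohort.

    Advantaged students start from a higher baseline (made-up
    numbers); instruction contributes equally to everyone.
    """
    total = 0.0
    for _ in range(n):
        baseline = 240 if random.random() < share_advantaged else 220
        total += baseline + instruction_effect + random.gauss(0, 10)
    return total / n

# Identical instructional quality in both years...
year1 = cohort_mean(2000, share_advantaged=0.20, instruction_effect=10)
# ...but the second cohort is demographically different.
year2 = cohort_mean(2000, share_advantaged=0.40, instruction_effect=10)

# Scores rise by several points with no change in teaching quality.
print(round(year2 - year1, 1))
```

A trend measure would read that gap as "improvement," when all that changed was who sat for the test.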

In the case of DC school reform, misnaepery is especially inexcusable because a panel of experts from the National Academy of Sciences specifically warned that the NAEP does not provide causal evidence on the DC reforms' impact. The EPI report's authors may be right that reform proponents made exaggerated claims that reform was successful when test scores rose. But making even more exaggerated claims in the other direction is the wrong response.

We need better data and more objective research

Instead, we must be humble about what can be learned from existing data. We must also invest in better data and more focused data-gathering efforts. Instead of repeated cross-sections, we need longitudinal "growth measures," where you take a group of students who were exposed to a policy (and ideally others who were not) and follow those same students over time.
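By contrast, a growth measure links each student's scores across years before averaging. A minimal sketch, with hypothetical student IDs and scores for illustration only:

```python
# Follow the same students across two years, matched on a
# (hypothetical) student ID; scores are invented for illustration.
scores_2012 = {"s1": 210, "s2": 225, "s3": 198}   # grade 4
scores_2013 = {"s1": 222, "s2": 231, "s3": 214}   # grade 5, same students

# Compute each matched student's gain, then the average gain.
gains = {sid: scores_2013[sid] - scores_2012[sid]
         for sid in scores_2012 if sid in scores_2013}
avg_gain = sum(gains.values()) / len(gains)
print(avg_gain)  # average growth for the matched students
```

Because each gain is computed within a student, demographic shifts between cohorts can't masquerade as learning.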

The NAS experts in 2011 recommended a set of metrics, mostly longitudinal, that DC could use to evaluate school reform policies. That would help, though it still couldn't definitively prove whether reform worked, unless there were another group of kids untouched by reform to serve as a control group.

Better data would also help estimate the impacts of specific, replicable reforms, rather than trying to settle a pointless debate about whether the broad suite of DC education reforms as a whole were collectively good or collectively bad.

Some researchers do use data intelligently to answer focused questions about specific changes, such as this paper from last summer about school closures.

To improve DC education, we need purposeful experiments that try out promising practices and then collect the data to evaluate them. We need to collect more useful data and to recognize the limitations of the data we have. Researchers have a responsibility to their audiences to not oversell what existing data can tell us.

Steven Glazerman is an economist who studies education policy and specializes in teacher labor markets. He has lived in the DC area off and on since 1987 and settled in the U Street neighborhood in 2001. He is a Senior Fellow at Mathematica Policy Research, but any of his views expressed here are his own and do not represent Mathematica. 



Thank you so much for this post. It is infuriating to read the headline grabbing claims made by both sides and then discovering little, if any, justification for those claims. I think it is especially telling when "anti" education reformers rail against any value in testing students but paradoxically rely on limited testing data to "prove" the reformers have failed. At the same time, a quality evaluation of the progress of reform cannot begin and end with days of filling in ovals, even if those ovals are comparable data set to data set.

by TM on Apr 30, 2013 12:38 pm

I so appreciate this and wholeheartedly agree. I would like to know how I can promote this as someone who would like to see policy decisions made based on appropriate, quality data. It seems like policy makers use data as little more than set dressing and it frustrates me, but I don't know how to encourage more useful data collection without changing careers and collecting it myself. Ideas?

by Erin on Apr 30, 2013 12:45 pm

Better data can be expensive. Given the dollars devoted to education, it makes budgetary sense to collect more data. However, there is also a cost to students in time devoted to testing.

You seem especially interested in longitudinal data, following a particular student over time. Can this be done with data already collected, since students are tested in consecutive grades, for example? The burden of collecting additional longitudinal data would necessarily fall disproportionately on a subset of students. How accessible is the current testing data? Is it stored in a way that makes it possible to track students over time? Other than testing data, what type of data would be valuable?

In general, I'm in favor of making data more accessible when possible, allowing good research to drive out bad research, but clearly there are serious privacy concerns when it comes to student test scores.

by SE on Apr 30, 2013 12:58 pm

You almost wonder if there isn't a way other than direct testing to find out whether kids are learning things in their classes. For example, a homework assignment in all DCPS 5th grade English classes that asks for 200 written words on any topic, which could be analyzed for grade-level writing. Something that indicates level of skill and interest in learning. I have no idea if this would work, but it's worth thinking outside the box, given that testing with consequences inevitably leads to preparing for the consequences.

by andy on Apr 30, 2013 2:55 pm

I'm in favor of using numbers and stats to improve education, but not when they're twisted into reaching improper conclusions. Thanks for the clarification on this data.

By the way...what conclusions could we draw from the NAEP data?

by Jessica Christy on Apr 30, 2013 6:42 pm

In response to some of the comments above I would add a couple of points.

First, the good news: We do have longitudinal data. The DC-CAS has been administered every year to students in grades 3 through 8, and DCPS and OSSE have worked hard to ensure that students can be tracked over time and across schools and school sectors (district and charter) and linked to their specific teachers.

Also, researchers in the DC area have been working through a consortium to use these data appropriately to evaluate DC school reform. The consortium is called DC-EdCORE (Education Consortium on Research and Evaluation) and is housed in the George Washington University's Graduate School of Education and Human Development. Consortium members, like my organization (Mathematica) as well as the American Institutes for Research, the RAND Corporation, and Policy Studies Associates, have access to DC-CAS and other data that can shed light on the city's questions about school reform. In addition to DC EdCORE, other researchers at institutions like Carnegie Mellon and the University of Virginia are conducting rigorous research into education policy issues using data from DC. It's important for city leaders and philanthropists to tap into these resources to make sure high stakes decisions are well informed by data and sophisticated analysis.

Now the bad news: DC EdCORE and the others focused on DC have limited funding and, even with their access to the best data in existence, are limited by the lack of a meaningful comparison group. The problem anyone faces in evaluating DC's sweeping school reform is not a lack of longitudinal data but the fact that everything changed at once. It would be impossible to reconstruct the outcomes that would have been realized had the 2007 law never been passed, or to know what the "active ingredient" is in its success, if any.

Instead we should be varying policies incrementally going forward, conducting pilots and experiments to test the efficacy of proposed changes before they are brought to scale. There are many important outstanding questions that could benefit from some careful data collection and analysis.

by Steven Glazerman on May 1, 2013 12:00 am

@SE How accessible is the current testing data? Is it stored in a way that it's possible to track students over time? Other than testing data, what type of data would be valuable?

In general, I'm in favor of making data more accessible when possible, allowing good research to drive out bad research, but clearly there are serious privacy concerns when it comes to student test scores.

Yes we have the data and it's longitudinal, but we have to re-obtain permission from DCPS and/or OSSE every time we use it to justify the specific purpose. We have extremely tight security requirements that are designed to avoid even the risk of disclosure of sensitive personal information. This is standard practice and is required of any organization that is granted permission to use individually identifiable data.

by Steven Glazerman on May 1, 2013 12:04 am

@Jessica Christy What can we conclude from the NAEP data? Good question. NAEP data tell us how students are doing on average at a point in time. If NAEP scores are higher than they have been, it could reflect long-term changes in student demographics as well as changes in the quality of instruction. You just can't tell which one.

In other words, NAEP is good at telling us how kids are doing, but not so good at telling us how adults are doing at teaching kids.

by Steven Glazerman on May 1, 2013 12:10 am

But Steven, if NAEP scores were going up at a faster rate during the Vance and Janey years of leading DCPS, then the improvements slowed during the Rhee/Henderson years, and if during the Rhee/Henderson years there was a demographic shift bringing more white and privileged students back into the school system, is it not possible to conclude that the combination of the demographic shift and the trend in NAEP scores at least raises serious questions about the strategies being pursued? It may be true that the NAEP scores by themselves are not dispositive, but aren't you overstating the case for discounting them? It is also curious that you waited until now to raise this critique, when for years the Washington Post editors and DCPS itself have been parading NAEP as proof of success. Where was your critique then?

by Mark Simon on May 1, 2013 9:28 am


People have done some pretty clever things with NAEP to tease out information that is relevant for policy. For example, they have used NAEP to create a mapping of one state's longitudinal testing system onto another's. But I still think NAEP is too limited for making causal claims about DC school reform's impact on student achievement, at least not without external data on test-takers' socio-economic background and other factors like English proficiency and special needs. School reform likely does have some impact on enrollment dynamics, neighborhood composition, and migration patterns into and out of the city, but again, NAEP is not really the data source I would use to study that phenomenon.

I've tried to provide GGW (and now GGE) with some critical comments on reports that are likely to influence policymakers in DC, especially long reports where the conclusions will be read by many, but the details by few. For example, I was very critical of a report issued by an organization that is associated with the charter movement. I don't draft critiques of every data-based claim made in every editorial, though. I wish I had the time. This is a volunteer effort. But my issue with the EPI report was motivated by the methods, not the conclusions.

by Steven Glazerman on May 1, 2013 1:05 pm

Steven, thank you for your time and this level of detail. It is enormously helpful and interesting to see something so commonly cited be thoughtfully deconstructed.

What this highlights for me, though it comes as no surprise, is the damaging effect that the desire for politically expedient answers can have on the effort to develop and share realistic analysis of complex systems.

by Katherine Mereand-Sinha on May 1, 2013 2:03 pm

Is Glazerman saying that the DC tests capture gains while the NAEP tests do not? Why can't year-to-year NAEP changes show gains, albeit with samples of different students and with measurable sampling error?

But if the DC test scores are tainted by cheating and curricular narrowing, the extent of which can't be known due to politics and limited investigative resources, then both tests are flawed, and the solution would be to produce reports that generously acknowledge the limitations.

by Andrea on May 8, 2013 7:44 pm

Steve, are the DC-EdCORE tests free from the limitations plausibly alleged to characterize other DC data, namely, that high-stakes tests have been associated with corrupted testing processes?

The EPI study raised a wide variety of interesting points, not just empirical ones but also points about how the empirical evidence relates to the research-based context of how "high stakes" can affect school functioning in very mixed ways. An example is the report that the principals under Rhee, the ones *she* hired, left in droves for some mysterious reason.

I am with Mark in that citing possible demographic changes is a hollow criticism if the demographic changes were in the direction of greater privilege, not less.

The limitations of a dataset ought not be grounds for all-out nihilism regarding its uses; without alternative data, it is hard or impossible to investigate the tendency of politicians to say all is hunky-dory. Whatever EdCORE has to offer, I have the feeling it has not offered much insight to date, perhaps due to the issues you mentioned.

The case of the school during Rhee's tenure that claimed 40-point gains, which magically vanished under the next principal, is very concerning.

And this question is for anyone. Is the Washington Post article credible? What are its conclusions based on? How does the Post get around the problem that the extent of the cheating or curricular narrowing is almost surely underestimated rather than overestimated, due to the procedurally understandable presumption of innocence, limited resources, and political limitations?

by Andrea on May 8, 2013 8:28 pm
