Allegations about test scores miss the whole story

DC Council Education Committee Chair David Catania has alleged that testing officials inflated the percentage of students reported as "proficient" on standardized tests given earlier this year.

David Catania. Photo from the DC Council website.

Officials say they were just trying to ensure this year's scores could be compared with those from previous years. But according to multiple sources, the real story has to do with inappropriate questions from DC's testing vendor.

In recent years DC schools have begun teaching more rigorous content aligned with the Common Core standards, and this past school year students took a revised version of the DC CAS designed to test that content. Because the test had changed, some officials at the Office of the State Superintendent of Education (OSSE) were working with the District's testing vendor, CTB McGraw Hill, to change the grading scale. The new scale would have used different minimum scores, or "cut scores," to define levels like "proficient" and "advanced."

In June of this year, responsibility for testing was transferred from the Director of Assessments to Director of Data Management Jeff Noel. Noel says he was surprised to learn that testing officials had expected to implement the new grading scale this year.

Using the new scale would have made it impossible to compare this year's proficiency rates to the levels reported in previous years, a fact no one outside OSSE had been made aware of. Only 6 days after taking over responsibility for testing, Noel decided to switch back to the old grading scale, with the support of others at OSSE.

Catania alleged in a hearing last Thursday that OSSE chose to switch to the old grading scale at the last minute in order to ensure gains in both reading and math. If OSSE had used the new grading scale, with its new cut scores, math scores would have been lower, by 3.6 points, and reading scores would have been higher, by 6.6 points.

These scores would not have been comparable to previous scores, since the grading scales were different. But observers might have missed that point. When other states, like New York, adopted new Common Core-aligned grading scales, they saw dramatic drops in scores. These states made it clear that comparisons to previous years would not be possible, but the decline in scores led to a public outcry nonetheless.

Test vendor's question

At the hearing Catania accused OSSE officials of manipulating the grading scale to produce gains in both reading and math scores—gains that Mayor Vincent Gray declared "historic" when they were released in July.

But two people involved in the process told Greater Greater Education that Noel was also concerned about the test vendor's approach to setting new cut scores. Those individuals said that at a June 17 meeting, a CTB executive asked OSSE officials: "What growth do you think makes sense for the state?"

In addition, CTB gave OSSE a form to guide the cut score process that allowed officials to explicitly indicate where the scores would be expected to end up. Choosing lower cut scores would have allowed them to report greater improvement in proficiency rates.

DC CAS Reflection Form provided by CTB to OSSE

The two individuals, who asked to remain anonymous, said that the decision to return to the old cut scores was partly motivated by concerns about CTB's process and a desire to distance OSSE from it. CTB spokesperson Brian Belardi said CTB McGraw Hill has no comment.

Catania also made another allegation, with some justification: OSSE didn't reveal that even under the grading scale it ended up using, this year's scores are not as comparable to prior years' scores as they have been in the past. In most years the content covered by the test is the same as in previous years, but between 2012 and 2013 some of the content changed. Emily Durso, interim State Superintendent of Education, said that OSSE's failure to mention this qualification was simply an oversight.

While the true motives of OSSE officials in switching back to the old grading scale may differ from those Catania alleged, they are no less concerning.

Catania told the Washington Post that this controversy has him questioning whether so many high-stakes decisions should be made based on scores involving so much "subjectivity." Many advocates have been saying as much for years.

Others have argued that the problem isn't high-stakes testing generally, but rather that fundamental changes are necessary to restore faith in the testing system.

OSSE needs more independence

The first of these changes would be to give OSSE independence from the Mayor, similar to the autonomy conferred on DC's Chief Financial Officer. In fact, Catania has proposed legislation providing that the Mayor could dismiss the State Superintendent only for cause, like the CFO.

Mayor Gray is the only head of a public school system who also hires and fires the state superintendent in charge of testing and auditing the schools. Some observers feel that the Gray administration must have pressured OSSE to switch to the old grading system, although there's no hard evidence to support that conclusion.

Even if no such pressure was applied, the testing vendor's apparent willingness to base scoring decisions on expected improvements in proficiency rates creates a temptation that must be isolated from political officials.

The CAS controversy also demands a fresh look at the testing measures we rely on. It's precisely because proficiency metrics are so subjective that they are unreliable and open to political manipulation.

Measure growth, not "proficiency"

Instead of focusing on the "percent proficiency" metrics that are at the heart of this controversy, we need to use measures of growth: the ability of a classroom teacher to increase students' educational attainment.

When test results are based on a proficiency cut score, they indicate the percentage of test-takers who scored higher than that minimum. The advantage of this approach is that it gives the public a sense of what an acceptable score is. But it tends to magnify small changes and reveals little about changes in scores that are either well above or well below "proficiency."

Tests that are scored for "growth," on the other hand, use averages based on all participants' scores and compare them to previous years' averages. This method allows all changes throughout the test-taking group to be reflected in the final results. It's also objective: the calculation doesn't require making any year-to-year judgment calls about how to interpret the results.
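The contrast can be made concrete with a minimal Python sketch. The cut score of 40 and all score data below are invented for illustration; they are not actual DC CAS values.

```python
# Hypothetical "proficient" threshold (invented for illustration)
CUT_SCORE = 40

# Invented scores for the same five students in two consecutive years;
# every student gained 1-2 points.
last_year = [30, 38, 39, 55, 70]
this_year = [32, 39, 41, 56, 71]

def percent_proficient(scores, cut=CUT_SCORE):
    """Share of test-takers at or above the cut score, as a percentage."""
    return 100 * sum(s >= cut for s in scores) / len(scores)

def average_score(scores):
    """Mean score across all participants, reflecting every change."""
    return sum(scores) / len(scores)

# Percent proficient jumps 20 points because one student crossed the cut...
print(percent_proficient(last_year))  # 40.0
print(percent_proficient(this_year))  # 60.0

# ...while the average, which weighs every student's change, moves modestly.
print(average_score(last_year))  # 46.4
print(average_score(this_year))  # 47.8
```

One student moving from 39 to 41 swings the proficiency rate from 40% to 60%, while the equal gains of students far above or below the cut score are invisible to that metric; the average captures them all.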

Measures of growth would also highlight varying changes at different skill levels. Teachers would have an incentive to raise all scores, not simply to push students who are slightly below proficiency to being slightly above.

OSSE actually does report a metric of growth, or value added, known as MGP. But, with the notable exception of the IMPACT system for assessing teachers, few evaluations are based on such scores. Principals, for example, are assessed on their ability to increase percent-proficient numbers, even though such numbers can be affected more by demographic changes or students transferring between schools than by instructional quality. And DCPS, OSSE, and the Public Charter School Board continue to present "percent proficiency" numbers most prominently in school profiles, leading parents to believe these are the best indicators of school quality.

Do you think the credibility of the testing regime in DC can be restored through these changes? Or do you think the problem is high-stakes testing itself, and that test scores should figure less prominently in school decisions?

Ken Archer is CTO of a software firm in Tysons Corner. He commutes to Tysons by bus from his home in Georgetown, where he lives with his wife and son. Ken completed a Masters degree in Philosophy from The Catholic University of America. 
Rahul Mereand-Sinha was born in DC and grew up nearby in Bethesda. He now lives in Kalorama Triangle with his wife Katherine. He has a Masters of Public Policy from the University of Maryland and moonlights as a macroeconomist. 



Under the current percent proficient method, a student who gains 2 points and crosses the threshold into proficient is celebrated more than a student who gained 15 points but remained in the basic category. Something about that just seems wrong.

by Jessica Christy on Oct 3, 2013 1:48 pm • linkreport

Actually, test scores and how they're used and misused are the real problem here. Group-administered tests are problematic, in part because they are easy to game, and even under the best of circumstances their administration is more prone to random variance than individually administered tests, which makes the scores less reliable. The small changes from year to year that evoke headlines and pearl-clutching are usually not statistically significant and basically meaningless. Also, these tests were never meant to evaluate programs, although they are used this way, and many, such as those developed for NCLB, were developed in haste by people who don't know much about evaluating pedagogy (FWIW, one of the main developers supervised part of my postdoc: a brilliant man out of his depth for this purpose).

In previous, less fevered times, new curricula were implemented gradually over a period of time. Usually, they began with teachers who were considered good exemplars of "early adoption": respected, open-minded, and viewed as competent. Now, there's an effort to put a new curriculum in place in a year. The important metrics are the quality of the research behind it (the standards seem more consensus-based than empirically based in the strictest sense) and how much fidelity there is in the implementation. The latter is more art than science and requires strong, competent administration. I suspect DCPS lacks that kind of administration, and suspect that even the tiffany districts are mixed, at best, in that department. Administrators other than superintendents and, occasionally, principals mostly get little notice from people concerned with the ever-popular question of "what's wrong with our schools?" Superintendents and teachers (not to mention their unions) tend to be the bigger targets in all of this, although not always the most appropriate ones.

Most likely, implementation has been variable at best. Monitoring probably has been spotty. Evaluation probably has been non-existent. Upper management's concern probably has been "looking good" to various stakeholders. Of course the tests are going to be gamed in some way. They're utterly inappropriate to the task, but they're what parents, politicians, and gadflies rely on, even if they have no idea how to use them.

by Rich on Oct 3, 2013 7:53 pm • linkreport

Under the current percent proficient method, a student who gains 2 points and crosses the threshold into proficient is celebrated more than a student who gained 15 points but remained in the basic category. Something about that just seems wrong.

Any subjective metric like proficiency rates can be gamed - teachers can game it as you mention, and testing officials can game it as our testing vendor appeared to have offered to do.

That's why we must (1) STOP reporting proficiency rates and instead report growth metrics, and (2) firewall this process from political officials.

by Ken Archer on Oct 4, 2013 10:18 am • linkreport

It's nice to see a city councilmember take education seriously. The rest of them are happy to deflect this issue to toothless state boards. David Catania should consider running for mayor...he'd very possibly win.

by MikeR on Oct 4, 2013 11:55 am • linkreport

David Catania should consider running for mayor...he'd very possibly win.

Seriously? If there's one thing to be noted from this piece, it's that Catania has no clue what he's talking about and just likes to fly off the handle at the suggestion that the school system has changed something without his permission.

It's obvious that the reason they wanted to use the old grading system is so that they wouldn't have the misinformed, math/stats-deficient reporting corps and school critics in DC screaming about how test scores had gone down. And Catania managed to basically prove their point with his criticism.

by MLD on Oct 4, 2013 12:18 pm • linkreport

If there's one thing to be noted from this piece, it's that Catania has no clue what he's talking about and just likes to fly off the handle at the suggestion that the school system has changed something without his permission.

While I think CM Catania went too far in personally criticizing Dir of Data Mgmt Jeff Noel at OSSE, he made several important points.

He was absolutely correct that OSSE should have put caveats in their reports about the limited comparability of 2012 and 2013.

And he was also absolutely correct in chastising the Chief of Staff and Superintendent for not understanding or being remotely involved in or concerned with the cut score setting process. We need to start putting actual education policy people into the role of Superintendent (like Deb Gist was) and Chief of Staff, not mayoral loyalists and managers.

by Ken Archer on Oct 4, 2013 12:33 pm • linkreport

I wrote to 11 city officials and ONE cared about the fact that kids could be removed from schools with no ID and no concern. David was the only one who cared. Who else is talking about education? That thief and liar Michelle Rhee? Our useless State Board of Ed?

Here's the kicker.... David is liked east of the river. Health care folks... people are clamoring for health care. I've never worked on a Democratic campaign, but I'd toss everything in to elect him... And there are plenty of others like me.

by MikeR on Oct 4, 2013 12:36 pm • linkreport

I agree with MLD here 100%. Well said. CM Catania did a cannonball into the deep end of the pool without his floaties on. OSSE used the cut scores that are *more* comparable with prior standards, not less. His pearl-clutching hearing performance was pathetic. The subsequent release of the alternative proficiency levels illustrated how trivial the whole issue is that he got so worked up about.

On the other hand, I was blown away by the revelation in this blog of how the test vendor actually "helps" its customer, the state agency, set cut scores. Yikes. I didn't trust proficiency levels or trends much before, now I really don't like them.

by Ward 1 Guy on Oct 4, 2013 5:17 pm • linkreport

Ken, you advocate annual growth in test scores as a preferred methodology. You write:
"Tests that are scored for "growth," on the other hand, use averages based on all participants' scores and compare them to previous years' averages."

But the methodology you prefer compares this year's students to last year's students for the same teacher or school -- same teacher, different students. And students are not randomly assigned. Is this really a fair or accurate approach?

The fact is that none of these statistical methods are valid for high-stakes decisions. National education testing experts warn against this. The data is useful and interesting, but DCPS goes way too far in thinking that the test score analysis somehow holds the truth. We need to stop basing judgments about schools, teachers, and high-stakes decisions on standardized test scores. Period. DCPS has become an outlier nationally in the weight it attaches to test scores.

What we have here is a teachable moment. Let's stop playing games with numbers and get back to what education should be about.

by Mark on Oct 7, 2013 11:50 am • linkreport

But the methodology you prefer compares this year's students to last year's students for the same teacher or school -- same teacher, different students.

No, this is not how MGP is calculated.

MGP is calculated by taking each student's growth percentile (that is, how much they gained from one year to the next, relative to students who started at a similar level) and taking the median of those percentiles for the class; that's the "M" in MGP.

The people who design these metrics are professionals. They don't make idiotic and simplistic data decisions like comparing one group of students to an entirely different group of students and saying that's a metric of growth. To think that they do shows an extreme misunderstanding of what the process is.
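For illustration, here's a toy Python sketch of that growth-percentile idea. All numbers are invented, and a simple ranking of raw gains stands in for the real models' conditioning on prior achievement; the median serves as the classroom summary.

```python
from statistics import median

def growth_percentiles(gains):
    """Rank each student's gain against the whole group, as a percentile."""
    n = len(gains)
    return [100 * sum(g < gain for g in gains) / n for gain in gains]

# (prior score, current score) for one hypothetical classroom
scores = [(30, 32), (38, 53), (39, 44), (55, 56), (70, 78)]
gains = [cur - prior for prior, cur in scores]  # [2, 15, 5, 1, 8]

pcts = growth_percentiles(gains)  # [20.0, 80.0, 40.0, 0.0, 60.0]
print(median(pcts))  # classroom summary growth measure: 40.0
```

Note that the student who gained 15 points gets the top growth percentile regardless of whether they crossed a proficiency cut score, which is exactly the behavior the percent-proficient metric fails to reward.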

by MLD on Oct 7, 2013 12:21 pm • linkreport

Seems to me:
1. Cut scores on any test like this are judgement calls -- supposedly by experts, but judgement calls nonetheless, made by people who have vested interests in making things look good.
2. There is no basis in reality for claiming "historic" gains on the 2013 test as did Gray and numerous other officials.
3. Numerous experts on testing have testified and written that using these test scores as 'value-added' metrics (VAM) for high-stakes decisions on teachers' and administrators' careers is not what the tests were designed to do in the first place.
4. The reliability of VAM from year to year for the same teacher, teaching the same grade level to similar but different kids, is extremely low. Teachers can have (and often do have) really great numbers one year and really bad ones the next, or vice versa. Seems to me that VAM is closer to a random-number generator than anything useful.

by guy on Oct 7, 2013 1:41 pm • linkreport

