Why Test Scores CAN'T Evaluate Teachers

Jersey Jazzman

June 13, 2013

I'm going to have more to say about Jonah Rockoff's testimony before the New Jersey State Board of Education last week. See here and here for previous posts; Bruce Baker also opines in a must-read post.

But right now, I want to use a specific part of Dr. Rockoff's presentation to address a very serious problem with the entire notion of test-based teacher accountability. Keep in mind that Rockoff talks about Student Growth Percentiles (SGPs), but the problem extends to just about any use of Value-Added Modeling (VAM) in teacher evaluation based on test scores. Go to 1:07 in the clip:

The key element here that distinguishes Student Growth Percentiles from some of the other things that people have used in research is the use of percentiles. It's there in the title, so you'd expect it to have something to do with percentiles. What does that mean? It means that these measures are scale-free. They get away from psychometric scaling in a way that many researchers - not all, but many - say is important.

Now these researchers are not psychometricians, who aren't arguing against the scale. The psychometricians who create our tests, they create a scale, and they use scientific formulae and theories and models to come up with a scale. It's like on the SAT, you can get between 200 and 800. And the idea there is that the difference in the learning or achievement between a 200 and a 300 is the same as between a 700 and an 800.

There is no proof that that is true. There is no proof that that is true. There can't be any proof that is true. But, if you believe their model, then you would agree that that's a good estimate to make. There are a lot of people who argue... they don't trust those scales. And they'd rather use percentiles because it gets them away from the scale.

Let's state this another way so we're absolutely clear: there is, according to Jonah Rockoff, no proof that a gain on a state test like the NJASK from 150 to 160 represents the same amount of "growth" in learning as a gain from 250 to 260. If two students have the same numeric growth but start at different places, there is no proof that their "growth" is equivalent.

Now there's a corollary to this, and it's important: you also can't say that two students who have different numeric levels of "growth" are actually equivalent. I mean, if we don't know whether the same numerical gains at different points on the scale are really equivalent, how can we know whether one is actually "better" or "worse"? And if that's true, how can we possibly compare different numerical gains?
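To make the "scale-free" point concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the five scores are invented (they are not NJASK data), and `rescale` is a hypothetical alternative scale, chosen only because it is strictly increasing. It shows that two gains that look equal on one scale are no longer equal on another, while percentile ranks do not change at all.

```python
import math


def percentile_rank(score, group):
    """Percent of the group scoring strictly below `score`."""
    below = sum(1 for s in group if s < score)
    return 100.0 * below / len(group)


def rescale(score):
    """Hypothetical alternative scale: any strictly increasing function works."""
    return 100.0 * math.sqrt(score)


# Invented scale scores, for illustration only.
scores = [150, 160, 200, 250, 260]

# On the original scale, both gains are 10 points.
gain_low = 160 - 150
gain_high = 260 - 250

# On the hypothetical alternative scale, the "same" gains differ.
rescaled_gain_low = rescale(160) - rescale(150)
rescaled_gain_high = rescale(260) - rescale(250)

# Percentile ranks depend only on order, so rescaling leaves them unchanged.
ranks_original = [percentile_rank(s, scores) for s in scores]
rescaled_scores = [rescale(s) for s in scores]
ranks_rescaled = [percentile_rank(r, rescaled_scores) for r in rescaled_scores]

print(gain_low == gain_high)                    # equal on the original scale
print(rescaled_gain_low > rescaled_gain_high)   # unequal after rescaling
print(ranks_original == ranks_rescaled)         # ranks are untouched
```

Nothing here says which scale is the "right" one; the sketch only shows that comparing gains requires trusting a particular scale, and ranking does not.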

Keep this in mind as we, once again, go through a thought exercise with our friend, Jenny. You may remember from previous posts (here and here) that Jenny is a hypothetical 4th grader who just took the NJASK-4; we're looking to see the implications of Jenny's subsequent SGP. Here's how Jenny "grew" from last year:

Jenny scored a 256 on the NJASK-3 last year, and a 261 on the NJASK-4 this year. She "grew" 5 points over the year (keep in mind that, because the 4th grade test is "harder" than the 3rd grade test, a student can "grow" even if her score drops).

Because she scored a 256 in 3rd grade, Jenny was placed in a group of her peers to calculate her SGP; they all also scored a 256 last year. How did they do?
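The peer-group idea in this thought exercise can be sketched in a few lines. This is a simplification that matches the exercise, not the state's actual procedure (operational SGPs are typically estimated with quantile regression over prior scores). The function name `simple_growth_percentile` is hypothetical, and apart from Jenny's 261 the peer scores are invented; they are not the values from the author's chart.

```python
def simple_growth_percentile(current_score, peer_current_scores):
    """Percent of same-prior-score peers who scored strictly below this student.

    A simplified stand-in for an SGP, for illustration only.
    """
    below = sum(1 for s in peer_current_scores if s < current_score)
    return round(100 * below / len(peer_current_scores))


# Hypothetical current-year scores for ten students who all scored 256 last year.
# Only Jenny's 261 comes from the post; the rest are invented.
peers_of_256 = [216, 240, 250, 254, 257, 259, 260, 261, 263, 266]

jenny_percentile = simple_growth_percentile(261, peers_of_256)
print(jenny_percentile)
```

Note that the only input is the ordering of this year's scores within the group; the size of Jenny's gain never enters the calculation.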

Let's note a couple of things: first, the distribution of growth is not what statisticians would call "normal." The distance between Jenny and Julio is only 5 points; the distance between Brittney and Susie is 38 points. But remember also what Rockoff implied: there's no way to compare those two differences. It may well be that it's "easier" to move from Brittney's spot to Susie's than it is to move from Jenny's to Julio's - we just don't know.

Let's look at another student: Angela, who scored a 150 last year. How did she do?

Angela improved her score by 25 points. That sounds wonderful... until we think about what Rockoff said: we can't compare Angela's 25-point gain to that of someone else, like Jenny, who started in a different place.

Let's look at Angela's "peers":

The range of Angela's cohort is greater than Jenny's: 60 points, as opposed to 50 points for Jenny's "peers." And the distributions are different: look at the differences in growth between the different percentiles in each cohort. Jenny was only 5 points away from the top in her group (there may be a ceiling effect), but Angela is 25 points away from the highest score of her peers.

Here's the thing: let's assume that Angela and Jenny have the median SGPs for their classes. In that case, their SGPs have determined that Angela and Jenny "grew" the same amount for the purposes of evaluating their teachers. Which raises a question:

Is it at all accurate to say that Jenny and Angela "grew" the same amount - and, consequently, that their teachers are equally effective? Well...

They have different raw scores, so we know they are not at the same level of achievement.

Each "grew" a different numeric amount on their scores. But even if they had "grown" the same amount, remember what Rockoff said: there is no proof that those gains would represent the same amount of learning. In the same way, we have no proof these different gains represent equivalent amounts of learning.

The SGPs are saying Angela and Jenny "grew" the same amount. But we have no proof that this is true! And not only that...

The SGP hides the different distances from the top, the bottom, and every other position in the distribution for both girls. Which I guess doesn't matter, because the measures can't be compared anyway!

Which leaves us here:

The SGP tells us only one thing: where Jenny and Angela are in relationship to their "peers" when everyone is forced onto a percentile ranking. But look at all the information that's hidden:

The raw score, or achievement levels.
The numeric growth.
The "actual" growth in learning.
The range of growth within the group of peers.
The distribution of growth within the peers.
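A small sketch can show how much a single percentile leaves out. The two peer groups below are invented for illustration and are not the author's charts; only Jenny's 256 and 261, Angela's 150 and her 25-point gain, and the 50-point and 60-point ranges come from the post. The helper names `percentile_rank` and `summarize` are hypothetical.

```python
def percentile_rank(score, group):
    """Percent of the group scoring strictly below `score`."""
    below = sum(1 for s in group if s < score)
    return 100.0 * below / len(group)


def summarize(name, prior, current, peers):
    """Collect the percentile next to the details that the percentile hides."""
    return {
        "student": name,
        "percentile": percentile_rank(current, peers),
        "numeric_gain": current - prior,
        "peer_range": max(peers) - min(peers),
        "points_below_top": max(peers) - current,
    }


# Invented current-year scores for each student's same-prior-score "peers".
jenny_peers = [216, 240, 250, 254, 257, 259, 260, 261, 263, 266]
angela_peers = [140, 150, 158, 163, 167, 170, 172, 175, 185, 200]

jenny = summarize("Jenny", 256, 261, jenny_peers)
angela = summarize("Angela", 150, 175, angela_peers)

for row in (jenny, angela):
    print(row)
```

In this made-up data the two students land on the same percentile even though their numeric gains, the spread of their peer groups, and their distances from the top all differ, and none of those differences appears in the percentile itself.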

And yet, even though the SGP tells us nothing about all of this information, the NJDOE (substitute your reformy state education department) confidently tells us that Jenny's and Angela's teachers are equally effective. Perhaps they believe this because they think SGPs "get away from the scales."

Except it's clear they don't do this at all - they just create another scale that is just as suspect. More to come...

This blog post has been shared by permission from the author.
Readers wishing to comment on the content are encouraged to do so via the link to the original post.
Find the original post here:

Jersey Jazzman Blog

The views expressed by the blogger are not necessarily those of NEPC.