VAM Gets Slammed: Teacher Evaluation Not a Game of Chance

Teachers in New York City have been slammed by the publication of VAM ratings for 18,000 of them. But something new is happening this time, and it is not what proponents of using these ratings for evaluation intended or wanted. People are actually delving into the data to see what it shows.

This week Teach For America founder and CEO Wendy Kopp became the latest advocate of VAM to denounce the publication of scores in the press, and the associated public scorning of the "bad teachers" they supposedly revealed.

In taking this stand, she joined one of her chief sponsors, Bill Gates. His foundation has spent millions on developing and advocating for the use of VAM in teacher evaluations. Why are these advocates of VAM so disturbed?

According to Kopp and Gates, we do not get anywhere through "public humiliation."

So how do they believe these scores ought to be used?

Kopp writes:

So-called value-added rankings--which rank teachers according to the recorded growth in their students' test scores--are an important indicator of teacher effectiveness, but making them public is counterproductive to helping teachers improve. Doing so doesn't help teachers feel safe and respected, which is necessary if they are going to provide our kids with the positive energy and environment we all hope for.

We should make individual teacher ratings available to school principals to inform their work recruiting and developing teaching faculties, but releasing them publicly undermines the trust they need to build strong, collaborative teams.

This is remarkably vague, but I gather she means that principals should be using this data for evaluative purposes behind closed doors within the school.

Gates took a similar position a couple of weeks ago, writing:

Value-added ratings are one important piece of a complete personnel system. But student test scores alone aren't a sensitive enough measure to gauge effective teaching, nor are they diagnostic enough to identify areas of improvement.

Meanwhile, across the country, states and school districts are preparing to implement new evaluation systems that correspond to this design. In New York, Governor Cuomo sponsored a new law that will have 40% of a teacher's evaluation based on what he terms "objective" VAM ratings. In this chart released by Cuomo's office a few weeks ago, New York is shown to be one of 22 states now using test scores to rate teachers:

But the idea that this data has any constructive role at all is taking a beating, along with the unfortunate teachers being pilloried by the press.

Blogger and math teacher Gary Rubinstein has done an amazing job analyzing the data that has been made public in New York. Here are some of his findings.

In his first attempt, Gary worked with a basic assumption. If VAM scores are at all accurate, there ought to be a significant correlation between a teacher's score one year compared to the next. In other words, good teachers should have somewhat consistently higher scores, and poor teachers ought to remain poor. He created a scatter plot that put the ratings from 2009 on one axis, and the ratings from 2010 on the other axis. What should we expect here? If there is a correlation, we should see some sort of upward sloping line. Here is what Gary got:

Why are the dots all over the place like a Jackson Pollock painting? Because 50% of the teachers had a 21-point swing in one direction or another.

Why is this? Well, it might be due to the fact that our students vary, and not even the complex formulas used to calculate VAM ratings can account for all the variation between students, and how those dynamics play out in a given class. And many of these things are beyond a teacher's control.
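For readers who want to try this kind of check themselves, here is a minimal sketch of the year-to-year comparison described above. It is not Gary's code and it does not use the New York data. The function names `pearson` and `median_swing` are hypothetical, and the ratings are synthetic numbers generated only to contrast a stable measure with a noisy one.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def median_swing(xs, ys):
    """Median absolute change per teacher: half of teachers moved at least this far."""
    return statistics.median(abs(y - x) for x, y in zip(xs, ys))


# Synthetic percentile ratings (assumption: made-up data, not real teacher scores).
rng = random.Random(0)
year_one = [rng.uniform(0, 100) for _ in range(2000)]

# A stable measure: next year's rating is this year's plus a small wobble.
stable_year_two = [min(100.0, max(0.0, r + rng.gauss(0, 5))) for r in year_one]

# A noisy measure: next year's rating is unrelated to this year's.
noisy_year_two = [rng.uniform(0, 100) for _ in year_one]

for label, year_two in [("stable", stable_year_two), ("noisy", noisy_year_two)]:
    print(label,
          "correlation:", round(pearson(year_one, year_two), 2),
          "median swing:", round(median_swing(year_one, year_two), 1))
```

With a stable measure the correlation comes out close to 1 and the typical swing is small. With a noisy one the correlation sits near 0 and the typical swing is large, which is the pattern a scattered plot suggests.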

For his second data run, Gary decided to focus on an interesting aspect of the data. Many teachers have two VAM scores for the same year, because they teach different subjects to the same students or because they teach two different grade levels of the same subject. Once again, we should see an upward sloping line.

This graph shows the scores of two different grade levels of students with the same teacher.

And for his third run, Gary compared three groups of schools: regular public schools, shown in blue; KIPP charter schools, in yellow; and non-KIPP charters, shown in red. You can see from this graph that the charters are not doing any better at "adding value" than are the regular public schools. Furthermore, the fact that charters all show up near the 50th percentile mark shows that these schools are not, as is often claimed, serving a high proportion of low-performing students.

Gary has lots more insights in his analyses, and they are worth reading.

There is one huge takeaway from all this. VAM ratings are not an accurate reflection of a teacher's performance, even on the narrow indicators on which they focus. If an indicator is unreliable, it is a farce to call it "objective."

This travesty has the effect of discrediting the whole idea of using test score data to drive reform. What does it say about "reformers" when they are willing to base a large part of teacher and principal evaluations on such an indicator?

How should we be looking at student learning in the context of teacher evaluation? Student learning ought to be at the heart of a strong evaluation process. But we have to be very careful with how we define and measure this. My next post will focus on this question.

What do you think? What does this show us about Value Added Models and education reform?

All graphs are by Gary Rubinstein, used with his permission.