What Value-Added Research Does And Does Not Show
Value-added and other types of growth models are probably the most controversial issue in education today. These methods, which use sophisticated statistical techniques to attempt to isolate a teacher’s effect on student test score growth, are rapidly assuming a central role in policy, particularly in the new teacher evaluation systems currently being designed and implemented. Proponents view them as a primary tool for differentiating teachers based on performance/effectiveness.
Opponents, on the other hand, including a great many teachers, argue that the models’ estimates are unstable over time, subject to bias and imprecision, and that they rely entirely on standardized test scores, which are, at best, an extremely partial measure of student performance. Many have come to view growth models as exemplifying all that’s wrong with the market-based approach to education policy.
It’s very easy to understand this frustration. But it’s also important to separate the research on value-added from the manner in which the estimates are being used. Virtually all of the contention pertains to the latter, not the former. Actually, you would be hard-pressed to find many solid findings in the value-added literature that wouldn’t ring true to most educators.
For example, the most prominent conclusion of this body of evidence is that teachers are very important, that there’s a big difference between effective and ineffective teachers, and that whatever is responsible for all this variation is very difficult to measure (see here, here, here and here). These analyses use test scores not as judge and jury, but as a reasonable substitute for “real learning,” with which one might draw inferences about the overall distribution of “real teacher effects.”
And then there are all the peripheral contributions to understanding that this line of work has made, including (but not limited to):
- That experience does matter;
- That the quality of peers affects teacher performance;
- That teachers perform differently in different schools;
- And that students’ backgrounds explain more of the variation in their performance than school-related factors.
Prior to the proliferation of growth models, most of these conclusions were already known to teachers and to education researchers, but research in this field has helped to validate and elaborate on them. That’s what good social science is supposed to do.
What this body of research does not show, however, is that it’s a good idea to use value-added and other growth model estimates as heavily weighted components in teacher evaluations or other personnel-related systems. There is, to my knowledge, not a shred of evidence that doing so will improve either teaching or learning, and anyone who says otherwise is misinformed. It’s an open question.*
As has been discussed before, there is a big difference between demonstrating that teachers matter overall – that their test-based effects vary widely, and in a manner that is not just random – and being able to accurately identify the “good” and “bad” performers at the level of individual teachers. Frankly, to whatever degree the value-added literature provides tentative guidance on how these estimates might be used productively in actual policies, it suggests that, in most states and districts, it is being done in a disturbingly ill-advised manner.
For instance, the research is very clear that the scores for individual teachers are subject to substantial random error and systematic bias, both of which can be mitigated with larger samples (teachers who have taught more students). Yet most states have taken no steps to account for random error when incorporating these estimates into evaluations, nor have any but a precious few set meaningful sample size requirements. These omissions essentially ensure that many teachers’ scores will be too imprecise to be useful, and that most teachers’ estimates will be, at the very least, interpreted improperly.
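To make the sample size point concrete, here is a rough sketch of how random error alone shrinks as a teacher's student count grows. It assumes a deliberately simplified setup in which a teacher's score is just the average of independent student gains. Real value-added models are considerably more complex, and the function names and numbers below are hypothetical, chosen only for illustration.

```python
import math


def standard_error(student_sd: float, n_students: int) -> float:
    """Standard error of a mean gain across n independent students.

    Simplifying assumption: the teacher's estimate is a plain average of
    student gains, so its random error is student_sd / sqrt(n_students).
    """
    if n_students <= 0:
        raise ValueError("n_students must be positive")
    return student_sd / math.sqrt(n_students)


def confidence_interval(estimate: float, student_sd: float,
                        n_students: int, z: float = 1.96) -> tuple:
    """Approximate 95 percent interval around a teacher's estimate."""
    margin = z * standard_error(student_sd, n_students)
    return (estimate - margin, estimate + margin)


if __name__ == "__main__":
    # Hypothetical values: an estimated effect of 0.10 student-level
    # standard deviations, with student gains having a spread of 1.0.
    for n in (25, 100, 400):
        low, high = confidence_interval(0.10, 1.0, n)
        print(f"{n:>3} students: interval from {low:+.3f} to {high:+.3f}")
```

Under these assumptions, the same estimate cannot be distinguished from zero with one small class, but can be with several years of students. This speaks only to random error; it says nothing about systematic bias, which a sketch this simple does not model.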
The evidence is also clear that different growth models yield different results for the same teacher, yet some states have chosen models that are less appropriate for identifying teachers’ causal effects on testing outcomes.
Finally, making all of this worse, most states are mandating that these scores count for 40-50 percent of tested teachers’ evaluations without any clue what the other components (e.g., observations) will be and how much they will vary. Many have even violated the most basic policy principles – for example, by refusing to mandate a pilot year before full implementation.
For these (and other) reasons, opponents of value-added have every reason to be skeptical of the current push to use these estimates in high-stakes decisions, and of the clumsy efforts among some advocacy organizations to erect an empirical justification for doing so. Not only is there no evidence that using these measures in high-stakes decisions will generate improvements, but the details of how it’s being done are, in most places, seemingly being ignored in a most careless, risky fashion. It’s these details that will determine whether the estimates are helpful.
Nevertheless, those who are (understandably) compelled to be dismissive or even hostile toward the research on value-added should consider that this line of work is about understanding teaching and learning, not personnel policies. Research and data are what they are; it’s how you use them that matters. And it’s unfortunate that many self-proclaimed advocates of “data-driven decisionmaking” seem more interested in starting to make decisions than in the proper use of data.
- Matt Di Carlo
* It’s true that, in many cases, the researchers provide concrete policy recommendations, but they tend to be cautious and flanked by caveats. Moreover, there are many papers in the field that do address directly the suitability of value-added estimates for use in concrete policies such as tenure decisions and layoffs, but the conclusions of these analyses are typically cautious and speculative, and none can foresee how things will play out in actual, high-stakes implementation.
This blog post has been shared by permission from the author.
The views expressed by the blogger are not necessarily those of NEPC.