Skip to main content

VAMboozled!: Why Standardized Tests Should Not Be Used to Evaluate Teachers (and Teacher Education Programs)

David C. Berliner, Regents’ Professor Emeritus here at Arizona State University (ASU), who also just happens to be my former albeit forever mentor, recently took up research on the use of test scores to evaluate teachers, for example, using value-added models (VAMs). While David is world-renowned for his research in educational psychology, and more specific to this case, his expertise on effective teaching behaviors and how to capture and observe them, he has also now ventured into the VAM-related debates.

Accordingly, he recently presented his newest and soon-to-be-forthcoming published research on using standardized tests to evaluate teachers, something he aptly termed in the title of his presentation “A Policy Fiasco.” He delivered his speech to an audience in Melbourne, Australia, and you can click here for the full video-taped presentation; however, given the whole presentation takes about one hour to watch, although I must say watching the full hour is well worth it, I highlight below what are his highlights and key points. These should certainly be of interest to you all as followers of this blog, and hopefully others.

Of main interest are his 14 reasons, “big and small’ for [his] judgment that assessing teacher competence using standardized achievement tests is nearly worthless.”

Here are his fourteen reasons:

  1. “When using standardized achievement tests as the basis for inferences about the quality of teachers, and the institutions from which they came, it is easy to confuse the effects of sociological variables on standardized test scores” and the effects teachers have on those same scores. Sociological variables (e.g., chronic absenteeism) continue to distort others’ even best attempts to disentangle them from the very instructional variables of interest. This, what we also term as biasing variables, are important not to inappropriately dismiss, as purportedly statistically “controlled for.”
  2. In law, we do not hold people accountable for the actions of others, for example, when a child kills another child and the parents are not charged as guilty. Hence, “[t]he logic of holding [teachers and] schools of education responsible for student achievement does not fit into our system of law or into the moral code subscribed to by most western nations.” Related, should medical school or doctors, for that matter, be held accountable for the health of their patients? One of the best parts of his talk, in fact, is about the medical field and the corollaries Berliner draws between doctors and medical schools, and teachers and colleges of education, respectively (around the 19-25 minute mark of his video presentation).
  3. Professionals are often held harmless for their lower success rates with clients who have observable difficulties in meeting the demands and the expectations of the professionals who attend to them. In medicine again, for example, when working with impoverished patients, “[t]here is precedent for holding [doctors] harmless for their lowest success rates with clients who have observable difficulties in meeting the demands and expectations of the [doctors] who attend to them, but the dispensation we offer to physicians is not offered to teachers.”
  4. There are other quite acceptable sources of data, besides tests, for judging the efficacy of teachers and teacher education programs. “People accept the fact that treatment and medicine may not result in the cure of a disease. Practicing good medicine is the goal, whether or not the patient gets better or lives. It is equally true that competent teaching can occur independent of student learning or of the achievement test scores that serve as proxies for said learning. A teacher can literally “save lives” and not move the metrics used to measure teacher effectiveness.
  5. Reliance on standardized achievement test scores as the source of data about teacher quality will inevitably promote confusion between “successful” instruction and “good” instruction. “Successful” instruction gets test scores up. “Good” instruction leaves lasting impressions, fosters further interest by the students, makes them feel competent in the area, etc. Good instruction is hard to measure, but remains the goal of our finest teachers.
  6. Related, teachers affect individual students greatly, but affect standardized achievement test scores very little. All can think of how their own teachers impacted their lives in ways that cannot be captured on a standardized achievement test.  Standardized achievement test scores are much more related to home, neighborhood and cohort than they are to teachers’ instructional capabilities. In more contemporary terms, this is also due the fact that large-scale standardized tests have (still) never been validated to measure student growth over time, nor have they been validated to attribute that growth to teachers. “Teachers have huge effects, it’s just that the tests are not sensitive to them.”
  7. Teacher’s effects on standardized achievement test scores fade quickly, barely discernable after a few years. So we might not want to overly worry about most teachers’ effects on their students—good or bad—as they are hard to detect on tests after two or so years. To use these ephemeral effects to then hold teacher education programs accountable seems even more problematic.
  8. Observational measures of teacher competency and achievement tests of teacher competency do not correlate well. This suggest nothing more than that one or both of these measures, and likely the latter, are malfunctioning in their capacities to measure the teacher effectiveness construct. See other Vamboozled posts about this here, here, and here.
  9. Different standardized achievement tests, both purporting to measure reading, mathematics, or science at the same grade level, will give different estimates of teacher competency. That is because different test developers have different visions of what it means to be competent in each of these subject areas. Thus one achievement test in these subject areas could find a teacher exemplary, but another test of those same subject areas would find the teacher lacking. What then? Have we an unstable teacher or an ill-defined subject area?
  10. Tests can be administered early or late in the fall, early or late in the spring, and the dates they are given influence the judgments about whether a teacher is performing well or poorly. Teacher competency should not be determined by minor differences in the date of testing, but that happens frequently.
  11. No standardized achievement tests have provided proof that their items are instructionally sensitive. If test items do not, because they cannot “react to good instruction,” how can one make a claim that the test items are “tapping good instruction?”
  12. Teacher effects show up more dramatically on teacher made tests than on standardized achievement tests because the former are based on the enacted curriculum, while the latter are based on the desired curriculum. You get seven times more instructionally sensitive tests the closer the test is to the classroom (i.e., teacher made tests).
  13. The opt-out testing movement invalidates inferences about teachers and schools that can be made from standardized achievement test results. Its not bad to remove these kids from taking these tests, and perhaps it is even necessary in our over-tested schools, but the tests and the VAM estimates derived via these tests, are far less valid when that happens. This is because the students who opt out are likely different in significant ways from those who do take the tests. This severely limits the validity claims that are made.

Assessing new teachers with standardized achievement tests is likely to yield many false negatives. That is, the assessments would identify teachers early in their careers as ineffective in improving test scores, which is, in fact, often the case for new teachers. Two or three years later that could change. Perhaps the last thing we want to do in a time of teacher shortage is discourage new teachers while they acquire their skills.

This blog post has been shared by permission from the author.
Readers wishing to comment on the content are encouraged to do so via the link to the original post.
Find the original post here:

The views expressed by the blogger are not necessarily those of NEPC.

Audrey Amrein-Beardsley

Audrey Amrein-Beardsley, a former middle- and high-school mathematics teacher, received her Ph.D. in 2002 from Arizona State University (ASU) from the Division of...