Research Study Shows L.A. Times Teacher Ratings Are Neither Reliable Nor Valid

New Research Shows Serious Flaws in the Research Behind the L.A. Times’
Controversial Ratings of Individual Teacher Performance

Contact:
Derek Briggs, University of Colorado at Boulder, (303) 492-6320, derek.briggs@colorado.edu
William Mathis, NEPC, (802) 383-0058, William.Mathis@colorado.edu

BOULDER, CO (February 8, 2011) – A new study published today by the National Education Policy Center finds that the research on which the Los Angeles Times relied for its teacher effectiveness reporting was demonstrably inadequate to support the published rankings.

Due Diligence and the Evaluation of Teachers, by Derek Briggs and Ben Domingue of the University of Colorado at Boulder, used the same L.A. Unified School District (LAUSD) dataset and replicated the methods of the Times’ researcher, but then probed deeper and found the earlier research to have serious weaknesses. Based on the results of the Briggs and Domingue research, NEPC director Kevin Welner said, “This study makes it clear that the L.A. Times and its research team have done a disservice to the teachers, students, and parents of Los Angeles. The Times owes its community a better accounting for its decision to publish the names and rankings of individual teachers when it knew or should have known that those rankings were based on a questionable analysis. In any case, the Times now owes its community an acknowledgment of the tremendous weakness of the results reported and an apology for the damage its reporting has done.”

In August 2010 the Los Angeles Times published ratings that purported to show the teaching effectiveness of individual Los Angeles teachers. The teachers’ ratings were based on an analysis of their students’ performance on California state standardized reading and math tests. The researcher hired by the Times, Richard Buddin of the RAND Corporation (who conducted the work as a project independent of RAND itself), also published his work as a “white paper,” which provided the template from which Briggs and Domingue worked.

Buddin used a relatively simple value-added model to assess individual teacher performance for the period from 2003 to 2009. He found significant variability in LAUSD teacher quality, as demonstrated by student performance on standardized tests in reading and math, and he concluded that differences between “high-performing” and “low-performing” teachers accounted for differences in student performance. Yet, as Briggs and Domingue explain, simply finding that a value-added model yields different outcomes for different teachers does not tell us whether those outcomes are measuring what is important (teacher effectiveness) or something else, such as whether students benefit from other learning resources outside of school.

Their research explored whether there was evidence of this kind of bias by conducting what researchers call a “sensitivity analysis” to test whether the results from the L.A. Times model were valid and reliable. First, they investigated whether, when using the L.A. Times model, a student’s teacher in the future would appear to have an effect on a student’s test performance in the past—something that is logically impossible and a sign that the model is flawed. This is analogous to using a value-added model to isolate the effect of an NBA coach on the performance of his players.

At first glance we might not be surprised when the model indicates that Phil Jackson is an effective coach. But if the same model could also be used to indicate that Phil Jackson improved Kobe Bryant’s performance when he was in high school, we might wonder whether the model was truly able to separate Jackson’s ability as a coach from his good fortune at being surrounded by extremely talented players. Briggs and Domingue found strong evidence of these illogical results when using the L.A. Times model, especially for reading outcomes: “Because our sensitivity test did show this sort of backwards prediction, we can conclude that estimates of teacher effectiveness in LAUSD are a biased proxy for teacher quality.”
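The logic of this backwards-prediction check can be illustrated with a deliberately simplified sketch. It is not the regression model used in the study: it only groups students by the teacher they will have next year and compares the average gains those students made before ever meeting that teacher. The function name, the teacher labels, and all numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def spread_of_past_gains(records):
    """Return the gap between the highest and lowest average PAST gain,
    grouping students by their FUTURE teacher.

    records: iterable of (future_teacher, past_gain) pairs (hypothetical data).
    A future teacher cannot cause past gains, so a large gap points to
    non-random sorting of students rather than to a teacher effect.
    """
    gains_by_teacher = defaultdict(list)
    for future_teacher, past_gain in records:
        gains_by_teacher[future_teacher].append(past_gain)
    averages = [mean(gains) for gains in gains_by_teacher.values()]
    return max(averages) - min(averages)


# Students who already gained the most are all routed to teacher T1.
sorted_assignment = [("T1", 8), ("T1", 9), ("T1", 10),
                     ("T2", 1), ("T2", 2), ("T2", 3)]

# The same kind of students, spread evenly across both teachers.
mixed_assignment = [("T1", 8), ("T1", 2), ("T1", 5),
                    ("T2", 9), ("T2", 1), ("T2", 5)]

sorted_gap = spread_of_past_gains(sorted_assignment)
mixed_gap = spread_of_past_gains(mixed_assignment)
```

In the sorted case the future teacher appears to "explain" a large difference in past gains, which is the kind of illogical result the researchers treated as a warning sign. In the mixed case the gap disappears.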

Next, they developed an alternative, arguably stronger value-added model and compared the results to the L.A. Times model. In addition to the variables used in the Times’ approach, they controlled for (1) a longer history of a student’s test performance, (2) peer influence, and (3) school-level factors. If the L.A. Times model were perfectly accurate, there would be no difference in results between the two models. But this was not the case. For reading outcomes, their findings included the following:

More than half (53.6%) of the teachers had a different effectiveness rating under the alternative model.
Among those who changed effectiveness ratings, some moved only moderately, but 8.1% of those teachers identified as “more” or “most” effective under the alternative model are identified as “less” or “least” effective in the L.A. Times model, and 12.6% of those identified as relatively ineffective under the alternative model are identified as effective by the L.A. Times model.

The math outcomes weren’t quite as troubling, but the findings included the following:

Only 60.8% of teachers would retain the same effectiveness rating under both models.
Among those who did change effectiveness ratings, some moved only moderately, but 1.4% of those teachers identified as effective under the alternative model are identified as ineffective in the L.A. Times model, and 2.7% would go from a rating of ineffective under the alternative model to effective under the L.A. Times model.

Accordingly, the effects estimated for LAUSD teachers can be quite sensitive to choices concerning the underlying statistical model. The choice of one reasonable model would lead to very different conclusions about individual teachers than would the choice of a different reasonable model.
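The kind of model-to-model comparison summarized above can be sketched in a few lines. This is an illustration only, assuming each teacher receives a rating from 1 (least effective) to 5 (most effective) under each model. The function name and the ten sample ratings are hypothetical and are not drawn from the LAUSD data.

```python
def compare_ratings(times_ratings, alt_ratings):
    """Compare two aligned lists of ratings, 1 (least) to 5 (most effective).

    Returns three shares:
      same      - teachers with the identical rating under both models
      flip_down - of teachers rated 4-5 by the alternative model,
                  the share rated 1-2 by the Times-style model
      flip_up   - of teachers rated 1-2 by the alternative model,
                  the share rated 4-5 by the Times-style model
    """
    pairs = list(zip(times_ratings, alt_ratings))

    def share(hits, total):
        # Guard against an empty group rather than dividing by zero.
        return hits / total if total else 0.0

    same = share(sum(t == a for t, a in pairs), len(pairs))

    alt_effective = [t for t, a in pairs if a >= 4]
    flip_down = share(sum(t <= 2 for t in alt_effective), len(alt_effective))

    alt_ineffective = [t for t, a in pairs if a <= 2]
    flip_up = share(sum(t >= 4 for t in alt_ineffective), len(alt_ineffective))

    return same, flip_down, flip_up


# Hypothetical ratings for ten teachers under each model.
times_ratings = [1, 2, 3, 4, 5, 2, 4, 3, 5, 1]
alt_ratings = [1, 3, 3, 5, 5, 4, 2, 3, 4, 2]

same, flip_down, flip_up = compare_ratings(times_ratings, alt_ratings)
```

If the two models agreed perfectly, the first share would be 1.0 and the two flip shares would be 0.0. The further the results fall from that, the more a teacher's published rating depends on which model was chosen.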

Briggs and Domingue then examined the precision of Buddin’s teacher-effect estimates – whether the approach can be used to reliably distinguish between teachers given different value-added ratings. They found that between 43% and 52% of teachers cannot be distinguished from a teacher of “average” effectiveness, once the specific value-added estimate for each teacher is bounded by a 95% confidence interval. Because the L.A. Times did not use this more conservative approach to distinguish teachers when rating them as effective or ineffective, it is likely that there are a significant number of false positives (teachers rated as effective who are really average), and false negatives (teachers rated as ineffective who are really average) in the L.A. Times’ rating system.
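A minimal sketch of the confidence-interval approach described above follows. It assumes each teacher's value-added estimate is expressed relative to an average of zero and comes with a standard error, and it uses the usual normal-approximation multiplier of 1.96 for a 95% interval. The class, the function, and the four sample teachers are hypothetical.

```python
from typing import NamedTuple

Z_95 = 1.96  # normal-approximation multiplier for a two-sided 95% interval


class Estimate(NamedTuple):
    teacher: str
    effect: float     # value-added estimate, with 0 meaning "average"
    std_error: float  # standard error of that estimate


def classify(est: Estimate, z: float = Z_95) -> str:
    """Label a teacher only if the whole interval lies on one side of average."""
    lower = est.effect - z * est.std_error
    upper = est.effect + z * est.std_error
    if lower > 0:
        return "above average"
    if upper < 0:
        return "below average"
    return "indistinguishable from average"


# Hypothetical estimates, not taken from the LAUSD data.
estimates = [
    Estimate("Teacher A", 0.30, 0.10),
    Estimate("Teacher B", 0.10, 0.10),
    Estimate("Teacher C", -0.25, 0.08),
    Estimate("Teacher D", -0.05, 0.12),
]

labels = {est.teacher: classify(est) for est in estimates}
```

Under this rule, Teachers B and D would not be labeled effective or ineffective, because their intervals include the average. A rating system that ignores the interval would still place them in a category based on the point estimate alone.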

Using the Times’ approach of including only teachers with 60 or more students, the likely rate of misclassification was 22% (for reading) and 14% (for math).

The new report also finds evidence that conflicts with Buddin’s finding that traditional teacher qualifications have no association with student outcomes. In fact, the researchers found significant and meaningful associations between value-added estimates of teachers’ effectiveness and their experience and educational background.

Yesterday, Monday, February 7, 2011, the Times published a story about this new study. That story included false statements and was generally misleading. Accordingly, along with this study’s release, we are publishing a “Fact Sheet” about the Times’ new article (http://nepc.colorado.edu/sites/default/files/FactSheet_0.pdf).

Find Due Diligence and the Evaluation of Teachers, by Derek Briggs and Ben Domingue, on the web at: http://nepc.colorado.edu/publication/due-diligence

The mission of the National Education Policy Center is to produce and disseminate high-quality, peer-reviewed research to inform education policy discussions. We are guided by the belief that the democratic governance of public education is strengthened when policies are based on sound evidence.

For more information on NEPC, please visit http://nepc.colorado.edu/. This research brief was made possible in part by the support of the Great Lakes Center for Education Research and Practice (greatlakescenter.org).