Gates Still Doesn’t Get It! Trapped in a World of Circular Reasoning & Flawed Frameworks

Bruce D. Baker

January 10, 2013

Not much time for a thorough review of the most recent release of the Gates MET project, but here are my first-cut comments on the major problems with the report. The take-home argument of the report seems to be that their proposed teacher evaluation models are sufficiently reliable for prime time use and that the preferred model should include about 33 to 50% test-score-based statistical modeling of teacher effectiveness coupled with at least two observations on every teacher. They come to this conclusion by analyzing data on 3,000 or so teachers across multiple cities. They arrive at the 33 to 50% figure, coupled with two observations, by playing a tradeoff game. They find – as one might expect – that prior value added of a teacher is still the best predictor of itself a year later… but that when the weight on observations is increased, the year to year correlation for the overall rating increases (well, sort of). They still find relatively low correlations between value-added ratings for teachers on state tests and ratings for the same teachers with the same kids on higher order tests.

So, what’s wrong with all of this? Here’s my quick run-down:

1. Self-validating Circular Reasoning

I’ve written several previous posts explaining the absurdity of the general framework of this research, which assumes that the “true indicator of teacher effectiveness” is the following year value-added score. That is, the validity of all other indicators of teacher effectiveness is measured by their correlation to the following year value added (as well as to value-added estimated on alternative tests – with less emphasis on this). Thus, the researchers find – to no freakin’ surprise – that prior year value added is, among all measures, the best predictor of itself a year later. Wow – that’s a revelation!

As a result, any weighting scheme must include a healthy dose of value-added. But, because their “strongest” predictor of itself analysis put too much weight on VAM to be politically palatable, they decided to balance the weighting by considering year to year reliability (regardless of validity).

The hypocrisy of their circular validity test is best revealed in this quote from the study:

Teaching is too complex for any single measure of performance to capture it accurately.

But apparently the validity of any/all other measures can be assessed by the correlation with a single measure (VAM itself)!?????
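To make the circularity concrete, here is a toy simulation. It is a sketch under made-up assumptions, not the MET data or the MET models. It assumes each teacher has a true effectiveness, that value-added estimates also carry a persistent non-teacher component (labeled `bias` here, think classroom composition) plus yearly noise, and that an observation score tracks true effectiveness with its own noise. All names and variances are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_circularity(n_teachers=5000, seed=1):
    """Hypothetical data-generating process; every variance is an assumption."""
    rng = random.Random(seed)
    true_eff = [rng.gauss(0, 1) for _ in range(n_teachers)]
    # Assumed persistent component unrelated to the teacher (same in both years).
    bias = [rng.gauss(0, 1) for _ in range(n_teachers)]
    vam_y1 = [t + b + rng.gauss(0, 1) for t, b in zip(true_eff, bias)]
    vam_y2 = [t + b + rng.gauss(0, 1) for t, b in zip(true_eff, bias)]
    # Assumed observation score: noisy, but free of the persistent bias.
    obs = [t + rng.gauss(0, 1) for t in true_eff]
    return {
        "vam_y1_vs_vam_y2": pearson(vam_y1, vam_y2),
        "obs_vs_vam_y2": pearson(obs, vam_y2),
        "vam_y1_vs_true": pearson(vam_y1, true_eff),
        "obs_vs_true": pearson(obs, true_eff),
    }

if __name__ == "__main__":
    for name, r in simulate_circularity().items():
        print(f"{name}: {r:.2f}")
```

Under these assumptions, prior-year value-added "wins" as the predictor of next-year value-added precisely because the two share the persistent bias, while the observation score is the better indicator of true effectiveness. Whether real value-added estimates behave this way is an empirical question; the point is only that predicting the measure with itself cannot settle it.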

See also:

Evaluating Evaluation Systems

Weak Arguments for Using Weak Indicators

2. Assuming Data Models Used in Practice are of Comparable Quality/Usefulness

I would go so far as to say that it is reckless to assert that the new Gates findings on this relatively select sub-sample of teachers (for whom high quality data were available on all measures over multiple years) have much if any implication for the usefulness of the types of measures and models being implemented across states and districts.

I have discussed the reliability and bias issues in New York City’s relatively rich value-added model on several previous occasions. The NYC model (likely among the “better” VAMs) produces results that are sufficiently noisy from year to year to raise serious questions about their usefulness. Certainly, one should not be making high stakes decisions based heavily on the results of that model. Further, averaging over multiple years means, in many cases, averaging scores that jump from the 30th to 70th percentile and back again. In such cases, averaging doesn’t clarify, it masks. But what the averaging may be masking is largely noise. Averaging noise is unlikely to reveal a true signal!
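A quick way to see what averaging does and doesn't buy you is to simulate it. The sketch below is illustrative only: it is not the NYC model, and the signal and noise standard deviations are assumptions I picked so that noise dominates.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_averaging(n_teachers=5000, signal_sd=1.0, noise_sd=2.0, seed=0):
    """Each yearly estimate = true effect + independent noise (assumed values)."""
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, signal_sd) for _ in range(n_teachers)]
    year1 = [t + rng.gauss(0, noise_sd) for t in true_effect]
    year2 = [t + rng.gauss(0, noise_sd) for t in true_effect]
    two_year_avg = [(a + b) / 2 for a, b in zip(year1, year2)]
    return {
        "year_to_year": pearson(year1, year2),
        "single_year_vs_true": pearson(year1, true_effect),
        "average_vs_true": pearson(two_year_avg, true_effect),
    }

if __name__ == "__main__":
    for name, r in simulate_averaging().items():
        print(f"{name}: {r:.2f}")
```

With these assumed numbers the two-year average tracks the true effect somewhat better than a single year does, but about two thirds of the variance in that average is still noise (signal variance 1, averaged noise variance 2). The average looks calmer than the yearly scores; that doesn't make it a precise rating of any one teacher.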

Further, as I’ve discussed several times on this blog, many states and districts are implementing methods far more limited than a “high quality” VAM, and in some cases states are adopting growth models that don’t attempt – or only marginally attempt – to account for any other factors that may affect student achievement over time. Even when those models do make some attempt to account for differences in students served, in many cases, as in the recent technical report on the model recommended for use in New York State, those models fail! And they fail miserably. But despite the fact that those models fail so miserably at their central, narrowly specified task (parsing teacher influence on test score gain), policymakers continue to push for their use in making high stakes personnel decisions.

The new Gates findings – while not explicitly endorsing use of “bad” models – arguably embolden this arrogant, wrongheaded behavior! The report has a responsibility to be clearer as to what constitutes a better and more appropriate model versus what constitutes an entirely inappropriate one.

See also:

Reliability of NYC Value-added

On the stability of being Irreplaceable (NYC data)

Seeking Practical uses of the NYC VAM data

Comments on the NY State Model

If it’s not valid, reliability doesn’t matter so much (SGP & VAM)

3. Continued Preference for the Weighted Components Model

Finally, my biggest issue is that this report and others continue to think about this all wrong. Yes, the information might be useful, but not if forced into a decision matrix or weighting system that requires the data to be used/interpreted with a level of precision or accuracy that simply isn’t there – or worse – where we can’t know if it is.

Allow me to copy and paste one more time the conclusion section of an article I have coming out in late January:

As we have explained herein, value-added measures have severe limitations when attempting even to answer the narrow question of the extent to which a given teacher influences tested student outcomes. Those limitations are sufficiently severe such that it would be foolish to impose on these measures, rigid, overly precise high stakes decision frameworks. One simply cannot parse point estimates to place teachers into one category versus another and one cannot necessarily assume that any one individual teacher’s estimate is necessarily valid (non-biased). Further, we have explained how student growth percentile measures being adopted by states for use in teacher evaluation are, on their face, invalid for this particular purpose. Overly prescriptive, overly rigid teacher evaluation mandates, in our view, are likely to open the floodgates to new litigation over teacher due process rights, despite much of the policy impetus behind these new systems supposedly being reduction of legal hassles involved in terminating ineffective teachers.

This is not to suggest that any and all forms of student assessment data should be considered moot in thoughtful management decision making by school leaders and leadership teams. Rather, that incorrect, inappropriate use of this information is simply wrong – ethically and legally (a lower standard) wrong. We accept the proposition that assessments of student knowledge and skills can provide useful insights both regarding what students know and potentially regarding what they have learned while attending a particular school or class. We are increasingly skeptical regarding the ability of value-added statistical models to parse any specific teacher’s effect on those outcomes. Further, the relative weight in management decision-making placed on any one measure depends on the quality of that measure and likely fluctuates over time and across settings. That is, in some cases, with some teachers and in some years, assessment data may provide leaders and/or peers with more useful insights. In other cases, it may be quite obvious to informed professionals that the signal provided by the data is simply wrong – not a valid representation of the teacher’s effectiveness.

Arguably, a more reasonable and efficient use of these quantifiable metrics in human resource management might be to use them as a knowingly noisy pre-screening tool to identify where problems might exist across hundreds of classrooms in a large district. Value-added estimates might serve as a first step toward planning which classrooms to observe more frequently. Under such a model, when observations are completed, one might decide that the initial signal provided by the value-added estimate was simply wrong. One might also find that it produced useful insights regarding a teacher’s (or group of teachers’) effectiveness at helping students develop certain tested algebra skills.

School leaders or leadership teams should clearly have the authority to make the case that a teacher is ineffective and that the teacher even if tenured should be dismissed on that basis. It may also be the case that the evidence would actually include data on student outcomes – growth, etc. The key, in our view, is that the leaders making the decision – indicated by their presentation of the evidence – would show that they have used information reasonably to make an informed management decision. Their reasonable interpretation of relevant information would constitute due process, as would their attempts to guide the teacher’s improvement on measures over which the teacher actually had control.

By contrast, due process is violated where administrators/decision makers place blind faith in the quantitative measures, assuming them to be causal and valid (attributable to the teacher) and applying arbitrary and capricious cutoff-points to those measures (performance categories leading to dismissal). The problem, as we see it, is that some of these new state statutes require these due process violations, even where the informed, thoughtful professional understands full well that she is being forced to make a wrong decision. They require the use of arbitrary and capricious cutoff-scores. They require that decision makers take action based on these measures even against their own informed professional judgment.
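For what it's worth, the pre-screening use suggested in the excerpt above is easy to sketch. This is a hypothetical illustration, not anything from the article or from an existing system: the function name, the `share` parameter, and the classroom ids are all made up. The only output is a list of classrooms to observe more often; nothing in it assigns a rating or triggers a personnel action.

```python
def flag_for_observation(estimates, share=0.2):
    """Return the classrooms with the lowest (noisy) estimates.

    estimates: dict mapping a classroom id to its value-added estimate.
    share: assumed fraction of classrooms to schedule for extra observation.
    The result is a to-do list for observers, not a judgment about teachers.
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    ranked = sorted(estimates, key=estimates.get)  # lowest estimate first
    k = max(1, round(len(ranked) * share))
    return ranked[:k]

if __name__ == "__main__":
    # Hypothetical estimates for five classrooms.
    estimates = {
        "room_101": 0.3,
        "room_102": -1.2,
        "room_103": 0.1,
        "room_104": -0.4,
        "room_105": 0.9,
    }
    print(flag_for_observation(estimates, share=0.4))
    # prints ['room_102', 'room_104']
```

What happens after the observation is left to the observer on purpose: the observer may well conclude that the initial signal was simply wrong, which is exactly the discretion a fixed cutoff score takes away.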