The Toxic Trifecta, Bad Measurement & Evolving Teacher Evaluation Policies

Bruce D. Baker

April 24, 2012

This post contains my preliminary thoughts in development for a forthcoming article dealing with the intersection between statistical and measurement issues in teacher evaluation and teachers’ constitutional rights where those measures are used for making high stakes decisions.

The Toxic Trifecta in Current Legislative Models for Teacher Evaluation

A relatively consistent legislative framework for teacher evaluation has evolved across states in the past few years. Many of the legal concerns that arise do so because of inflexible, arbitrary and often ill-conceived yet standard components of this legislative template. There are three basic features of the standard model, each of which is problematic in its own right, and those problems are multiplied when the features are used in combination.

First, the standard evaluation model proposed in legislation requires that objective measures of student achievement growth necessarily be considered in a weighting system of parallel components. Student achievement growth measures are assigned, for example, a 40 or 50% weight alongside observation and other evaluation measures. Placing the measures alongside one another in a weighting scheme assumes all measures in the scheme to be of equal validity and reliability but of varied importance (utility) – varied weight. Each measure must be included, and must be assigned the prescribed weight – with no opportunity to question the validity of any measure.[1] Such a system also assumes that the various measures included in the system are each scaled such that they can vary to similar degrees. That is, it assumes that the observational evaluations will be scaled to produce similar variation to the student growth measures, and that the variance in both measures is equally valid – not compromised by random error or bias. In fact, however, it remains highly likely that some components of the teacher evaluation model will vary far more than others, if for no other reason than that some measures contain more random noise than others or that some of the variation is attributable to factors beyond the teachers’ control. Regardless of the assigned weights and regardless of the cause of the variation (true or false measure), the measure that varies more will carry more weight in the final classification of the teacher as effective or not. In a system that places differential weight, but assumes equal validity across measures, even if the student achievement growth component is only a minority share of the weight, it may easily become the primary tipping point in most high stakes personnel decisions.
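The arithmetic behind that last point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two components are treated as uncorrelated, and the weights and standard deviations are invented for the example, not taken from any state's system. Under those assumptions, each component's share of the variance in the combined score is proportional to (weight × standard deviation) squared.

```python
def effective_weights(nominal_weights, std_devs):
    """Share of composite-score variance contributed by each component.

    Assumes the components are uncorrelated, so the variance of the weighted
    sum is the sum of (weight * sd) ** 2 across components.
    """
    contributions = [(w * s) ** 2 for w, s in zip(nominal_weights, std_devs)]
    total = sum(contributions)
    return [c / total for c in contributions]

# Hypothetical numbers: growth measure weighted 40%, observation 60%.
# Growth scores spread widely (sd = 30); observation ratings cluster (sd = 10).
growth_share, observation_share = effective_weights([0.4, 0.6], [30.0, 10.0])
print(f"growth: {growth_share:.0%}, observation: {observation_share:.0%}")
```

With these made-up inputs, the component carrying the minority nominal weight accounts for most of the variation in the combined score, which is what drives who lands on which side of a cutoff.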

Second, the standard evaluation model proposed in legislation requires that teachers be placed into effectiveness categories by assigning arbitrary numerical cutoffs to the aggregated weighted evaluation components. That is, a teacher in the 25%ile or lower when combining all evaluation components might be assigned a rating of “ineffective,” whereas the teacher at the 26%ile might be labeled effective. Further, the teacher’s placement into these groupings may largely if not entirely hinge on their rating in the student achievement growth component of their evaluation. Teachers on either side of the arbitrary cutoff are undoubtedly statistically no different from one another. In many cases, as with the recently released teacher effectiveness estimates on New York City teachers, the error ranges for the teacher percentile ranks have been on the order of 35%ile points (on average, up to 50% with one year of data). Assuming that there is any real difference between the teacher at the 25%ile and 26%ile (as their point estimate) is a huge unwarranted stretch. Placing an arbitrary, rigid, cut-off score into such noisy measures makes distinctions that simply cannot be justified, especially when making high stakes employment decisions.
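A rough sketch of the comparison, in Python, is below. It treats the 35 percentile points as the full width of each teacher's error range, which is an assumption made for illustration; if the figure were a half-width, the overlap would be larger still. The function names are hypothetical.

```python
def percentile_interval(point, width):
    """Error range around a percentile rank, clipped to the 0-100 scale."""
    half = width / 2
    return max(0.0, point - half), min(100.0, point + half)

def intervals_overlap(a, b):
    """True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

CUTOFF = 25  # at or below this percentile the teacher is labeled "ineffective"

teacher_a = percentile_interval(25, 35)  # point estimate lands on "ineffective"
teacher_b = percentile_interval(26, 35)  # point estimate lands on "effective"

print("A:", teacher_a, "B:", teacher_b)
print("ranges overlap:", intervals_overlap(teacher_a, teacher_b))
print("both ranges straddle the cutoff:",
      all(low <= CUTOFF <= high for low, high in (teacher_a, teacher_b)))
```

Under this assumption the two ranges are nearly identical and both extend well above and well below the cutoff, so the category labels separate teachers that the data cannot tell apart.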

Third, the standard evaluation model proposed in legislation places exact timelines on the conditions for removal of tenure. Typical legislation dictates that teacher tenure either can or must be revoked and the teacher dismissed after 2 consecutive years of being rated ineffective (where tenure can only be achieved after 3 consecutive years of being rated effective).[2] As such, whether a teacher rightly or wrongly falls just below or just above the arbitrary cut-offs that define performance categories may have relatively inflexible consequences.
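To get a feel for how a fixed two-year rule interacts with noisy ratings, here is a minimal Python sketch. It rests on assumptions that are mine and not drawn from any bill or dataset: ratings are a true effect plus normally distributed noise, the cutoff sits at the 25th percentile of observed ratings, the reliability value is hypothetical, and the noise is independent from one year to the next.

```python
from statistics import NormalDist

def chance_rated_ineffective(true_z, reliability, cutoff_pct=0.25):
    """Probability that one year's noisy rating falls below the cutoff.

    true_z: the teacher's true effect in standard deviation units (0 = average).
    reliability: assumed share of rating variance that is true signal.
    Observed ratings are scaled to have unit variance overall.
    """
    std_normal = NormalDist()
    cutoff_z = std_normal.inv_cdf(cutoff_pct)
    signal = true_z * reliability ** 0.5
    noise_sd = (1 - reliability) ** 0.5
    return std_normal.cdf((cutoff_z - signal) / noise_sd)

# Hypothetical: a perfectly average teacher, ratings with reliability 0.35
p_one_year = chance_rated_ineffective(true_z=0.0, reliability=0.35)
p_two_years = p_one_year ** 2  # assumes independent noise from year to year
print(f"one year: {p_one_year:.3f}, two consecutive years: {p_two_years:.3f}")
```

The independence assumption is the generous case. If part of the error is a persistent bias tied to the students a teacher serves, the same teacher is more likely to fall below the line in both years, for reasons unrelated to teaching.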

The Forced Choice between “Bad” Measures and “Wrong” Ones

Two separate camps have recently emerged in state policy regarding development and application of measures of student achievement growth to be used in newly adopted teacher evaluation systems. The first general category of methods is known as value-added models and the second as student growth percentiles. Among researchers it is well understood that these are substantively different measures by their design, one being a possible component of the other. But these measures and their potential uses have been conflated by policymakers wishing to expedite implementation of new teacher evaluation policies and pilot programs.

Arguably, one reason for the increasing popularity of the student growth percentile (SGP) approach across states is the extent of highly publicized scrutiny of, and the large and growing body of empirical research on, problems with using value-added measures for determining teacher effectiveness (See Green, Baker and Oluwole, 2012). Yet, there has been little such research on the usefulness of student growth percentiles for determining teacher effectiveness. The reason for this vacuum is not that student growth percentiles are simply not susceptible to the problems of value-added models, but that researchers have chosen not to evaluate their validity for this purpose – estimating teacher effectiveness – because they are not designed to infer teacher effectiveness.

A value added estimate uses assessment data in the context of a statistical model (regression analysis), where the objective is to estimate the extent to which a student having a specific teacher or attending a specific school influences that student’s difference in score from the beginning of the year to the end of the year – or period of treatment (in school or with teacher). The most thorough of VAMs attempt to account for several prior year test scores (to account for the extent that having a certain teacher alters a child’s trajectory), classroom level mix of students, individual student background characteristics, and possibly school characteristics. The goal is to identify most accurately the share of the student’s or group of students’ value-added that should be attributed to the teacher as opposed to other factors outside of the teachers’ control.
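A stripped-down sketch may help fix the idea. The Python below is not any state's or vendor's model. It regresses end-of-year scores on prior-year scores for a handful of invented students and then averages the leftover (residual) growth by teacher. Real value-added models add many more controls (multiple prior scores, classroom composition, student background), and all names and numbers here are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (teacher, prior-year score, end-of-year score)
records = [
    ("A", 40, 47), ("A", 50, 57), ("A", 60, 67),
    ("B", 40, 43), ("B", 50, 53), ("B", 60, 63),
]

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def teacher_effects(rows):
    """Average residual from the prior-score regression, grouped by teacher."""
    slope, intercept = fit_line([r[1] for r in rows], [r[2] for r in rows])
    residuals = defaultdict(list)
    for teacher, prior, current in rows:
        residuals[teacher].append(current - (intercept + slope * prior))
    return {t: sum(v) / len(v) for t, v in residuals.items()}

effects = teacher_effects(records)
print(effects)
```

In this toy data, teacher A's students finish two points above what their prior scores predict and teacher B's finish two points below. Whether that gap reflects the teacher or something left out of the regression is exactly what the fuller models try, with limited success, to sort out.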

By contrast, a student growth percentile is a descriptive measure of the relative change of a student’s performance compared to that of all students and based on a given underlying test or set of tests. That is, the individual scores obtained on these underlying tests are used to construct an index of student growth, where the median student, for example, may serve as a baseline for comparison. Some students have achievement growth on the underlying tests that is greater than the median student, while others have growth from one test to the next that is less. That is, the approach estimates not how much the underlying scores changed, but how much the student moved within the mix of other students taking the same assessments, using a method called quantile regression to estimate the rarity that a child falls in her current position in the distribution, given her past position in the distribution.[3] Student growth percentile measures may be used to characterize each individual student’s growth, or may be aggregated to the classroom level or school level, and/or across children who started at similar points in the distribution to attempt to characterize collective growth of groups of students.
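The logic can be illustrated with a deliberately simplified Python sketch. Operational SGPs rely on quantile regression; the version below swaps in a much cruder technique, ranking each student's current score among students whose prior scores fall in the same band. The band width, student IDs and scores are all hypothetical.

```python
from collections import defaultdict

def simple_growth_percentiles(students, band_width=10):
    """Percentile rank of each current score among same-band 'academic peers'.

    students: list of (student_id, prior_score, current_score) with integer
    scores. Not quantile regression; a coarse stand-in for illustration only.
    """
    peers = defaultdict(list)
    for _, prior, current in students:
        peers[prior // band_width].append(current)
    result = {}
    for student_id, prior, current in students:
        group = peers[prior // band_width]
        below = sum(1 for c in group if c < current)
        ties = sum(1 for c in group if c == current)  # includes the student
        result[student_id] = 100.0 * (below + 0.5 * ties) / len(group)
    return result

# Hypothetical students: (id, prior-year score, current-year score)
students = [
    ("s1", 42, 50), ("s2", 45, 60), ("s3", 48, 55), ("s4", 49, 70),
    ("s5", 81, 70), ("s6", 85, 90),
]
sgp = simple_growth_percentiles(students)
print(sgp)
```

Students s4 and s5 earn the same current score, yet s4 ranks near the top of her peer group and s5 near the bottom of his. The measure describes where a student moved relative to similar starters and says nothing about why.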

Many, if not most value-added models also involve normative rescaling of student achievement data, measuring in relative terms how much individual students or groups of students have moved within the large mix of students. The key difference is that the value-added models include other factors in an attempt to identify the extent to which having a specific teacher contributed to that growth, whereas student growth percentiles are simply a descriptive measure of the growth itself. A student growth percentile measure could be used in a value-added model.

As described by the authors of the Colorado Growth Model:

“A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.” (Betebenner, Wenning & Briggs, 2011)

Unlike value-added teacher effect estimates, student growth percentiles are not intended for attribution of responsibility for student progress to either the teacher or the school. But if that is so clearly the case (as stated as recently as Fall 2011), is it plausible that states or local school districts will actually choose to use the measures to make inferences? Below is a brief explanation from a Q&A section of the New Jersey Department of Education web site regarding implementation of pilot teacher evaluation programs:

Standardized test scores are not available for every subject or grade. For those that exist (Math and English Language Arts teachers of grades 4-8), Student Growth Percentages (SGPs), which require pre- and post-assessments, will be used. The SGPs should account for 35%-45% of evaluations. The NJDOE will work with pilot districts to determine how student achievement will be measured in non-tested subjects and grades.[4]

This explanation clearly indicates that student growth percentile data are to be used for “evaluation” of teacher effectiveness. In fact, the SGPs alone, as they stand, as descriptive measures “should account for 35%-45% of evaluations.” Other states including Colorado have already adopted (pioneered) the use of Student Growth Percentiles as a statewide accountability measure and have concurrently passed high stakes teacher evaluation legislation. But it remains to be seen how the SGP data will be used in district specific contexts in guiding high stakes decisions.

While value-added models are intended to estimate teacher effects on student achievement growth, they fail to do so in any accurate or precise way (see Green, Oluwole & Baker, 2012). By contrast, student growth percentiles make no such attempt.[5] Specifically, value-added measures tend to be highly unstable from year to year, and have very wide error ranges when applied to individual teachers, making confident distinctions between “good” and “bad” teachers difficult if not impossible. Further, while value-added models attempt to isolate that portion of student achievement growth that is caused by having a specific teacher, they often fail to do so, and it is difficult if not impossible to discern a) how much they have failed and b) in which direction for which teachers. That is, the individual teacher estimates may be biased by factors not fully addressed in the models, and we may not know how much. We also know that when different tests are used for the same content, teachers receive widely varied ratings, raising additional questions about the validity of the measures.

While we do not have similar information from existing research on student growth percentiles, it stands to reason that since they are based on the same types of testing data, they will be similarly susceptible to error and noise. But more problematically, since student growth percentiles make no attempt (by design) to consider other factors that contribute to student achievement growth, the measures have significant potential for omitted variables bias. SGPs leave the interpreter of the data to naively infer (by omission) that all growth among students in the classroom of a given teacher must be associated with that teacher. Even subtle changes to explanatory variables in value-added models change substantively the ratings of individual teachers (Ballou et al., 2012; Briggs & Domingue, 2011). Excluding all potential explanatory variables, as do SGPs, takes this problem to the extreme. As a result, it may turn out that SGP measures at the teacher level appear more stable from year to year than value-added estimates, but that stability may be entirely a function of teachers serving similar populations of students from year to year. That is, the measures may contain stable omitted variables bias, and thus may be stable in their invalidity.
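A small simulation can illustrate what “stable in their invalidity” might look like. The Python sketch below is entirely hypothetical: every simulated teacher is given the same true effect, each classroom has a fixed context index (standing in for whatever outside factors go unmeasured), and classroom-average growth depends only on that context plus random noise. The seed, the penalty size and the noise level are arbitrary choices.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
N_TEACHERS = 20
TRUE_TEACHER_EFFECT = 0.0  # every simulated teacher is equally effective
CONTEXT_PENALTY = 10.0     # hypothetical growth lost in the highest-need room

# Hypothetical classroom context index from 0 to 1, identical in both years
context = [t / (N_TEACHERS - 1) for t in range(N_TEACHERS)]

def class_growth(c):
    """Unadjusted classroom-average growth: context plus noise, no teacher signal."""
    return TRUE_TEACHER_EFFECT - CONTEXT_PENALTY * c + rng.gauss(0, 1)

year1 = [class_growth(c) for c in context]
year2 = [class_growth(c) for c in context]

stability = pearson(year1, year2)
context_link = pearson(context, year1)
print(f"year-to-year correlation: {stability:.2f}")
print(f"correlation with classroom context: {context_link:.2f}")
```

In this setup the unadjusted classroom measure tracks closely from one year to the next and lines up with classroom context, even though no teacher differs from any other. Stability alone, in other words, is not evidence that the measure is picking up teaching.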

In defense of Student Growth Percentiles as accountability measures but with no mention of their use for teacher evaluation, Betebenner, Wenning and Briggs (2011) explain that one school of thought is that value-added estimates are also most reasonably interpreted as descriptive measures, and should not be used to infer teacher or school effectiveness:

“The development of the Student Growth Percentile methodology was guided by Rubin et al’s (2004) admonition that VAM quantities are, at best, descriptive measures.” (Betebenner, Wenning & Briggs, 2011)

Rubin et al explain:

“Value-added assessment is a complex issue, and we appreciate the efforts of Ballou et al. (2004), McCaffrey et al. (2004) and Tekwe et al. (2004). However, we do not think that their analyses are estimating causal quantities, except under extreme and unrealistic assumptions. We argue that models such as these should not be seen as estimating causal effects of teachers or schools, but rather as providing descriptive measures.” (Rubin et al., 2004)

Arguably, these explanations do less to validate the usefulness of Student Growth Percentiles as accountability measures (inferring attribution and/or responsibility to schools and teachers) and far more to invalidate the usefulness of both Student Growth Percentiles and Value-Added Models for these purposes.

New Jersey’s TEACHNJ: At The Intersection of the Toxic Trifecta and “Wrong” Measures

A short while back, John Mooney over at NJ Spotlight provided an overview of a pending bill in the New Jersey legislature which just so happens to contain explicitly at least two out of three of the elements of the Toxic Trifecta and contains the third implicitly by granting deference to the NJ Department of Education to approve the quantitative measures used in evaluation systems.

Text of the Bill: http://www.njleg.state.nj.us/2012/Bills/S0500/407_I1.PDF

First, the bill throughout refers to the creation of performance categories as discussed above, implicitly if not explicitly declaring those categories to be absolute, clearly defined and fully differentiable from one another.

Second, while the bill is not explicit in its requirement of specific quantified performance metrics, it grants latitude on this matter to the NJ Department of Education (to approve local plans), which a) is developing a student growth percentile model to be used for these purposes, and b) under its pilot plan is suggesting (if not requiring) that districts use the student growth percentile data for 35 to 45% of evaluations, as noted above.

Third, the bill places an absolute and inflexible timeline on dismissal:

Notwithstanding any provision of law to the contrary, the principal, in consultation with the panel, shall revoke the tenure granted to an employee in the position of teacher, assistant principal, or vice-principal if the employee is evaluated as ineffective in two consecutive annual evaluations. (p. 10)

The key word here is “shall” which indicates a statutory obligation to revoke tenure. It does not say “may,” or “at the principal’s discretion.” It says shall.

The principal shall revoke tenure if a teacher is unlucky enough to land below an arbitrary cut-point, using a measure not designed for such purposes, for two years in a row (even if the teacher was lucky enough to achieve an “awesome” rating every other year of her career!).

The kicker is that the bill goes one step further to attempt to eliminate any due process right a teacher might have to challenge the basis for the dismissal:

The revocation of the tenure status of a teacher, assistant principal, or vice-principal shall not be subject to grievance or appeal except where the ground for the grievance or appeal is that the principal failed to adhere substantially to the evaluation process. (p. 10)

In other words, the bill attempts to establish that teachers shall have no basis (no procedural due process claim) for grievance as long as the principal has followed their evaluation plan, ignoring the possibility – the fact – that these evaluation plans themselves, approved or not, will create scenarios and cause personnel decisions which violate due process rights. Further, the attempt at restricting due process rights laid out in the bill itself is a threat to due process and would likely be challenged.

Declaring any old process to constitute due process does not make it so! Especially where the process is built on not only “bad” but “wrong” measures used in a framework that forces dismissal decisions on at least 3 completely arbitrary and capricious bases (2 consecutive years in isolation, fixed weight on wrong measure, arbitrary cut-points for performance categories).

So this raises the big question of what’s behind all of this. Clearly, one thing that’s behind all of this is an astonishing ignorance of statistics and measurement among state legislators favoring the toxic trifecta – either that or a willful neglect of their legislative duty to respect constitutional protections including due process (or both!).

[1] A more reasonable alternative being to use the statistical information as a preliminary screening tool for identifying potential problem areas, and then using more intensive observations and additional evaluation tools as follow-up. This approach acknowledges that the signals provided by the statistical information may in fact be false either as a function of reliability problems or lacking validity (other conditions contributed to the rating), and therefore in some if not many cases, should be discarded. The parallel consideration more commonly used requires that the student growth metric be considered and weighted as prescribed, reliable, valid or not.

[2] For example, at the time of writing this draft, the bill introduced in New Jersey read: “Notwithstanding any provision of law to the contrary, the principal shall revoke the tenure granted to an employee in the position of teacher, assistant principal, or vice-principal, regardless of when the employee acquired tenure, if the employee is evaluated as ineffective or partially effective in one year’s annual summative evaluation and in the next year’s annual summative evaluation the employee does not show improvement by being evaluated in a higher rating category. The only evaluations which may be used by the principal for tenure revocation are those evaluations conducted in the 2013-2014 school year and thereafter which use the rubric adopted by the board and approved by the commissioner. The school improvement panel may make recommendations to the principal on a teacher’s tenure revocation.” http://www.njspotlight.com/assets/12/0203/0158

[3] For more precise explanations, see: http://dirwww.colorado.edu/education/faculty/derekbriggs/Docs/Briggs_Weeks_Is%20Growth%20in%20Student%20Achievement%20Scale%20Dependent.pdf

[4] http://www.state.nj.us/education/EE4NJ/faq/

[5] Briggs and Betebenner (2009) explain: “However, there is an important philosophical difference between the two modeling approaches in that Betebenner (2008) has focused upon the use of SGPs as a descriptive tool to characterize growth at the student-level, while the LM (layered model) is typically the engine behind the teacher or school effects that get produced for inferential purposes in the EVAAS.” (Briggs & Betebenner, 2009, p. )

References

Alexander, K.L, Entwisle, D.R., Olsen, L.S. (2001) Schools, Achievement and Inequality: A Seasonal Perspective. Educational Evaluation and Policy Analysis 23 (2) 171-191

Ballou, D., Mokher, C.G., Cavaluzzo, L. (2012) Using Value-Added Assessment for Personnel Decisions: How Omitted Variables and Model Specification Influence Teachers’ Outcomes. Annual Meeting of the Association for Education Finance and Policy. Boston, MA. http://aefpweb.org/sites/default/files/webform/AEFP-Using%20VAM%20for%20personnel%20decisions_02-29-12.docx

Ballou, D. (2012). Review of “The Long-Term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood.” Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/thinktank/review-long-term-impacts

Baker, E.L., Barton, P.E., Darling-Hammond, L., Haertel, E., Ladd, H.F., Linn, R.L., Ravitch, D., Rothstein, R., Shavelson, R.J., Shepard, L.A. (2010) Problems with the Use of Student Test Scores to Evaluate Teachers. Washington, DC: Economic Policy Institute. http://epi.3cdn.net/724cd9a1eb91c40ff0_hwm6iij90.pdf

Betebenner, D., Wenning, R.J., Briggs, D.C. (2011) Student Growth Percentiles and Shoe Leather. http://www.ednewscolorado.org/2011/09/13/24400-student-growth-percentil…

Boyd, D.J., Lankford, H., Loeb, S., & Wyckoff, J.H. (July, 2010). Teacher layoffs: An empirical illustration of seniority vs. measures of effectiveness. Brief 12. National Center for Evaluation of Longitudinal Data in Education Research. Washington, DC: The Urban Institute.

Briggs, D., Betebenner, D. (2009) Is growth in student achievement scale dependent? Paper presented at the invited symposium Measuring and Evaluating Changes in Student Achievement: A Conversation about Technical and Conceptual Issues at the annual meeting of the National Council for Measurement in Education, San Diego, CA, April 14, 2009. http://dirwww.colorado.edu/education/faculty/derekbriggs/Docs/Briggs_Weeks_Is%20Growth%20in%20Student%20Achievement%20Scale%20Dependent.pdf

Briggs, D. & Domingue, B. (2011). Due Diligence and the Evaluation of Teachers: A review of the value-added analysis underlying the effectiveness rankings of Los Angeles Unified School District teachers by the Los Angeles Times. Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/publication/due-diligence.

Buddin, R. (2010) How Effective Are Los Angeles Elementary Teachers and Schools?, Aug. 2010, available at http://www.latimes.com/media/acrobat/2010-08/55538493.pdf.

Braun, H, Chudowsky, N, & Koenig, J (eds). (2010) Getting value out of value-added. Report of a Workshop. Washington, DC: National Research Council, National Academies Press.

Braun, H. I. (2005). Using student progress to evaluate teachers: A primer on value-added models. Princeton, NJ: Educational Testing Service. Retrieved February, 27, 2008.

Chetty, R., Friedman, J., Rockoff, J. (2011) The Long Term Impacts of Teachers: Teacher Value Added and Student outcomes in Adulthood. NBER Working Paper # 17699 http://www.nber.org/papers/w17699

Clotfelter, C., Ladd, H.F., Vigdor, J. (2005) Who Teaches Whom? Race and the distribution of Novice Teachers. Economics of Education Review 24 (4) 377-392

Clotfelter, C., Glennie, E. Ladd, H., & Vigdor, J. (2008). Would higher salaries keep teachers in high-poverty schools? Evidence from a policy intervention in North Carolina. Journal of Public Economics 92, 1352-70.

Corcoran, S.P. (2010) Can Teachers Be Evaluated by their Students’ Test Scores? Should they Be? The Use of Value Added Measures of Teacher Effectiveness in Policy and Practice. Annenberg Institute for School Reform. http://annenberginstitute.org/pdf/valueaddedreport.pdf

Corcoran, S.P. (2011) Presentation at the Institute for Research on Poverty Summer Workshop: Teacher Effectiveness on High- and Low-Stakes Tests (Apr. 10, 2011), available at https://files.nyu.edu/sc129/public/papers/corcoran_jennings_beveridge_2011_wkg_teacher_effects.pdf.

Corcoran, Sean P., Jennifer L. Jennings, and Andrew A. Beveridge. 2010. “Teacher Effectiveness on High- and Low-Stakes Tests.” Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI.

D.C. Pub. Sch., IMPACT Guidebooks (2011), available at http://dcps.dc.gov/portal/site/DCPS/menuitem.06de50edb2b17a932c69621014f62010/?vgnextoid=b00b64505ddc3210VgnVCM1000007e6f0201RCRD.

Education Trust (2011) Fact Sheet- Teacher Quality. Washington, DC. http://www.edtrust.org/sites/edtrust.org/files/Ed%20Trust%20Facts%20on%20Teacher%20Equity_0.pdf

Hanushek, E.A., Rivkin, S.G. (2010) Presentation for the American Economic Association: Generalizations about Using Value-Added Measures of Teacher Quality (Jan. 3-5, 2010), available at http://www.utdallas.edu/research/tsp-erc/pdf/jrnl_hanushek_rivkin_2010_teacher_quality.pdf

Working with Teachers to Develop Fair and Reliable Measures of Effective Teaching. MET Project White Paper. Seattle, Washington: Bill & Melinda Gates Foundation, 1. Retrieved December 16, 2010, from http://www.metproject.org/downloads/met-framing-paper.pdf.

Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project. MET Project Research Paper. Seattle, Washington: Bill & Melinda Gates Foundation. Retrieved December 16, 2010, from http://www.metproject.org/downloads/Preliminary_Findings-Research_Paper.pdf.

Jackson, C.K., Bruegmann, E. (2009) Teaching Students and Teaching Each Other: The Importance of Peer Learning for Teachers. American Economic Journal: Applied Economics 1(4): 85–108

Kane, T., Staiger, D. (2008) Estimating Teacher Impacts on Student Achievement: An Experimental Evaluation. NBER Working Paper #14607 http://www.nber.org/papers/w14607

Koedel, C. (2009) An Empirical Analysis of Teacher Spillover Effects in Secondary School. Economics of Education Review 28 (6) 682-692

Koedel, C., & Betts, J. R. (2009). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique. Working Paper.

Jacob, B. & Lefgren, L. (2008). Can principals identify effective teachers? Evidence on subjective performance evaluation in education. Journal of Labor Economics. 26(1), 101-36.

Sass, T.R., (2008) The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy. National Center for Analysis of Longitudinal Data in Educational Research. Policy Brief #4. http://eric.ed.gov/PDFS/ED508273.pdf

McCaffrey, D. F., Lockwood, J. R., Koretz, D., & Hamilton, L. (2003). Evaluating value-added models for teacher accountability. RAND Research Report prepared for the Carnegie Corporation.

McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67.

Rothstein, J. (2011). Review of “Learning About Teaching: Initial Findings from the Measures of Effective Teaching Project.” Boulder, CO: National Education Policy Center. Retrieved [date] from http://nepc.colorado.edu/thinktank/review-learning-about-teaching.

Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4(4), 537–571.

Rothstein, J. (2010). Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement. Quarterly Journal of Economics, 125(1), 175–214.

Sanders, W. L., Saxton, A. M., & Horn, S. P. (1997). The Tennessee Value-Added Assessment System: A quantitative outcomes-based approach to educational assessment. In J. Millman (Ed.), Grading teachers, grading schools: Is student achievement a valid measure? (pp. 137-162). Thousand Oaks, CA: Corwin Press.

Sanders, William L., Rivers, June C., 1996. Cumulative and residual effects of teachers on future student academic achievement. Knoxville: University of Tennessee Value-Added Research and Assessment Center.

McCaffrey, D.F., Sass, T.R., Lockwood, J.R., Mihaly, K. (2009) The Intertemporal Variability of Teacher Effect Estimates. Education Finance and Policy 4 (4) 572-606

McCaffrey, D.F., Lockwood, J.R. (2011) Missing Data in Value Added Modeling of Teacher Effects. Annals of Applied Statistics 5 (2A) 773-797

Reardon, S. F. & Raudenbush, S. W. (2009). Assumptions of value-added models for estimating school effects. Education Finance and Policy, 4(4), 492–519.

Rubin, D. B., Stuart, E. A., and Zanutto, E. L. (2004). A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics, 29(1):103–116.

Schochet, P.Z., Chiang, H.S. (2010) Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains. Institute for Education Sciences, U.S. Department of Education. http://ies.ed.gov/ncee/pubs/20104004/pdf/20104004.pdf.

This blog post has been shared by permission from the author.
Find the original post here:

School Finance 101

The views expressed by the blogger are not necessarily those of NEPC.