A Game of Inches

One of the more telling episodes in education I’ve seen over the past couple of years was a little dispute over Michelle Rhee’s testing record that flared up last year. Alan Ginsburg, a retired U.S. Department of Education official, released an informal report in which he presented the NAEP cohort changes that occurred during the first two years of Michelle Rhee’s tenure (2007-2009), and compared them with those during the superintendencies of her two predecessors.

Ginsburg concluded that the increases under Chancellor Rhee, though positive, were less rapid than in previous years (2000 to 2007 in math, 2003 to 2007 in reading). Soon thereafter, Paul Peterson, director of Harvard’s Program on Educational Leadership and Governance, published an article in Education Next that disputed Ginsburg’s findings. Peterson found that increases under Rhee amounted to roughly three scale score points per year, compared with around 1-1.5 points annually between 2000 and 2007 (the actual amounts varied by subject and grade).

Both articles were generally cautious in tone and in their conclusions about the actual causes of the testing trends. The technical details of the two reports – who’s “wrong” or “right” – are not important for this post (especially since more recent NAEP results have since been released). More interesting was how people reacted – and didn’t react – to the dueling analyses.

Predictably, they stirred up intense sparring between Rhee’s supporters and detractors. Putting aside the fact that one simply cannot use raw NAEP results to assess the effectiveness of schools or policies (to say nothing of superintendents), I had a somewhat visceral initial reaction to this debate. It stemmed from my perception that, for a while, the success or failure of Michelle Rhee – everything she did or didn’t do, good or bad – was being argued on the basis of one or two NAEP points.*

I strongly believe in evidence-based decision making, and that test scores should play an important role in that endeavor. But it’s difficult to stomach the idea that a couple of scale score points (misinterpreted to boot) could make or break a career, or even be a significant factor in that determination. This is not a football game.**

To me, however, the truly revealing part of this small tussle was that most participants missed the forest for the trees. The data clearly indicate that DCPS’s average NAEP scores had been increasing fairly steadily for several years, spanning three different superintendents with rather divergent policy agendas. There was much more consistency than variation. So, if there’s any conclusion that might be (very cautiously) drawn here, it’s that short-term test scores are not particularly sensitive to who occupies the superintendent’s office.

Scores change for a variety of reasons, school and non-school, including measurement error, and – in the case of cross-sectional data such as NAEP – demographic changes between cohorts (mobility in DC schools is quite high).

Education is cumulative and multi-dimensional. Many interventions (especially those related to teacher quality) take years to produce a measurable effect, and they interact with existing policies as well as with non-school factors and policies (e.g., a massive recession). Untangling all of these factors is extremely difficult, and this precludes chalking up raw short-term cohort changes to the mere presence of one individual.***

Yet the public debate exemplified the exact opposite viewpoint. Instead of noting the relatively consistent trend, we were driven by the urge to impose some arbitrary causal structure on the data by chopping it up into discrete units defined by superintendencies, then attributing the slightest variation in the trend to each person’s leadership or policies.

Superintendents certainly influence results, but we all, myself included, need to resist the reflexive inclination to reduce a complicated community/teaching/learning/testing dynamic down to one person or set of policies. When we do that, the inherent limitations of test scores are further exacerbated. They become a political tool largely detached from empirical reality, and not a very sharp tool at that. We have let it get that way, but I think we’re better than that.

- Matt Di Carlo

*****

* It bears mentioning that these were differences in the annual change, and that one NAEP point is not as inconsequential as it might sound. I might also point out that, in this case, it’s possible that the difference in the trend was not statistically significant (I could not figure out how to test this using the NAEP Data Explorer).
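For what it’s worth, a rough back-of-the-envelope version of that test, assuming the NAEP cross-sections are treated as independent samples and the published scale scores and standard errors are plugged in, might look something like the sketch below. The numbers are placeholders rather than actual DCPS estimates, and the calculation ignores the covariance introduced by the shared 2007 wave.

```python
from math import sqrt

def gain_and_se(score_end, se_end, score_start, se_start):
    """Scale-score change between two NAEP waves and its standard error,
    treating the two cross-sectional samples as independent."""
    gain = score_end - score_start
    se = sqrt(se_end ** 2 + se_start ** 2)
    return gain, se

# Placeholder values, NOT actual DCPS results.
gain_pre, se_pre = gain_and_se(214.0, 1.1, 205.0, 1.2)    # e.g., the 2000-2007 span
gain_post, se_post = gain_and_se(220.0, 1.0, 214.0, 1.1)  # e.g., the 2007-2009 span

# Annualize each gain over its span, then compare the two rates.
rate_pre, rate_post = gain_pre / 7, gain_post / 2
se_rate_pre, se_rate_post = se_pre / 7, se_post / 2

diff = rate_post - rate_pre
se_diff = sqrt(se_rate_pre ** 2 + se_rate_post ** 2)  # ignores the shared 2007 wave
z = diff / se_diff  # |z| > 1.96 would suggest significance at the 5 percent level
print(f"difference in annual change: {diff:.2f} points (z = {z:.2f})")
```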

** At this point, critics of the former DCPS chancellor would no doubt point out that they are no fans of high-stakes testing, but that it was Ms. Rhee who staked her own reputation on test scores – touting them relentlessly – and thus should be held to this standard. That’s fair enough, I suppose, but it’s a political rather than empirical argument, and one could easily assert that Ms. Rhee’s focus on testing results as the yardstick of success may have been unusually pronounced, but it did not set the standard; it reflected the standard.

*** A proper analysis of NAEP scores might be used to tease out some tentative conclusions about the possible policy-based causes of variation, but it would, at the very least, have to be multivariate and include the results of all districts/states (here’s an example).
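To make that a bit more concrete, a minimal sketch of such a multivariate setup, using hypothetical column names and a generic two-way fixed-effects regression rather than any particular published model, might look like this:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format panel: one row per district (or state) per NAEP wave.
# The file name and column names are illustrative, not from any actual dataset.
df = pd.read_csv("naep_district_panel.csv")  # district, year, score, policy, pct_frl, pct_ell

# Two-way fixed effects: district and year dummies absorb stable district
# differences and nationwide shocks; 'policy' stands in for whatever intervention
# is being examined, with demographic controls for shifting cohort composition.
model = smf.ols(
    "score ~ policy + pct_frl + pct_ell + C(district) + C(year)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["district"]})

print(model.params[["policy", "pct_frl", "pct_ell"]])
```

Even a specification like this would only support tentative conclusions, since it still leans on observational variation across districts and years.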

