This past week, the Center for Research on Education Outcomes (CREDO) at Stanford University released yet another report in a series on the effects of charter schools on test scores -- this time focusing on Texas.
Almost immediately, members of the local media trumpeted the results as "proof" that charter schools are realizing meaningful gains in student outcomes:
For the first time, Texas charter schools have outperformed traditional public schools in reading and closed the gap in math, researchers at Stanford University have found.
Students at Texas charter schools, on average, received the equivalent of 17 additional days of learning per year in reading and virtually the same level of education in math when compared to traditional public schools, according to a study released Wednesday by the Center for Research on Education Outcomes, or CREDO.
Rather than looking at raw standardized test scores, CREDO researchers quantify the impact of a school by looking at student improvement on the tests relative to other schools. The researchers then translate those results into an equivalent number of "days of learning" gained or lost in a 180-day school year.
The center's staff produced similar analyses in 2013, finding Texas charter schools had a negative impact on reading and math performance.
"The most recent results are positive in two ways. Not only do they show a positive shift over time, but the values themselves are both positive for the first time," the researchers wrote.
CREDO's studies of charter school performance are widely respected in education circles. The center compares students from charter and traditional public schools by matching them based on demographic characters -- race, economic status, geography and English proficiency, among others -- and comparing their growth on standardized tests. Scores from 2011-12 to 2014-15 were analyzed for the most recent report. [emphasis mine]
That's from the Houston Chronicle, which published just one paragraph suggesting the CREDO studies might have credible critics:
Skeptics of CREDO's study typically offer three main criticisms of the research: it focuses exclusively on standardized test results, incentivizing schools that "teach to the test"; it ignores other advantages of traditional public schools, such as better access to extracurricular activities; and it doesn't account for the fact that charter school students are more likely to have strong, positive parental influence on their education.
Sorry, but that's, at best, an incomplete description of the serious limitations of these studies, which include:
Here is how the CREDO Texas study reports its findings:
Stanley Pogrow published a paper earlier this year that didn't get much attention, and that's too bad, because he quite rightly points out that it's much more credible to describe results like the ones reported here as "small" than as substantial. An effect of 0.03 standard deviations is tiny: convert it to percentiles and you'll see it translates into moving from the 50th to the 51st percentile (the most generous possible interpretation when converting to percentiles).
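For readers who want to check the percentile arithmetic themselves, here is a minimal sketch. It assumes test scores are normally distributed, which is my simplifying assumption and not something the CREDO report states; the function name `percentile_after_shift` is likewise just an illustrative label.

```python
from statistics import NormalDist


def percentile_after_shift(effect_size_sd, start_percentile=50.0):
    """Percentile rank after moving a student up by effect_size_sd.

    Assumes a standard normal score distribution (an assumption made
    here for illustration, not a claim from the CREDO study).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    start_z = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(start_z + effect_size_sd)


# A 0.03 SD gain for a student who starts at the median.
print(round(percentile_after_shift(0.03), 1))  # 51.2
```

Starting at the median is the most generous case because the normal curve is densest there; the same 0.03 SD shift moves a student who starts at, say, the 90th percentile by fewer percentile points.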
I have been working on something more formal than a blog post to delve into this issue. I've decided to publish an excerpt now because, frankly, I am tired of seeing "days of learning" conversions reported in the press and in research -- both peer-reviewed and not -- as if there were no debate about their validity.
The fact is that many people who know what they are talking about have a problem with how CREDO and others use "days of learning," and it's well past time that the researchers who make this conversion justify it.
The excerpt below refers to what the eminent psychometrician Michael T. Kane calls a "validity argument." To quote Kane: "Public claims require public justification." I sincerely hope I can spark a meaningful conversation here and get the CREDO team to adequately and publicly justify their use of "days of learning." As of now, their validity argument is cursory at best -- and that's just not good enough.
I have added some bolding to the excerpt below to highlight key points.
* * *
Avoiding the Validity Argument: A Case Study
As an illustration of the problem of avoiding the validity argument in education policy, I turn to an ongoing series of influential studies of charter school effects. Produced by The Center for Research on Education Outcomes at Stanford University, the so-called CREDO reports have driven a great deal of discussion about the efficacy of charter school proliferation. The studies have been cited often in the media, where the effects they find are reported as “days of learning.” Both the National Charter School Study (Raymond et al., 2013) and the Urban Charter School Study Report on 41 Regions (CREDO, 2015) include tables that translate the effect sizes found in the study into “days of learning.” Since an effect size of 0.25 SD is translated into 180 days, the clear implication is that an effect of this size moves a student ahead a grade level (a typical school year being 180 days long). Yet neither study explains the rationale behind the tables; instead, they cite two different sources, each authored by economist Eric Hanushek, as the source for the translations.
The 2015 study (p. 5) cites a paper published in Education Next (Hanushek, Peterson & Woessmann, 2012) that asserts: “On most measures of student performance, student growth is typically about 1 full std. dev. on standardized tests between 4th and 8th grade, or about 25 percent of a std. dev. from one grade to the next.” (p. 3-4) No citation, however, is given to back up this claim: it is simply stated as a received truth.
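To make the implied conversion concrete, here is a minimal sketch of the arithmetic behind such a table. It assumes the translation is a simple linear rescaling, with 0.25 SD treated as one 180-day school year; the constant and function names are mine, and the reports themselves do not spell out a formula.

```python
# Figures taken from the translation described above: 0.25 SD is
# treated as one grade level, and a school year as 180 days.
SD_PER_GRADE = 0.25
SCHOOL_YEAR_DAYS = 180


def days_of_learning(effect_size_sd):
    """Linear rescaling of an effect size into "days of learning".

    Assumption for illustration: the conversion is strictly
    proportional, so every 0.25 SD counts as 180 days.
    """
    return effect_size_sd / SD_PER_GRADE * SCHOOL_YEAR_DAYS


print(days_of_learning(0.25))  # 180.0
```

Note that the function will happily return a number for any effect size from any test. Nothing in the arithmetic checks whether the underlying scores can actually be compared across grade levels, which is exactly the assumption at issue.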
The 2013 study (p. 13) cites a chapter by Hanushek in the Handbook of the Economics of Education (Hanushek & Rivkin, 2006), in which the author cites his own earlier work:
“Hanushek (1992) shows that teachers near the top of the quality distribution can get an entire year’s worth of additional learning out of their students compared to those near the bottom. That is, a good teacher will get a gain of 1.5 grade level equivalents while a bad teacher will get 0.5 year for a single academic year.” (p. 1068)
No other references are made within the chapter as to how student gains could be presented as years or fractions of a year’s worth of learning.
The 1992 citation is to an investigation by Hanushek of the correlation between birth order and student achievement, and between family size and student achievement. The test scores used to measure achievement come from the “Iowa Reading Comprehension and Vocabulary Tests.” (p. 88) The Iowa Assessments: Forms E and F Research and Development Guide (2015), which traces the development of the Iowa Assessments back to the 1970s, states:
“To describe the developmental continuum or learning progression in a particular achievement domain, students in several different grade levels must answer the same questions in that domain. Because of the range of item difficulty in the scaling tests, special Directions for Administration were prepared to explain to students that they would be answering some very easy questions and other very difficult questions.” (p. 55-56)
In other words: to have test scores reported in a way that allows for comparisons across grade levels (or, by extension, fractions of a grade level), the Iowa Assessments deliberately place the same questions across those grade levels. There is no indication, however, that all, or any, of the statewide tests used in the CREDO studies have this property.
Harris (2007) describes the process of creating a common score scale for different levels of an assessment as vertical scaling. She notes: “Different decisions can lead to different vertical scales, which in turn can lead to different reported scores and different decisions.” (p. 233) In her discussion of data collection, Harris emphasizes that several designs can be used to facilitate a vertical scale, such as a scaling test, common items, or single group to scaling test. (p. 241)
In all of these methods, however, there must be some form of overlapping: at least some students in concurrent grades must have at least some common items on their tests. And yet students in different grades still take tests that differ in form and content; Patz (2007) describes the process of merging their results into a common scale as linking (p. 6). He notes, however, that there is a price to be paid for linking: “In particular, since vertical links provide for weaker comparability than equating, the strength of the validity of interpretations that rest on the vertical links between test forms is weaker.” (p. 16)
So even if the CREDO studies used assessments that were vertically scaled, the authors would have to acknowledge that the validity of their effect sizes was at least somewhat compromised compared to effect sizes derived from other assessments. In this case, however, the point is moot: it appears that many of the assessments used by CREDO are not vertically scaled, which is a minimal requirement for making the case that effect sizes can be translated into fractions of a year’s worth of learning. The authors are, therefore, presenting their results in a metric that has not been validated and could be misleading.
I use this small but important example to illustrate a larger point: when influential education policy research neglects to validate the use of assessments, it may lead stakeholders to conclusions that cannot be justified. In the case of the CREDO reports, avoiding a validity argument for presenting effect sizes in “days of learning” has led to media reports on the effects of charter schools and policy decisions regarding charter proliferation that are based on conclusions that have not been validated. That is not to say these decisions are necessarily harmful; rather, that they are based on a reporting of the effects of charter schools that avoided having to make an argument for the validity of using test scores.
Center for Research on Education Outcomes (CREDO) (2015). Urban Charter School Study Report on 41 Regions. Palo Alto, CA: Center for Research on Education Outcomes (CREDO), Stanford University. Retrieved from: http://urbancharters.stanford.edu/summary.php
Hanushek, E. A. (1992). The trade-off between child quantity and quality. Journal of Political Economy, 100(1), 84-117.
Hanushek, E. A., & Rivkin, S. G. (2006). Teacher quality. Handbook of the Economics of Education, 2, 1051-1078.
Hanushek, E. A., Peterson, P. E., & Woessmann, L. (2012). Achievement Growth: International and US State Trends in Student Performance. PEPG Report No.: 12-03. Program on Education Policy and Governance, Harvard University.
Harris, D. J. (2007). Practical issues in vertical scaling. In Linking and aligning scores and scales (pp. 233-251). New York: Springer.
Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1-73.
Raymond, M. E., Woodworth, J. L., Cremata, E., Davis, D., Dickey, K., Lawyer, K., & Negassi, Y. (2013). National Charter School Study 2013. Palo Alto, CA: Center for Research on Education Outcomes (CREDO), Stanford University. Retrieved from: http://credo.stanford.edu/research-reports.html