2016 Bunkum Award Honoree:

Center for American Progress for Lessons From State Performance on NAEP

The 2016 Bunkum Award for shoddy research goes to the Center for American Progress (CAP) for Lessons From State Performance on NAEP: Why Some High-Poverty Students Score Better than Others, authored by Ulrich Boser and Catherine Brown.

The CAP report is based on a correlational study with the key finding that high standards increase learning for high-poverty students. The researchers compared changes in states’ test scores for low-income students to changes in those states’ standards-based policy measures as judged by the researchers. They concluded that high standards lead to higher test scores and that states should adopt and implement the Common Core.

Alas, there was much less than met the eye.

In choosing the worst from among the many “worthy” contenders, the evaluation criteria applied to the year’s reports were drawn from two separate short papers entitled Five Simple Steps to Reading Research and Reading Qualitative Educational Policy Research.

 

Here’s how the CAP report scored:

  • Was the design appropriate?   No: The design was not sensitive, so they tossed in “anecdotes” and “impressions.”

The apparent purpose of the paper was to advocate for standards-based reform, particularly for the Common Core State Standards, by demonstrating a correlational association between better NAEP scores and states with stronger and better standards and assessments. The data could do little to support this conclusion, so the report largely relied on evidence the authors repeatedly acknowledged as “anecdotal.”

  • Were the methods clearly explained?   No: The methods section is incomplete and opaque.

The report claims the authors studied five two-year implementation cycles, from 2003 to 2013, but the results from these time periods were not presented. The reader cannot tell whether the data were missing, were mushed together across years, were too weak to present, showed a spike with Common Core implementation, or something else. The authors apparently used only one aggregated “policy implementation score,” which was derived from “norming” each of three “categories” and then averaging the three categories together. The categories were derived from the 14 scaled “indicators.” Apparently they regressed this policy implementation score against NAEP scores. While there exist useful measures of scale integrity, the report includes no analysis of the effect of the de facto amalgamation of 14 different variables into one. A finance measure was included but had no statistical effect on the outcomes; why this measure was reported is not clear. Overall, the methods presentation omitted critical data.
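To make the aggregation concern concrete, here is a minimal sketch of how several indicators might be standardized, averaged into categories, and then averaged into one composite score. It is an illustration under stated assumptions, not the report's actual procedure, which was not published. The function names, the toy data, and the choice of Cronbach's alpha as one example of a scale-integrity measure are all assumptions made here.

```python
from statistics import mean, pstdev, pvariance

def z_scores(values):
    """Standardize one indicator across states (mean 0, SD 1)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def composite_score(categories):
    """Average standardized indicators within each category, then average
    the category scores. `categories` is a list of categories; each category
    is a list of indicators; each indicator is a list of per-state values."""
    category_scores = []
    for indicators in categories:
        normed = [z_scores(ind) for ind in indicators]
        # zip(*normed) yields, for each state, its values on every indicator
        category_scores.append([mean(state) for state in zip(*normed)])
    return [mean(state) for state in zip(*category_scores)]

def cronbach_alpha(items):
    """Cronbach's alpha for a list of indicators (each a per-state list).
    Low values suggest the indicators do not form one coherent scale."""
    k = len(items)
    totals = [sum(state) for state in zip(*items)]
    item_variance = sum(pvariance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance / pvariance(totals))

# Hypothetical data: 5 states, 2 categories with 2 indicators each.
# All values are invented for illustration only.
categories = [
    [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5]],
    [[10, 30, 20, 40, 50], [0, 1, 0, 1, 1]],
]
scores = composite_score(categories)
standardized = [z_scores(ind) for cat in categories for ind in cat]
print([round(s, 2) for s in scores])
print(round(cronbach_alpha(standardized), 2))
```

A report that collapses many indicators into one score would normally present a check of this kind, so that readers can judge whether the composite measures a single construct.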

  • Were the data sources appropriate?   No: The variables used were inadequate and were aggregated in unclear ways.

The report’s goal was to isolate and determine the relationship between state standards and student achievement. But test scores in, say, Massachusetts, Kentucky and Michigan likely differ for many reasons beyond the particular standards that each state adopted, and the study does not control for the vast majority of those reasons. The study also includes no measure of the quality or fidelity of the implementation of the standards themselves.

  • Were the data gathered of sufficient quality and quantity?   No: The report uses just state-level NAEP scores and summary data.

This was a correlational study of convenience measures adopted from an Education Week publication. Without knowing more about other factors potentially affecting NAEP test scores in each state, and without knowing about the implementation process for each state’s standards, it is difficult to see how data about the simple act of a state adopting standards are sufficient. Readers are asked to accept the authors’ conclusion that “rigorous” implementation of standards was effective for the favored states. But even if a reader were to accept this premise, “rigor” was never directly addressed in the study.

  • Were the statistical analyses appropriate?   No: A sample of just 50 cases is too small for multiple regression.

Conducting multiple regressions with 50 cases is not an appropriate methodological approach. Not surprisingly, the resulting effect sizes were quite small. The authors acknowledge—even while they don’t restrain their claims or conclusions—that this analysis is “anecdotal” and “impressionistic.” For example:

While there is an important debate over the definition of standards-based reform—and this analysis is undoubtedly anecdotal and impressionistic—it appears clear that states that have not embraced the approach have shown less success, while more reform-oriented states have shown higher gains over the long term. (p. 2)

  • Were the analyses properly executed?   Cannot be determined: The full results were not presented.

The authors added together the 14 state policy variables of interest, and they then regressed this “change in policy implementation score” against NAEP scores from the previous two years. Because the report did not include specific results (for example, the multiple R’s, a correlation matrix, or the 14 predictor variables) or explain why (or how) the variables were weighted and added together, the reader cannot tell whether the analyses were properly executed.

  • Was the literature review thorough and unbiased?   No: The report largely neglected peer-reviewed research.

Of the 55 references, only one was clearly peer-reviewed. Despite a rich literature on which the authors could have drawn, the report’s literature review over-relied on (and in many ways attempted to replicate) a single non-peer-reviewed source from 2006.

  • Were the effect sizes strong enough to be meaningful?   Effect sizes were not presented, and the claims are based on the generally unacceptable 0.10 significance level.

Although effect sizes can be estimated from correlations (e.g., Cohen’s d), only the results from one of the five two-year contrast panels were reported. The single table, which appears in the report’s appendix, purports to show a small relationship between the standards policies and NAEP scores, but this relationship is significant only at the 0.10 level, and even then only for 4th-grade math and 8th-grade reading, not for 8th-grade math and 4th-grade reading, where the weak relationships are negative. It is generally not acceptable to claim significance at the 0.10 level.
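As a rough illustration of the two points above, the sketch below converts a correlation into Cohen’s d using a common conversion formula, and shows how large a correlation must be to reach significance with 50 cases. It assumes a simple bivariate correlation with n - 2 degrees of freedom, which is a simplification of the report’s regression models. The function names are hypothetical, and the critical t values are standard two-tailed table values for 48 degrees of freedom.

```python
from math import sqrt

def cohens_d_from_r(r):
    """Common conversion from a correlation r to Cohen's d."""
    return 2 * r / sqrt(1 - r ** 2)

def t_from_r(r, n):
    """t statistic for testing a correlation r from n cases (df = n - 2)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def critical_r(t_crit, n):
    """Smallest |r| that reaches significance, given the critical t value."""
    df = n - 2
    return t_crit / sqrt(t_crit ** 2 + df)

# Two-tailed critical t values for df = 48, taken from a standard t table.
T_CRIT_10 = 1.677  # alpha = 0.10
T_CRIT_05 = 2.011  # alpha = 0.05

N_STATES = 50
r_needed_10 = critical_r(T_CRIT_10, N_STATES)
r_needed_05 = critical_r(T_CRIT_05, N_STATES)
print(round(r_needed_10, 3), round(r_needed_05, 3))
print(round(cohens_d_from_r(r_needed_10), 3))
```

Under these assumptions, a correlation of roughly 0.24 clears the 0.10 threshold with 50 cases, while roughly 0.28 is needed at the conventional 0.05 level. A result that passes only the looser test therefore sits in a narrow band of weak correlations.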

  • Were the recommendations supported by strong evidence?   No: Their conclusion is based on weak correlations.

Despite the authors’ claim that “Our findings suggest that there is clear evidence that standards-based reform works, particularly when it comes to the needs of low-income students,” an objective reader of these data and analyses could easily come to exactly the opposite conclusion: that there is no demonstrated relationship.

The fundamental flaw in this report is simply that it uses inadequate data and analyses to make a broad policy recommendation in support of the Common Core State Standards. A reader may or may not agree with the authors’ conclusion that “states should continue their commitment to the Common Core’s full implementation and aligned assessments.” But that conclusion cannot and should not be based on the flimsy analyses and anecdotes presented in the report.

 

Bunkum Background

Many organizations publish reports they call research. But what does this mean? These reports often are published without having first been reviewed by independent experts — the “peer review” process commonly used for academic research.

Even worse, many think tank reports subordinate research to the goal of making arguments for policies that reflect the ideology of the sponsoring organization.

Yet, while they may provide little or no value as research, advocacy reports can be very effective for a different purpose: they can influence policy because they are often aggressively promoted to the media and policymakers.

To help the public determine which elements of think tank reports are based on sound social science, NEPC’s “Think Twice” Project has, every year for the past decade, asked independent experts to assess these reports’ strengths and weaknesses.

The results have been interesting. Some advocacy reports have been found by experts to be sound and useful, but most are found to have little if any scientific merit.

At the end of each year, the editors at NEPC sift through the think tank reports that have been reviewed to identify the worst offender.

We then award the organization publishing that report NEPC’s Bunkum Award for shoddy research.

 

Find the report at:
Boser, U., & Brown, C. (2016, January 14). Lessons From State Performance on NAEP: Why Some High-Poverty Students Score Better than Others. Washington, DC: Center for American Progress. Available online at https://cdn.americanprogress.org/wp-content/uploads/2015/12/23090515/NAEPandCommonCore.pdf

Find the review at:
Nichols, S.L. (2016). Review of “Lessons From State Performance on NAEP: Why Some High-Poverty Students Score Better Than Others.” Boulder, CO: National Education Policy Center. Available online at http://nepc.colorado.edu/thinktank/review-CAP-standards