Code Acts in Education: Enumerating AI Effects in Education
Over recent weeks, controversy has surfaced over research publications claiming to show statistical evidence that generative AI has beneficial effects on learning. Two recent meta-analyses of ChatGPT effects and a research article published by the World Bank claimed to show that AI leads to measurable learning improvements. Critics have begun countering that these studies are methodologically flawed, overhyped and misleading in their conclusions.
Before we get into the critiques, it’s worth looking at the reception of these publications. The studies were all circulated online with the support of a number of big-follower accounts.
For instance, the most recent meta-analysis—titled “The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking: insights from a meta-analysis”—was published on 6 May. At the time of writing (28 May) it has an Altmetric score of 365 and has been accessed 386,000 times. Its Altmetric score consists of almost 200 mentions on Xitter, another 150 on Bluesky, 13 news mentions, and a handful of Reddit posts. (Who knows how high it would be if LinkedIn posts counted?)
The previous meta-analysis—titled “Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies”—has already been cited 42 times in other published research articles (according to Google Scholar), even though it was only published in December 2024. It’s included in the references of the more recent meta-analysis. That’s a phenomenal rate of initial article reception, citation, and recirculation of findings in subsequent publications.
The most recent publication raising current concern is a World Bank study—“From Chalkboards to Chatbots: Evaluating the Impact of Generative AI on Learning Outcomes in Nigeria”—that reports learning improvements from an after-school tutoring program using Microsoft Copilot. It was widely circulated online as evidence that “AI tutoring” has a significant impact on learning gains after only a short intervention. When the results were previewed in a World Bank blog post earlier in the year, they led to dramatic headlines such as “AI tutoring helps Nigerian students gain two years of learning in six weeks.”
At the same time, a Hechinger Report article summarized two other studies showing the deleterious effects of AI on learning processes. AI, it suggests, diminishes critical thinking and leads to “cognitive offloading” and “metacognitive laziness.”
But this evidence of “harms” caused by AI may be undermined by the same methodological problems as the studies purporting to show causal evidence of improvement effects. The underlying problem is a current desperation to show the causal “effects” of AI in education—whether good or bad—which is leading to a rush of studies that immediately gather huge public and media attention despite their significant methodological shortcomings and limitations.
In a concise response to this situation, Tim Fawns put the problem well:
We are really, really bad at this kind of research and way too accepting of it. AI tutoring doesn’t have an effect on learning—it depends on context, purposes, methods and how it is situated with wider learning practice, processes, journeys. AI doesn’t cause “learning loss” or “cognitive offloading”—how thinking is differently performed and distributed depends on those same things. … [T]rying to isolate effects of a technology from the practices in which it is enacted leads to overly simplistic views which are not a good basis for action.
Clearly there is a big problem with the current circulation of evidence regarding AI in education, and it is quickly becoming a controversial topic that’s attracting critical attention. One part of the problem is scientific practice in the field itself; the other is how easily, quickly and widely the statistical findings can travel.
Fast science
The field of AI in education research has a long history, established methods, and scholarly journals through which its results and claims are assessed. But the recent rush of research and results exceeds this field and its standards. This means research of extremely dubious quality and provenance is being published with industrial frequency. The meta-analyses mentioned above, for example, identified thousands of candidate articles published just in the 2.5 years since ChatGPT was released.
Commenting on these meta studies, Ilkka Tuomi highlighted that the most recent did not even apply any clear quality control standards. As a result, its analysis included previous studies that may not have been rigorously peer-reviewed and that appeared in low-rated journals.
This points to a growing issue with AIED research: weak evidentiary standards. It is, to be blunt, relatively easy to publish AIED studies in journals with low-quality peer review and high-speed editorial and publishing processes. That allows papers of dubious quality to get captured in the search process for a meta-analysis and then bundled up as conclusive evidence in published meta studies in more reputable journals. The aggregated results and headline findings then get circulated by big social media accounts.
And the quality really is questionable. In a lengthy, in-depth methodological assessment of “The Good, the Bad, and the Ugly Science of AI in Education,” Wess Trabelski found that much of this literature features experimental design flaws that invalidate its findings, predictable results, and entirely speculative, overblown conclusions unsupported by the data, and that some studies even “represent flagrant violations of scientific integrity that somehow slipped through the supposedly rigorous peer review process.”
The problem is especially acute when AIED meta studies are published in what are seen as “gold standard” publications, despite the questionable provenance of the underlying evidence they report on. “Garbage in, gold out” as Ilkka Tuomi phrased it. “Junk science” transforms into a seemingly credible part of the evidence base when it is packaged up with dozens of similar studies in a high-ranked journal and circulated as aggregate causal proof that AI affects learning.
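To make the “garbage in, gold out” point concrete, here is a minimal sketch, with entirely invented numbers, of inverse-variance pooling, the basic operation behind most meta-analytic effect estimates. If the inputs include inflated effects from weakly reviewed studies, the pooled “headline” effect inherits that inflation, whatever the reputation of the journal the meta-analysis itself appears in.

```python
# A minimal fixed-effect pooling sketch with invented numbers: each entry
# is (effect size d, standard error) for a hypothetical primary study.
studies = [
    (0.9, 0.20),   # small, weakly reviewed study reporting a large effect
    (1.1, 0.20),   # another small study, plausibly inflated
    (0.2, 0.10),   # larger, better-controlled study with a modest effect
]

weights = [1 / se**2 for _, se in studies]   # inverse-variance weights
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"pooled d = {pooled:.2f} (SE {pooled_se:.2f})")
# With only the better-controlled study, the estimate would be d = 0.20;
# folding in the two weak studies more than doubles it. No quality filter
# is applied anywhere in this calculation.
```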
As field leaders in AIED research have themselves noted, much of this “fast science” on AI effects lacks validity and fails to separate instructional methods from technical affordances.
Enumerating AIED
The problem with the meta studies, then, is that the underlying papers on which the analyses are based should at best be treated with caution. This also became apparent with the publication of the World Bank study of AI tutoring. Despite its rapid circulation online, a number of methodological critics have cast serious doubt on its findings.
According to Betsy Wolf, its intervention effect size calculations appear compelling but are fundamentally flawed. One of its outcome measures (“knowledge of AI and digital skills”), for example, is “overaligned” with Copilot itself: “you would expect Copilot students to have more familiarity with AI.” It also misleadingly equates the effects of the six-week study itself with “annual learning gain.”
Michael Pershan additionally argued that there are “some big, glaring issues with their experiment that undermine their splashy result in obvious ways.” The primary issue is that the study compared a group of students who received additional tutoring—including with MS Copilot—while a “control group” received nothing in addition to their usual classes. In other words, the treatment group received significantly more tuition than the others.
“All this study is capable of showing,” Pershan argued as a result, is “that the program wasn’t literally a waste of time”:
Because the control group, academically speaking, didn’t do anything. I mean presumably they did lots of things—played soccer, hung out with friends, cooked food, whatever. But fundamentally the after-school group studied more and the control group did not.
On this basis, all the study can really be said to show is that additional tutoring improves short-term learning outcomes relative to routine school attendance alone. It says very little about the effects of AI tutoring at all, and still less about the effects of AI on long-term learning gains (which was the headline finding shared virally online). Even an appreciative evaluation of the project noted that “projecting from six weeks to full academic years requires enormous leaps that go beyond what the data can support.” The study still spread widely online on its release thanks to the big learning-gain headline.
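To see how little arithmetic sits behind such headlines, here is a back-of-the-envelope sketch, with entirely hypothetical numbers, of the kind of conversion that turns a short-term effect size into a “years of learning” claim. The benchmark figure for typical annual progress is an assumption for illustration, not a value taken from the World Bank study.

```python
# A hedged sketch (all numbers hypothetical) of converting a short
# intervention's effect size into a "years of learning" headline.
# The arithmetic is trivial; the leap lies in treating a six-week effect,
# produced by extra tutoring time, as if it scaled linearly into school
# years of ordinary instruction.

effect_size_sd = 0.3    # hypothetical treatment-group gain, in standard deviations
annual_gain_sd = 0.15   # hypothetical benchmark: typical progress per school year, in SDs

years_of_learning = effect_size_sd / annual_gain_sd
print(f"headline: {years_of_learning:.1f} years of learning in six weeks")

# Nothing in this division separates "had an AI tutor" from "received
# extra tutoring at all", and nothing licenses projecting a six-week
# effect onto whole academic years.
```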
Taking a step back from the methodological minutiae, a more sociological read of this episode may be helpful. It reveals something of the social power of statistics and their circulation.
Ultimately, what we are looking at here is a rapid proliferation of mostly small-scale, localized, context-dependent studies of AI in education. These have then been generalized beyond what the data can accurately support. The results are now routinely shared online as statistical proof that AI improves “learning performance” or “learning outcomes” or similar.
It’s a good example of how authority has come to be associated with numbers and their presumed objectivity. There is great “trust in numbers” as sociologists of quantification would say. Numbers are taken to be inherently truthful, evidenced by “effect sizes” and “statistical significance”.
Many recent claims about AI in education depend on such faith in quantification and the precise statistical methods underpinning it. The World Bank study, for instance, reported “twice the effect of some of the most effective interventions in education.” Such numerical claims appear objective, authoritative, and trustworthy. The enumeration of AIED effects is consequential in shaping opinion and potentially influencing political decisions concerning AI in education.
Viral science
Numbers, though, are always in reality social and technical accomplishments. Choices are made at every stage about what to count, how to analyze, with what calculating devices, in what contexts, and under what conditions. Social factors are involved in the production of statistics all the way from inception, funding, design, data collection, analysis and publication through to contexts of circulation, reception, and commentary.
Part of the power of numbers, additionally, is that they are easily transportable. They can travel far beyond the contexts in which they were generated, and the contexts they “represent” or “reflect,” whether in their original numerical format or as graphs and tables.
This is significant since numbers can be easily transported and put forward as proof that reflects a reality, with all the contextual and sociotechnical contingencies of their production hidden away.
Such a process was evident, for example, in how the recent World Bank study was reported with a simple graph of the findings, reproduced by many who posted it online, and in the way the statistical effect size was presented as objective evidence of the effectiveness of the Copilot intervention. The actual contextual factors underpinning the study, and the details of its implementation—particularly its control group—were conveniently backgrounded, especially as the study’s headline findings circulated on social media.
In other words, the widely-reported statistics on the effects of AI had to be made, interpreted, de-contextualized, hyped-up, universalized, and were then made portable on platforms that promote virality. There’s not necessarily anything deceptive or sinister about this, but it certainly confers personal, organizational, reputational and citational advantage on those who can lay claim to producing “policy relevant” quantitative evidence.
It will not be surprising, in only a few months’ time, to find that the results of the World Bank study, as well as the meta studies, have been packaged up in policy reports, accessed thousands more times, and cited as evidence of the impact of AI on educational outcomes. The enumeration and online virality of AIED effects is likely to give these studies considerable social and political force, particularly in contexts that purport to value “evidence-based policy” that “follows the science.”
Then this seemingly objective, quantified evidence could be used to support policy plans in contexts thousands of miles away and in completely different social, political and economic situations to those of the original site of enumeration.
Such “fast” evidence production and circulation stands in distinct contrast to recent qualitative social science accounts of AI in education that foreground the contextual and subjective factors affecting students’ uses of and sensemaking with AI applications. But this slower, more cautious, critical, contextualized, and less generalizable science is less likely to travel and gain big attention scores than highly portable statistical results. That’s despite it offering more up-close insights into the ways AI is weaving into educational practices and learning processes.
Perhaps an important strand of social science research could also explore the social and technical production of AIED evidence in depth, and help better understand all the factors that affect how such evidence is produced, circulated, and received in a social context that values and overhypes viral science.