Code Acts in Education: The Problem With Evidence Production on AI in Education
The research field of artificial intelligence in education (AIED) has a long history of investigating the effects of AI on learning and other educational outcomes. Calls for more evidence on AI in education have grown louder since the arrival of generative AI. The result is that countless AIED studies have been completed or are underway to find statistically significant evidence of its effects. These efforts to enumerate the effects of AI, however, vary enormously in quality.
The quality problem in AIED research was illustrated recently by the retraction of a paper published by Springer Nature in 2025 due to concerns about its integrity. The article, “The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking,” was presented as a meta-review of the existing literature, and concluded with a series of headline-grabbing claims about the benefits of ChatGPT for students. Those claims, however, were heavily disputed, leading eventually to the article’s retraction – though not before the paper had circulated widely on social media and been cited hundreds of times in subsequent research papers.
The retraction of one paper alone may not address what appears to be a wider quality control and rigour problem in AIED measurement research. Several recent critical publications demonstrate that the literature on the effects of AI in education is now contaminated with a significant quantity of poor-quality quantitative research. This is a major problem at a time when policymakers, educational specialists and the public alike are all eager for evidence of whether AI helps improve educational outcomes or not.
Meta-reviewing junk science
The dominant form of research about AI in education is the media comparison study, which sets out to measure AI effects on learning outcomes in comparison to a control group. Such studies are concerned with the statistical measurement of student performance and with the efficacy of a specific technology to improve learning outcomes. These are the kinds of experimental studies that later get packaged up in meta-analyses. Originating in medical research, meta-analyses are supposed to be a “gold standard” evidence synthesis method. The purpose of such statistical meta-studies is to synthesize large quantities of data from existing studies into an aggregated and summarized form, in ways that help confer greater significance and salience on the findings.
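To make the aggregation step concrete, here is a minimal sketch of one common pooling technique, a fixed-effect inverse-variance weighted average. The effect sizes and standard errors are hypothetical numbers chosen for illustration, and the function name is an assumption; none of this is taken from the studies discussed here.

```python
import math

def pool_fixed_effect(effects, std_errors):
    """Inverse-variance weighted mean of study effect sizes, with its standard error."""
    # More precise studies (smaller standard error) receive larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Hypothetical standardized mean differences and standard errors for three studies.
effects = [0.2, 0.5, 0.8]
std_errors = [0.1, 0.2, 0.4]

pooled, pooled_se = pool_fixed_effect(effects, std_errors)
low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect = {pooled:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

The pooled interval is narrower than that of any single study, which is how aggregation lends findings more apparent weight. That extra weight is only warranted if the underlying studies measure the same thing under comparable conditions.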
The field of AIED produces evidence synthesis studies such as meta-analyses at an industrial rate and scale, though questions have long been asked about their rigour and integrity. In a new paper just out in Computers & Education, Alyssa Lawson and colleagues examined two meta-analyses claiming that ChatGPT has beneficial effects on student performance. (One of these was the study retracted by Springer Nature, though the authors’ analysis was completed before the retraction.) Specifically, they analyzed the research designs and results in the underlying studies that were aggregated by the two reviews.
They found the review results were seriously undermined when they examined both “the extent to which ChatGPT and control conditions were comparable on their instructional features,” and “the percentage of comparisons that did or did not involve comparable conditions across various direct learning outcomes” in the reviewed literature. Most of that underpinning research, they concluded, failed to properly compare experimental and control conditions, and the reviews mixed up a variety of measurements, often with small samples or missing information, as what they claimed was evidence of “learning outcomes.”
As a result, the researchers argued that the existing literature does not support causal claims that ChatGPT is effective in improving learning outcomes:
“The reviewed evidence base does not seem to support unequivocal claims about the causal effectiveness of ChatGPT in education. While the reviewed studies often report statistically significant performance gains in ChatGPT conditions, the pervasive lack of methodological control and lack of detailed instructional information makes it difficult to know whether these changes should be attributed to the technology itself or to the instructional practices being used alongside ChatGPT.”
While the analysis only looked at two meta-reviews, it revealed wider issues with the state of AIED media comparison research, with the authors arguing for improvements in “the methodological standards in the emerging field of generative AI in education.”
Meta-meta-analyzing AIED
These methodological concerns are reflected in another recent review of the literature about the effects of AI on learning. Working with a much larger sample of published meta-studies, it casts serious doubt on the quality, reliability and validity of many more of these evidence syntheses and their findings.
The paper by František Bartoš and coauthors, “Effect of artificial intelligence on learning: a meta-meta-analysis,” published on the OSF pre-print server (which means it’s not peer reviewed yet), reports on a study that explored the results from almost 70 “meta-analyses” of AI effects on learning. Most of these meta-analyses conclude that AI has statistically significant positive effects on learning outcomes. The purpose of the “meta-meta-analysis” was to assess these findings. Its results suggest these claims about AI are seriously misleading.
“In summary, our results highlight severe issues in the literature on the effect of AI/LLMs on learning. The results are plagued by publication bias and display extreme between-study heterogeneity. While we believe it is a priori plausible that AI/LLMs may have a positive impact on learning, the current empirical evidence base is insufficiently diagnostic and does not warrant concrete recommendations for educational practice or policy.”
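For readers unfamiliar with the term, between-study heterogeneity is often summarized with Cochran's Q and the I² statistic, which estimates the share of variation in reported effects that is not explained by sampling error alone. The sketch below uses hypothetical effect sizes and an assumed function name to show the calculation; it is not the procedure or the data used in the meta-meta-analysis.

```python
def heterogeneity(effects, std_errors):
    """Return Cochran's Q and the I-squared statistic (as a percentage)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    # Q sums the weighted squared deviations of each study from the pooled effect.
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    df = len(effects) - 1
    # I-squared is the proportion of Q in excess of its degrees of freedom, floored at zero.
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i_squared

# Hypothetical studies that report very different effects with equal precision.
effects = [0.1, 0.9, 0.3, 1.4]
std_errors = [0.1, 0.1, 0.1, 0.1]

q, i_squared = heterogeneity(effects, std_errors)
print(f"Q = {q:.2f}, I^2 = {i_squared:.1f}%")
```

When I² approaches 100%, as it does with these made-up numbers, the studies disagree far more than chance would predict, and a single pooled average says little about what any particular classroom use of AI would achieve.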
Additionally, a further forensic analysis of a sample of AIED meta-analyses by Patrick O’Neill (also just posted as a pre-print on OSF) reported that “none of the examined meta-analyses provided a valid basis for the claims they advanced: none had a coherent construct, none sufficiently assessed publication bias, and all had severe heterogeneity.”
In the sample overall, he argued, “statistics were misapplied, miscalculated, and misinterpreted.” O’Neill’s conclusions in the paper are blunt: “methodological failure is chronic” in the AIED literature; the “flawed evidence” produced by such studies circulates through reputable journals and social media networks to appear far more reliable, valid and accurate than it is; and a “cascade of errors” can then become the basis of policy recommendations.
“These failures are consequential in a policy environment where AI adoption is increasingly treated as urgent and inevitable. The audited meta-analyses are heavily cited…. These findings show how AIED meta-analysis can manufacture certainty from an evidence base that is heterogeneous, positively selected, and weakly vetted.”
In other words, according to these critical assessments of AIED meta-analyses, researchers have tended to select for positive evidence and have aggregated results despite wide variation in the original measures recorded. This has led to inflated results and misplaced claims about the strength of the evidence base on AI’s effects on learning.
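A toy illustration shows how selecting for positive evidence inflates an aggregate. It assumes, purely for the sake of the example, that only studies with a statistically significant positive result (z above 1.96) get published or picked up by a review. The study values and function name are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def significant_positive(effects, std_errors, z_cutoff=1.96):
    """Keep only effects whose z-score (effect / standard error) exceeds the cutoff."""
    return [d for d, se in zip(effects, std_errors) if d / se > z_cutoff]

# Hypothetical results from six studies of the same intervention, equal precision.
effects = [-0.2, 0.0, 0.1, 0.3, 0.5, 0.6]
std_errors = [0.2] * len(effects)

all_mean = mean(effects)
selected = significant_positive(effects, std_errors)
selected_mean = mean(selected)

print(f"mean of all studies:      {all_mean:.2f}")
print(f"mean of selected studies: {selected_mean:.2f} ({len(selected)} of {len(effects)} kept)")
```

With these numbers, the average of the selected studies is more than double the average of all six, even though no individual result has been altered. Selection alone is enough to produce the distortion.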
These serious weaknesses and failures are not only found in a small number of outlier articles, but appear to characterize a significant portion of the available literature, including that published in highly-ranked journals by reputable publishers. Collectively, the literature manufactures the idea that an AIED evidence base has been established and that the research is appropriate to inform policy and practice.
These recent critical studies of AIED meta-analyses, then, suggest that the quality and reliability of evidence about AI effects in education are currently questionable, casting doubt on the integrity, evidence standards and methodological rigour of much of the literature. They provide empirical confirmation of repeated critiques of the evidence quality and methodological standards underpinning meta-studies of AI in education that have been produced in recent years.
Helen Beetham, for example, has previously written a critical assessment of the production of research around AI in education:
“Researchers have to get in quick to catch the wave, and to do that, they have to hand some agency over to quantitative technologies and techniques. And numbers tell their own story. Simpler than concerns about ethics. More direct than nuanced and necessarily circumspect judgements about the quality of other people’s research.”
Rather than AI evidence quality being a problem limited only to a few studies, Beetham suggests, it is endemic as AIED researchers have rushed to synthesize the existing research, often using automated tools to do so at speed while overlooking critical questions of quality and ethics in the process.
Even AIED field insiders characterize much of this kind of research as “fast science” that lacks rigorous design and frequently makes unfounded claims.
“As with technologies of the past, the allure of new technologies can lead to hasty research and unsound implementation, potentially undermining learning experiences for countless students globally. Educational technology researchers are tasked with identifying conditions of success and instructional pitfalls through research that holds up in the face of scrutiny. Without careful, deliberate research, we cannot hope to speak to policy, nor can we improve the lives of our students.”
The recent retraction of the ChatGPT effects paper is an encouraging sign that overinflated claims and shoddy evidence practices are now being questioned and challenged. However, the critical analyses of the state of the AIED literature clearly indicate that major challenges remain. And, as Ilkka Tuomi has recently shown, other meta-analyses of AI in education continue to be published by reputable publishers with similar methodological problems.
Big tech AIED evidence making
The evidence problem in AIED is not just with one rotten paper, then, but appears to be field-wide. This systemic problem of fast science means that much AIED effects measurement research remains seriously inadequate for any policy or practice discussions or decisions.
The likely effect is that academic AIED will be sidelined as a serious scientific source of policy-relevant evidence. Other evidence-producing bodies will step in with claims of rigour and reliability, and those bodies will most likely be big tech companies. Already, Google is conducting randomized controlled trials of its Gemini-based AI tutors in schools in Sierra Leone and the UK. OpenAI has developed a “measurement suite” to measure the learning outcomes of its Study Mode application of ChatGPT.
Academic AIED should have proven itself a trusted source of research, with evidence-based insights and advice to help inform urgent decision-making in the years since generative AI went mainstream. It has, for the most part, proven inadequate to that challenge, as even field insiders concede. This is a serious problem going forwards since the evidence base now appears completely contaminated with low-quality research that should not be trusted as a source for policy or practice application.
The result is that big tech may become the main source of trusted evidence on AI in education. And that evidence is likely to circulate widely, supported by persuasive commercial PR and marketing, and has the potential to seriously shape perceptions and decision-making in the education sector.
This blog post has been shared by permission from the author.
The views expressed by the blogger are not necessarily those of NEPC.