Artificial intelligence tools show significant limitations in classifying the cognitive demand of mathematics tasks, a critical function for curriculum adaptation, raising serious questions about their immediate utility in teacher planning and professional development. A new study reveals that both general-purpose and education-specific AI models struggle with this nuanced educational judgment, often producing confidently worded but incorrect analyses that could undermine instructional quality if educators rely on them.
Key Takeaways
- Eleven AI tools—six general-purpose and five education-specific—were tested on classifying math tasks by cognitive demand, achieving an average accuracy of only 63%.
- No tool exceeded 83% accuracy, and education-specific tools performed no better than general-purpose models like ChatGPT or Claude.
- All tools exhibited a systematic bias, struggling most with tasks at the extremes (Memorization and Doing Mathematics) and defaulting to the middle categories (Procedures With and Without Connections).
- Error analysis revealed tools overweight surface textual features and fail to correctly reason about the multiple aspects that determine a task's true cognitive demand.
- The tools often generate plausible-sounding explanations for their misclassifications, a significant risk for novice teachers who may accept flawed AI analysis.
Evaluating AI's Ability to Classify Mathematical Cognitive Demand
The study, detailed in the preprint arXiv:2603.03512v1, directly addresses a pressing need in education: helping teachers efficiently adapt mathematics curricula to meet individual student needs while maintaining rigorous cognitive demand. Researchers evaluated eleven prominent AI tools on their ability to categorize mathematical tasks using an established, research-based framework with four levels of cognitive demand: Memorization, Procedures Without Connections, Procedures With Connections, and Doing Mathematics.
The AI cohort included six general-purpose models: ChatGPT (OpenAI), Claude (Anthropic), DeepSeek, Gemini (Google), Grok (xAI), and Perplexity. It also included five tools marketed for education: Brisk, Coteach AI, Khanmigo (Khan Academy), Magic School, and School.AI. The goal was to approximate the performance a teacher would get using straightforward, practical prompts, rather than highly engineered ones.
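The paper does not reproduce its exact wording here, but a straightforward, teacher-style classification prompt of the kind the researchers describe might look like the sketch below. The prompt text, model name, and example task are illustrative assumptions, not the study's materials.

```python
# Illustrative sketch only: a plain, practical classification prompt.
# The prompt wording, model choice, and task text are assumptions for
# illustration; they are not taken from the study.
from openai import OpenAI

LEVELS = [
    "Memorization",
    "Procedures Without Connections",
    "Procedures With Connections",
    "Doing Mathematics",
]

def classify_task(task_text: str, model: str = "gpt-4o") -> str:
    """Ask a general-purpose model to assign one of the four demand levels."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = (
        "Classify the cognitive demand of the following mathematics task "
        f"using exactly one of these levels: {', '.join(LEVELS)}.\n\n"
        f"Task: {task_text}\n\n"
        "Respond with the level name only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Example: a routine computation most expert raters would label
# "Procedures Without Connections".
print(classify_task("Use the standard algorithm to compute 348 x 27."))
```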
The results were underwhelming. The average accuracy across all tools was just 63%, with no single model surpassing 83%. Contrary to expectations, the education-specific tools showed no advantage over their general-purpose counterparts. A clear pattern of failure emerged: all tools performed poorly on tasks at the cognitive extremes—basic Memorization and high-level Doing Mathematics—while showing a systematic bias toward classifying tasks into the middle two procedural categories.
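The kind of error pattern the researchers report is straightforward to surface from a labeled test set by tallying accuracy per level and noting where misclassifications land. The sketch below shows one way to do that; the expert and predicted labels are made-up placeholders, not data from the study.

```python
# Minimal sketch of per-level accuracy and a confusion tally of the kind
# used to spot a drift toward the middle categories. The label lists are
# made-up placeholders, not the study's data.
from collections import Counter, defaultdict

LEVELS = ["Memorization", "Procedures Without Connections",
          "Procedures With Connections", "Doing Mathematics"]

def per_level_accuracy(expert, predicted):
    correct = defaultdict(int)
    total = defaultdict(int)
    confusion = Counter()
    for truth, guess in zip(expert, predicted):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
        else:
            confusion[(truth, guess)] += 1
    for level in LEVELS:
        if total[level]:
            print(f"{level}: {correct[level] / total[level]:.0%} "
                  f"({correct[level]}/{total[level]})")
    print("Most common confusions:", confusion.most_common(3))

# Hypothetical labels illustrating the reported pull toward the
# middle procedural categories.
expert = ["Memorization", "Doing Mathematics", "Procedures With Connections"]
predicted = ["Procedures Without Connections",
             "Procedures With Connections",
             "Procedures With Connections"]
per_level_accuracy(expert, predicted)
```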
Industry Context & Analysis
This study arrives amid a surge of investment and hype around AI in education. Companies like Khan Academy (with Khanmigo) and a host of startups have raised significant funding—Khan Academy's AI initiative is backed by a $10 million grant from Microsoft—promising to revolutionize lesson planning and personalized learning. However, this research provides a crucial reality check, demonstrating that core pedagogical reasoning remains a substantial challenge for current models.
The finding that education-specific AI tools performed no better than general models like ChatGPT or Claude is particularly damning. It suggests that simply fine-tuning a base model on educational text or marketing it to teachers does not confer an understanding of deep, domain-specific frameworks like cognitive demand classification. This contrasts with AI performance in other standardized evaluation domains. For instance, leading models like GPT-4 routinely achieve scores above 85% on broad knowledge benchmarks like MMLU (Massive Multitask Language Understanding), yet clearly falter when tasked with nuanced, context-dependent educational judgment.
The systematic bias toward middle categories and the generation of persuasive but incorrect explanations point to a fundamental technical limitation. Unlike retrieval-augmented generation (RAG) systems that can pull from verified sources, these tools are primarily performing pattern matching on their training data. They overweight keywords and surface structure (e.g., the presence of a "real-world context") while failing to holistically reason about the underlying cognitive processes required of a student, such as non-algorithmic thinking or the need for deep conceptual connections.
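To make that failure mode concrete, the deliberately naive heuristic below upgrades a task's classification simply because it mentions a real-world setting, regardless of the thinking the task actually requires. This is an illustration of surface-feature matching, not a claim about how any of the tested tools is implemented.

```python
# Deliberately naive illustration of surface-feature matching: "real-world"
# vocabulary bumps a task up a level even when the context is decorative.
# Not a description of any tool's internals.
SURFACE_CUES = ("store", "recipe", "garden", "price", "trip")

def naive_surface_classifier(task_text: str) -> str:
    text = task_text.lower()
    if any(cue in text for cue in SURFACE_CUES):
        return "Procedures With Connections"   # keyword suggests real-world context
    return "Procedures Without Connections"

# A routine unit-price computation: expert raters would likely still call
# this Procedures Without Connections, but the keyword cue upgrades it.
task = "A store sells 3 pencils for $0.75. Find the price of one pencil."
print(naive_surface_classifier(task))  # -> Procedures With Connections
```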
This follows a broader industry pattern where AI excels at generating content but struggles with reliable, consistent evaluation—a gap seen in areas like automated essay scoring and code review. The risk is acute in education, where a novice teacher might accept an AI's confident but flawed analysis, potentially leading to misaligned instruction that either bores or frustrates students.
What This Means Going Forward
For educators and school administrators, this study signals that current AI tools are not ready for unsupervised, high-stakes use in curriculum planning and task classification. They should be viewed as potential brainstorming aids or administrative assistants, not as pedagogical experts. Professional development for teachers must now include AI literacy components that focus on critically evaluating AI output, especially its tendency to sound convincing when wrong.
For EdTech companies and AI developers, the path forward requires moving beyond simple prompt interfaces. The research highlights the need for improved prompt engineering and, more fundamentally, for model architectures or training approaches that incorporate structured pedagogical frameworks directly. This could involve creating specialized "reasoning modules" for cognitive taxonomy or developing robust evaluation datasets co-created with expert teachers to fine-tune models more effectively.
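One incremental step in that direction is to embed the framework's criteria directly in the prompt and require the model to justify its choice against them, rather than asking for a bare label. The sketch below shows the idea; the rubric wording is a loose paraphrase for illustration, not the validated instrument, and nothing here is drawn from the study's own prompts.

```python
# Sketch of a rubric-grounded prompt: the model must check each level's
# criteria and explain its reasoning before committing to a label. The
# criteria text is a paraphrase for illustration, not the validated framework.
RUBRIC = {
    "Memorization": "reproducing facts, rules, or definitions from memory",
    "Procedures Without Connections": "applying a known algorithm with no "
                                      "attention to underlying meaning",
    "Procedures With Connections": "using a procedure while attending to "
                                   "the concepts it rests on",
    "Doing Mathematics": "non-algorithmic reasoning that requires exploring "
                         "and understanding mathematical relationships",
}

def build_rubric_prompt(task_text: str) -> str:
    criteria = "\n".join(f"- {level}: {desc}" for level, desc in RUBRIC.items())
    return (
        "You are classifying the cognitive demand of a mathematics task.\n"
        f"Levels and criteria:\n{criteria}\n\n"
        f"Task: {task_text}\n\n"
        "First explain which criteria the task does and does not meet, "
        "then state the single best-fitting level."
    )

print(build_rubric_prompt("Explain why the sum of two odd numbers is even."))
```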
The market will likely see a shakeout. Tools that are merely thin wrappers around general-purpose APIs will struggle to justify their value, while those that invest in deep, research-backed integration of educational science may gain a long-term advantage. The next milestone to watch will be whether any tool can reliably exceed 90% accuracy in a blinded test with teachers, a benchmark that would indicate true utility. Until then, the most valuable role for AI in this domain may be in augmenting human expertise—flagging potential issues for teacher review—rather than attempting to replace it.