The ability of artificial intelligence to accurately assess the cognitive complexity of educational tasks is a critical benchmark for its practical utility in classrooms. A new study reveals that current AI models, including both general and education-specific tools, achieve only modest accuracy in this domain, exhibiting systematic biases and reasoning flaws that could mislead educators if deployed without oversight.
Key Takeaways
- Eleven AI tools were evaluated on classifying math tasks by cognitive demand, achieving an average accuracy of only 63%.
- Education-specific tools (Brisk, Coteach AI, Khanmigo, Magic School, School.AI) performed no better than general-purpose models (ChatGPT, Claude, Gemini, etc.).
- No tool exceeded 83% accuracy, and all struggled most with tasks at the lowest (Memorization) and highest (Doing Mathematics) cognitive levels.
- AI models showed a systematic bias toward classifying tasks as mid-level "Procedures," often overweighting surface text features over deep cognitive processes.
- Errors stemmed from flawed reasoning about multiple task aspects, not from ignoring relevant dimensions, and were often accompanied by persuasive, plausible-sounding explanations.
Evaluating AI's Ability to Classify Cognitive Demand in Math
The study aimed to determine if AI could reliably classify mathematical tasks using a research-based framework with four levels of cognitive demand: Memorization, Procedures Without Connections, Procedures With Connections, and Doing Mathematics. This classification is vital for teachers adapting curricula to maintain rigor and meet individual student needs. The researchers tested eleven AI tools—six general-purpose and five built for education—using straightforward prompts to approximate real-world teacher use.
The results were sobering. The aggregate accuracy across all tools was just 63%. The education-specific tools, which include platforms like Khanmigo from Khan Academy and Magic School, did not outperform their general-purpose counterparts. The top-performing tool still fell short of reliability, reaching a ceiling of 83% accuracy. Performance was weakest at the framework's extremes; AI tools consistently misclassified simple recall tasks (Memorization) and complex, non-routine problems (Doing Mathematics), defaulting instead to the middle "Procedures" categories.
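The per-level weakness described above is easiest to see as a per-category accuracy breakdown. A minimal sketch, using hypothetical gold/predicted label pairs (not the study's data) that mimic the reported pattern:

```python
from collections import Counter

# Hypothetical (gold, predicted) pairs, NOT the study's data, illustrating
# the reported pattern: tasks at the extremes get pulled toward the middle
# "Procedures" levels while middle-level tasks are classified correctly.
LEVELS = [
    "Memorization",
    "Procedures Without Connections",
    "Procedures With Connections",
    "Doing Mathematics",
]

def per_level_accuracy(pairs):
    """Fraction of tasks at each gold level that were classified correctly."""
    correct, total = Counter(), Counter()
    for gold, pred in pairs:
        total[gold] += 1
        correct[gold] += gold == pred
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}

pairs = [
    ("Memorization", "Procedures Without Connections"),   # extreme -> middle
    ("Memorization", "Memorization"),
    ("Procedures Without Connections", "Procedures Without Connections"),
    ("Procedures With Connections", "Procedures With Connections"),
    ("Doing Mathematics", "Procedures With Connections"), # extreme -> middle
    ("Doing Mathematics", "Doing Mathematics"),
]
acc = per_level_accuracy(pairs)
```

On data like this, the middle categories score 1.0 while both extremes score 0.5, the same U-shaped weakness the study reports in aggregate.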
Error analysis revealed a critical flaw in the tools' reasoning process. The models frequently identified relevant features of a task but then synthesized them incorrectly. They tended to be swayed by surface-level textual cues, such as the presence of keywords or the apparent length of a problem, rather than by the underlying cognitive process required to solve it. Furthermore, every misclassification was accompanied by a confident, well-articulated explanation, a pattern that risks persuading and misleading novice teachers who trust the AI's output.
Industry Context & Analysis
This study provides a crucial reality check in an edtech market surging with AI integration promises. The failure of specialized tools like Khanmigo or Magic School to outperform general models like ChatGPT-4 or Claude 3 is telling. It suggests that current "education-specific" AI may be primarily applying a thin layer of pedagogical prompting over the same underlying large language model (LLM) architectures, rather than being fundamentally retrained or fine-tuned on high-quality, domain-specific reasoning data. For context, the leading general models post strong scores on standard reasoning benchmarks; GPT-4 achieves approximately 86.4% on MMLU (Massive Multitask Language Understanding), yet this study shows that raw knowledge does not translate into nuanced pedagogical classification.
The systematic bias toward middle-category classification mirrors a known LLM behavior: the tendency toward "averageness" or central tendency in ambiguous scenarios. This is a fundamental limitation of models trained to predict the most statistically likely next token, rather than to execute deliberate, structured reasoning. Unlike a dedicated educational algorithm that might use a rule-based or feature-engineering approach, LLMs lack a verifiable "chain of thought" for this specific task. The finding that errors came from incorrect reasoning about identified features, not from ignorance of them, points to a core challenge in AI alignment for professional domains.
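This central-tendency effect can be illustrated with a toy model, entirely our assumption rather than anything from the study: encode the four levels as an ordinal scale 0 to 3, and let an uncertainty-prone predictor shrink each true level toward the scale midpoint before rounding to the nearest level.

```python
# Toy illustration (our assumption, not the study's model) of why hedging
# on an ordinal scale pulls predictions toward the middle categories.
# Levels 0..3 stand for Memorization through Doing Mathematics.

MIDPOINT = 1.5  # center of the 0..3 ordinal scale

def shrink_predict(true_level, weight=0.5):
    """Blend the true level with the midpoint, then round to the nearest level."""
    blended = (1 - weight) * true_level + weight * MIDPOINT
    return round(blended)

preds = {lvl: shrink_predict(lvl) for lvl in range(4)}
```

With even moderate shrinkage, the extreme levels 0 and 3 collapse onto the middle levels 1 and 2, while the middle levels survive rounding intact: exactly the misclassification pattern the study observed.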
This research connects to a broader industry trend of discovering the "soft underbelly" of generative AI in expert fields. Similar performance gaps have been identified in legal reasoning, medical diagnosis, and scientific literature review, where contextual nuance and multi-factor judgment are paramount. The edtech AI market, projected to grow from $2.5 billion in 2022 to over $25 billion by 2032, is racing to implement features. However, this study indicates that foundational capabilities for core teaching tasks—like curriculum analysis—are not yet mature, risking the deployment of persuasive but unreliable assistants.
What This Means Going Forward
For educators and school administrators, this study serves as a vital caution. It underscores that AI cannot yet be trusted as an autonomous tool for curricular adaptation or task classification. The persuasive nature of incorrect explanations means professional development must focus on AI literacy and critical evaluation, positioning teachers as informed skeptics and final arbiters of AI-generated content. The tools may be best used as brainstorming aids or for generating initial task ideas, which are then rigorously vetted by human expertise.
For edtech developers and AI companies, the path forward is clear. Simply packaging a general LLM with an educational wrapper is insufficient for high-stakes pedagogical tasks. The next generation of tools requires specialized fine-tuning on rigorously labeled datasets of educational tasks, paired with reasoning architectures that force the model to explicitly weigh cognitive dimensions. Techniques like retrieval-augmented generation (RAG) grounded in established educational frameworks, or the development of verifiable classification chains, could mitigate the current reasoning flaws. Benchmarking against this type of cognitive demand classification should become a standard metric, akin to how code generation models are evaluated on HumanEval.
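One way to make the suggested "verifiable classification chain" concrete is to require the model to emit explicit per-dimension judgments, then map them to a level with a fixed, auditable rule. A hypothetical sketch, with dimension names and decision rules of our own invention rather than taken from the framework's rubric:

```python
from dataclasses import dataclass

@dataclass
class CognitiveScores:
    # Hypothetical dimensions a model would be forced to judge explicitly
    # before any level is assigned (names are illustrative, not official).
    follows_known_algorithm: bool  # solvable by a rehearsed procedure?
    connects_to_concepts: bool     # procedure linked to underlying meaning?
    nonroutine_exploration: bool   # demands a novel strategy or analysis?

def classify(scores: CognitiveScores) -> str:
    """Fixed, auditable rule: check the most demanding dimension first."""
    if scores.nonroutine_exploration:
        return "Doing Mathematics"
    if scores.follows_known_algorithm and scores.connects_to_concepts:
        return "Procedures With Connections"
    if scores.follows_known_algorithm:
        return "Procedures Without Connections"
    return "Memorization"
```

The point of the design is that the final label is a deterministic function of the emitted judgments, so a teacher can audit which dimension drove the classification instead of relying on a free-text explanation.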
The immediate watchpoint is whether this research influences procurement and implementation strategies in school districts. Will it slow the adoption of AI for lesson planning, or redirect investment toward tools that offer transparency and teacher control? Furthermore, as the underlying LLMs continue to evolve—with companies like OpenAI, Anthropic, and Google touting improved reasoning—replicating this study in 6-12 months will be essential to measure genuine progress versus incremental change. The ultimate beneficiaries will be students, but only if the technology supporting their teachers is built on a foundation of accurate, not just plausible, expertise.