Human or AI? Comparing Design Thinking Assessments by Teaching Assistants and Bots

Authors: Sumbul Khan, Wei Ting Liow, Lay Kee Ang

Abstract: As design thinking education grows in secondary and tertiary contexts, educators face the challenge of evaluating creative artefacts that combine visual and textual elements. Traditional rubric-based assessment is laborious, time-consuming, and inconsistent due to reliance on Teaching Assistants (TA) in large, multi-section cohorts. This paper presents an exploratory study investigating the reliability and perceived accuracy of AI-assisted assessment compared to TA-assisted assessment in evaluating student posters in design thinking education. Two activities were conducted with 33 Ministry of Education (MOE) Singapore school teachers to (1) compare AI-generated scores with TA grading across three key dimensions: empathy and user understanding, identification of pain points and opportunities, and visual communication, and (2) examine teacher preferences for AI-assigned, TA-assigned, and hybrid scores. Results showed low statistical agreement between instructor and AI scores for empathy and pain points, with slightly higher alignment for visual communication. Teachers preferred TA-assigned scores in six of ten samples. Qualitative feedback highlighted the potential of AI for formative feedback, consistency, and student self-reflection, but raised concerns about its limitations in capturing contextual nuance and creative insight. The study underscores the need for hybrid assessment models that integrate computational efficiency with human insights. This research contributes to the evolving conversation on responsible AI adoption in creative disciplines, emphasizing the balance between automation and human judgment for scalable and pedagogically sound assessment.

Link: https://arxiv.org/abs/2510.16069

Human or AI? Comparing Design Thinking Assessments by Teaching Assistants and Bots

Authors: Sumbul Khan, Wei Ting Liow, Lay Kee Ang

Abstract: As design thinking education grows in secondary and tertiary contexts, educators face the challenge of evaluating creative artefacts that combine visual and textual elements. Traditional rubric-based assessment is laborious, time-consuming, and inconsistent due to reliance on Teaching Assistants (TA) in large, multi-section cohorts. This paper presents an exploratory study investigating the reliability and perceived accuracy of AI-assisted assessment compared to TA-assisted assessment in evaluating student posters in design thinking education. Two activities were conducted with 33 Ministry of Education (MOE) Singapore school teachers to (1) compare AI-generated scores with TA grading across three key dimensions: empathy and user understanding, identification of pain points and opportunities, and visual communication, and (2) examine teacher preferences for AI-assigned, TA-assigned, and hybrid scores. Results showed low statistical agreement between instructor and AI scores for empathy and pain points, with slightly higher alignment for visual communication. Teachers preferred TA-assigned scores in six of ten samples. Qualitative feedback highlighted the potential of AI for formative feedback, consistency, and student self-reflection, but raised concerns about its limitations in capturing contextual nuance and creative insight. The study underscores the need for hybrid assessment models that integrate computational efficiency with human insights. This research contributes to the evolving conversation on responsible AI adoption in creative disciplines, emphasizing the balance between automation and human judgment for scalable and pedagogically sound assessment.

Link: https://arxiv.org/abs/2510.16069

Human or AI? Comparing Design Thinking Assessments by Teaching Assistants and Bots

Authors: Sumbul Khan, Wei Ting Liow, Lay Kee Ang

Abstract: As design thinking education grows in secondary and tertiary contexts, educators face the challenge of evaluating creative artefacts that combine visual and textual elements. Traditional rubric-based assessment is laborious, time-consuming, and inconsistent due to reliance on Teaching Assistants (TA) in large, multi-section cohorts. This paper presents an exploratory study investigating the reliability and perceived accuracy of AI-assisted assessment compared to TA-assisted assessment in evaluating student posters in design thinking education. Two activities were conducted with 33 Ministry of Education (MOE) Singapore school teachers to (1) compare AI-generated scores with TA grading across three key dimensions: empathy and user understanding, identification of pain points and opportunities, and visual communication, and (2) examine teacher preferences for AI-assigned, TA-assigned, and hybrid scores. Results showed low statistical agreement between instructor and AI scores for empathy and pain points, with slightly higher alignment for visual communication. Teachers preferred TA-assigned scores in six of ten samples. Qualitative feedback highlighted the potential of AI for formative feedback, consistency, and student self-reflection, but raised concerns about its limitations in capturing contextual nuance and creative insight. The study underscores the need for hybrid assessment models that integrate computational efficiency with human insights. This research contributes to the evolving conversation on responsible AI adoption in creative disciplines, emphasizing the balance between automation and human judgment for scalable and pedagogically sound assessment.

Link: https://arxiv.org/abs/2510.16069

How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

Authors: Devanshu Sahoo, Vasudev Majhi, Arjun Neekhra, Yash Sinha, Murari Mandal, Dhruv Kumar

Abstract: The use of Large Language Models (LLMs) as automatic judges for code evaluation is becoming increasingly prevalent in academic environments. But their reliability can be compromised by students who may employ adversarial prompting strategies in order to induce misgrading and secure undeserved academic advantages. In this paper, we present the first large-scale study of jailbreaking LLM-based automated code evaluators in academic context. Our contributions are: (i) We systematically adapt 20+ jailbreaking strategies for jailbreaking AI code evaluators in the academic context, defining a new class of attacks termed academic jailbreaking. (ii) We release a poisoned dataset of 25K adversarial student submissions, specifically designed for the academic code-evaluation setting, sourced from diverse real-world coursework and paired with rubrics and human-graded references, and (iii) In order to capture the multidimensional impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evalulate the academic jailbreaking attacks using six LLMs. We find that these models exhibit significant vulnerability, particularly to persuasive and role-play-based attacks (up to 97% JSR). Our adversarial dataset and benchmark suite lay the groundwork for next-generation robust LLM-based evaluators in academic code assessment.

Link: https://arxiv.org/abs/2512.10415

Agentic AI as Undercover Teammates: Argumentative Knowledge Construction in Hybrid Human-AI Collaborative Learning

Authors: Lixiang Yan, Yueqiao Jin, Linxuan Zhao, Roberto Martinez-Maldonado, Xinyu Li, Xiu Guan, Wenxin Guo, Xibin Han, Dragan Gašević

Abstract: Generative artificial intelligence (AI) agents are increasingly embedded in collaborative learning environments, yet their impact on the processes of argumentative knowledge construction remains insufficiently understood. Emerging conceptualisations of agentic AI and artificial agency suggest that such systems possess bounded autonomy, interactivity, and adaptability, allowing them to engage as epistemic participants rather than mere instructional tools. Building on this theoretical foundation, the present study investigates how agentic AI, designed as undercover teammates with either supportive or contrarian personas, shapes the epistemic and social dynamics of collaborative reasoning. Drawing on Weinberger and Fischer’s (2006) four-dimensional framework, participation, epistemic reasoning, argument structure, and social modes of co-construction, we analysed synchronous discourse data from 212 human and 64 AI participants (92 triads) engaged in an analytical problem-solving task. Mixed-effects and epistemic network analyses revealed that AI teammates maintained balanced participation but substantially reorganised epistemic and social processes: supportive personas promoted conceptual integration and consensus-oriented reasoning, whereas contrarian personas provoked critical elaboration and conflict-driven negotiation. Epistemic adequacy, rather than participation volume, predicted individual learning gains, indicating that agentic AI’s educational value lies in enhancing the quality and coordination of reasoning rather than amplifying discourse quantity. These findings extend CSCL theory by conceptualising agentic AI as epistemic and social participants, bounded yet adaptive collaborators that redistribute cognitive and argumentative labour in hybrid human-AI learning environments.

Link: https://arxiv.org/abs/2512.08933

css.php