Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring

Authors: Hong Jiao, Hanna Choi, Haowei Hua

Abstract: This study explored the utility of rationales generated by GPT-4.1 and GPT-5 for automated scoring, using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. In general, essay-based scoring performed better than rationale-based scoring, with a higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring achieved higher F1 scores for score 0, the level underrepresented due to class imbalance. Ensembling the essay-based scoring models increased scoring accuracy both at specific score levels and across all score levels. Ensembles of essay-based scoring with each of the two rationale-based scoring models performed about the same, and a further ensemble of essay-based scoring with both rationale-based scoring models yielded the best accuracy, with a QWK of 0.870 versus 0.848 reported in the literature.

Link: https://arxiv.org/abs/2510.27131
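
A minimal sketch, not taken from the paper, of the metrics the abstract cites: Quadratic Weighted Kappa and per-class F1 over essay scores, plus a simple ensemble of several scoring models. The 0-4 score range, the invented predictions, and the round-the-mean ensemble rule are illustrative assumptions; the study's actual ensemble method is not specified here.

```python
# Illustrative sketch only: QWK, per-class F1, and a naive ensemble over
# essay scores. The score vectors below are made-up data.
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

y_true = np.array([0, 1, 2, 3, 4, 2, 1, 0, 3, 4])            # gold essay scores
essay_pred = np.array([0, 1, 2, 3, 3, 2, 1, 1, 3, 4])         # essay-based model
rationale_pred_a = np.array([0, 1, 1, 3, 4, 2, 0, 0, 3, 4])   # rationale-based model A (hypothetical)
rationale_pred_b = np.array([0, 2, 2, 3, 4, 2, 1, 0, 2, 4])   # rationale-based model B (hypothetical)

def qwk(y, y_hat):
    """Quadratic Weighted Kappa, the agreement metric cited in the abstract."""
    return cohen_kappa_score(y, y_hat, weights="quadratic")

print("essay-based QWK:", qwk(y_true, essay_pred))
print("per-class F1 (essay-based):", f1_score(y_true, essay_pred, average=None))

# A naive ensemble: round the mean of the three predictions per essay.
# The paper's ensemble could differ; this only shows the general idea.
ensemble_pred = np.rint(
    np.mean([essay_pred, rationale_pred_a, rationale_pred_b], axis=0)
).astype(int)
print("ensemble QWK:", qwk(y_true, ensemble_pred))
```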

Exploring the Applications of Generative AI in High School STEM Education

Authors: Ishaan Masilamony

Abstract: In recent years, ChatGPT and Microsoft Copilot have become subjects of considerable discourse, particularly in education. Prior research has hypothesized about the potential impacts these tools could have on student learning and performance, relying primarily on trends from earlier applications of technology in education and on an understanding of the strengths and limitations of generative AI in other domains. This study uses an experimental approach to analyze the impact of generative AI on high school STEM education, physics in particular. Consistent with most prior findings, generative AI does have some positive impact on student performance; however, the most significant impact we observed is an increase in student engagement with the subject.

Link: https://arxiv.org/abs/2510.21718
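
A small sketch of one way an experimental comparison of student performance might be analyzed; the abstract does not describe its statistical procedure, so the two groups, the invented scores, and the choice of Welch's t-test are all assumptions.

```python
# Illustrative only: comparing an AI-assisted group with a control group.
# Data and the use of Welch's t-test are assumptions, not from the paper.
from scipy import stats

control_scores = [62, 70, 58, 75, 66, 71, 64, 69]
ai_assisted_scores = [68, 74, 65, 80, 72, 77, 70, 73]

t_stat, p_value = stats.ttest_ind(
    ai_assisted_scores, control_scores, equal_var=False  # Welch's t-test
)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```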

Hybrid Instructor-AI Assessment in Academic Projects: Efficiency, Equity, and Methodological Lessons

Authors: Hugo Roger Paz

Abstract: In technical subjects characterized by high enrollment, such as Basic Hydraulics, the assessment of reports requires high levels of objectivity, consistency, and formative feedback, goals that are often compromised by faculty workload. This study presents the implementation of a generative artificial intelligence (AI) assisted assessment system, supervised by instructors, to grade 33 hydraulics reports. The central objective was to quantify its impact on the efficiency, quality, and fairness of the process. The methodology included the calibration of the Large Language Model (LLM) with a detailed rubric, the batch processing of assignments, and a human-in-the-loop validation phase. The quantitative results revealed a noteworthy 88% reduction in grading time (from 50 to 6 minutes per report, including verification) and a 733% increase in productivity. The quality of feedback was substantially improved, evidenced by 100% rubric coverage and a 150% increase in the anchoring of comments to textual evidence. The system proved to be equitable, exhibiting no bias related to report length, and highly reliable post-calibration (r = 0.96 between scores). It is concluded that the hybrid AI-instructor model optimizes the assessment process, freeing time for high-value pedagogical tasks and enhancing the fairness and quality of feedback, in alignment with UNESCO’s principles on the ethical use of AI in education.

Link: https://arxiv.org/abs/2510.22286
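
A short arithmetic sketch of the efficiency figures quoted above (50 to 6 minutes per report is an 88% time reduction and roughly a 733% productivity gain), plus the kind of post-calibration agreement check the abstract reports as r = 0.96. Only the 50- and 6-minute figures come from the abstract; the score arrays are invented.

```python
# Reproduces the arithmetic behind the abstract's efficiency figures and
# sketches the post-calibration agreement check. Score data are fabricated.
import numpy as np

minutes_manual, minutes_hybrid = 50, 6                   # per report, from the abstract
time_reduction = 1 - minutes_hybrid / minutes_manual     # 0.88 -> 88%
productivity_gain = minutes_manual / minutes_hybrid - 1  # ~7.33 -> 733%
print(f"time reduction: {time_reduction:.0%}, productivity gain: {productivity_gain:.0%}")

# Reliability check: Pearson correlation between AI-assisted and instructor
# scores (the abstract reports r = 0.96; these numbers are invented).
ai_scores = np.array([82, 74, 90, 67, 88, 79, 71, 93])
instructor_scores = np.array([80, 75, 92, 65, 87, 78, 73, 95])
r = np.corrcoef(ai_scores, instructor_scores)[0, 1]
print(f"Pearson r between AI and instructor scores: {r:.2f}")
```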

The AI Tutor in Engineering Education: Design, Results, and Redesign of an Experience in Hydrology at an Argentine University

Authors: Hugo Roger Paz

Abstract: The emergence of Generative Artificial Intelligence (GenAI) has reshaped higher education, presenting both opportunities and ethical-pedagogical challenges. This article presents an empirical case study on the complete cycle (design, initial failure, redesign, and re-evaluation) of an intervention using an AI Tutor (ChatGPT) in the “Hydrology and Hydraulic Works” course (Civil Engineering, UTN-FRT, Argentina). The study documents two interventions in the same cohort (n=23). The first resulted in widespread failure (0% pass rate) due to superficial use and serious academic integrity issues (65% similarity, copies > 80%). This failure forced a comprehensive methodological redesign. The second intervention, based on a redesigned prompt (Prompt V2) with strict evidence controls (mandatory Appendix A with exported chat, minimum time ≥ 120 minutes, verifiable numerical exercise) and a refined rubric (Rubric V2), showed significantly better results: a median score of 88/100 and verifiable compliance with genuine interaction processes. Using a mixed-methods approach (reproducible document analysis and rubric analysis), the impact of the redesign on integrity and technical performance is evaluated. The results demonstrate that, without explicit process controls, students prioritize efficiency over deep learning, submitting documents without real traceability. A transferable assessment protocol for STEM courses is proposed, centered on “auditable personal zones,” to foster higher-order thinking. The study provides key empirical evidence from the context of a public Latin American university.

Link: https://arxiv.org/abs/2510.22279
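
A minimal sketch of the kind of evidence controls the Prompt V2 redesign describes: an exported-chat appendix, a minimum interaction time of 120 minutes, and a verifiable numerical exercise. The Submission structure, field names, and checking logic are assumptions for illustration; only the three controls themselves come from the abstract.

```python
# Illustrative compliance check for the process controls named in the abstract.
from dataclasses import dataclass

MIN_MINUTES = 120  # minimum verifiable interaction time, per the abstract

@dataclass
class Submission:
    has_chat_appendix: bool       # "Appendix A" with the exported AI chat
    interaction_minutes: int      # total time evidenced in the chat log
    has_numerical_exercise: bool  # includes a checkable numerical exercise

def unmet_process_controls(sub: Submission) -> list[str]:
    """Return the list of unmet controls; an empty list means the process evidence is acceptable."""
    issues = []
    if not sub.has_chat_appendix:
        issues.append("missing exported-chat appendix (Appendix A)")
    if sub.interaction_minutes < MIN_MINUTES:
        issues.append(f"interaction time {sub.interaction_minutes} min < {MIN_MINUTES} min")
    if not sub.has_numerical_exercise:
        issues.append("no verifiable numerical exercise")
    return issues

print(unmet_process_controls(Submission(True, 95, True)))
# ['interaction time 95 min < 120 min']
```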

MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

Authors: Tengchao Yang, Sichen Guo, Mengzhao Jia, Jiaming Su, Yuanyang Liu, Zhihan Zhang, Meng Jiang

Abstract: Effective math tutoring requires not only solving problems but also diagnosing students’ difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and is structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room for improvement compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and the diagnostic value of MMTutorBench for advancing AI tutoring.

Link: https://arxiv.org/abs/2510.23477
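
A minimal sketch of rubric-based LLM-as-a-Judge scoring in the spirit of the abstract. The rubric dimension names, the prompt wording, and the call_llm() placeholder are assumptions; MMTutorBench's actual rubrics and judge prompts are defined in the paper, not here.

```python
# Illustrative rubric-based judging loop with a stubbed judge-model call.
import json

RUBRIC_DIMENSIONS = [  # illustrative dimension names only
    "diagnosis of the student's difficulty",
    "correctness of the key step",
    "clarity of the step-by-step guidance",
]

def build_judge_prompt(problem: str, tutor_response: str) -> str:
    dims = "\n".join(f"- {d}: score 1-5" for d in RUBRIC_DIMENSIONS)
    return (
        "You are grading an AI math tutor's reply against a rubric.\n"
        f"Problem:\n{problem}\n\nTutor reply:\n{tutor_response}\n\n"
        f"Rubric dimensions:\n{dims}\n"
        'Return JSON like {"scores": {"<dimension>": <1-5>, ...}}.'
    )

def call_llm(prompt: str) -> str:
    # Placeholder for a real judge-model call; returns a canned response here.
    return json.dumps({"scores": {d: 4 for d in RUBRIC_DIMENSIONS}})

def judge(problem: str, tutor_response: str) -> dict:
    reply = call_llm(build_judge_prompt(problem, tutor_response))
    return json.loads(reply)["scores"]

print(judge("Solve 2x + 3 = 11.", "First isolate 2x by subtracting 3 from both sides ..."))
```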
