Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays

Authors: Haowei Hua(1), Hong Jiao(2), Xinyi Wang(3) ((1) Princeton University, (2) University of Maryland, College Park, (3) University of Maryland, College Park & Beijing Normal University)

Abstract: BERT and its variants have been extensively explored for automated scoring. However, the 512-token limit of these encoder-based models is a deficiency for automated scoring of long essays. This research therefore explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
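
As a rough, hypothetical sketch of the summarize-then-score pipeline and the QWK metric mentioned in this abstract (not the authors' implementation), the snippet below uses placeholder `summarize` and `score_essay` functions standing in for generative-model calls and assumes scikit-learn's `cohen_kappa_score` for the metric:

```python
# Hypothetical sketch (not the paper's code): summarize long essays with a
# generative model, score the summaries by prompting, and evaluate with QWK.
from sklearn.metrics import cohen_kappa_score

def summarize(essay: str) -> str:
    """Placeholder for a generative-model call that condenses a long essay."""
    raise NotImplementedError  # replace with a real LLM/API call

def score_essay(text: str) -> int:
    """Placeholder for a prompted scoring call returning an integer score."""
    raise NotImplementedError  # replace with a real LLM/API call

def evaluate(essays: list[str], human_scores: list[int]) -> float:
    predicted = [score_essay(summarize(e)) for e in essays]
    # Quadratic Weighted Kappa: chance-corrected agreement that penalizes
    # larger score disagreements quadratically.
    return cohen_kappa_score(human_scores, predicted, weights="quadratic")
```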

Link: https://arxiv.org/abs/2510.22830

CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning

Authors: Hamed Mahdavi(Pennsylvania State University), Pouria Mahdavinia(Pennsylvania State University), Alireza Farhadi(Amirkabir University of Technology), Pegah Mohammadipour(Pennsylvania State University), Samira Malek(Pennsylvania State University), Majid Daliri(New York University), Pedram Mohammadipour(Amirkabir University of Technology), Alireza Hashemi(City University of New York), Amir Khasahmadi(Autodesk), Vasant Honavar(Pennsylvania State University)

Abstract: State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
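
As a rough illustration of what such an agentic grading workflow might look like, the sketch below chains a rubric-derivation step and a grading step around a placeholder `call_llm` function; the function names, prompts, and score-scale handling are hypothetical and not taken from the paper's released code.

```python
# Hypothetical sketch of a rubric-driven grading workflow (not the paper's code).

def call_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API; replace with a real client."""
    raise NotImplementedError

def derive_rubric(problem: str, reference_solution: str) -> str:
    """Ask the model to turn a reference solution into a problem-specific rubric."""
    return call_llm(
        "Given this problem and reference solution, write a grading rubric "
        "that allocates partial credit to the key steps.\n\n"
        f"Problem:\n{problem}\n\nReference solution:\n{reference_solution}"
    )

def grade_solution(problem: str, rubric: str, candidate_solution: str) -> str:
    """Grade a candidate proof step by step against the derived rubric."""
    return call_llm(
        "Grade the candidate solution against the rubric. List errors, judge "
        "their severity, then assign a score from 0 to 7.\n\n"
        f"Problem:\n{problem}\n\nRubric:\n{rubric}\n\n"
        f"Candidate solution:\n{candidate_solution}"
    )
```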

Link: https://arxiv.org/abs/2510.27094

Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring

Authors: Hong Jiao, Hanna Choi, Haowei Hua

Abstract: This study explored the utility of rationales generated by GPT-4.1 and GPT-5 for automated scoring, using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. The study found that, in general, essay-based scoring performed better than rationale-based scoring, with a higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring led to higher scoring accuracy in terms of F1 scores for score 0, which was underrepresented due to class imbalance. Ensembling the essay-based scoring models increased scoring accuracy both at specific score levels and across all score levels. Ensembles of essay-based scoring with each of the rationale-based scorings performed about the same. A further ensemble of essay-based scoring with both rationale-based scorings yielded the best scoring accuracy, with a QWK of 0.870 compared with the 0.848 reported in the literature.
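
As a minimal, hypothetical sketch of ensembling essay-based and rationale-based scoring models (not the authors' implementation), one simple option is soft voting: average each model's predicted probabilities over the score levels and take the highest-probability score.

```python
# Hypothetical ensemble sketch (not the authors' code): average per-score
# probabilities from an essay-based model and two rationale-based models.
import numpy as np

def ensemble_scores(prob_matrices: list[np.ndarray]) -> np.ndarray:
    """Each matrix has shape (n_essays, n_score_levels); rows are predicted
    probabilities over score levels. Returns the ensembled score per essay."""
    mean_probs = np.mean(prob_matrices, axis=0)   # soft-voting average
    return mean_probs.argmax(axis=1)              # highest-probability score level

# Toy example with 2 essays and score levels 0-3:
essay_model     = np.array([[0.1, 0.6, 0.2, 0.1], [0.7, 0.2, 0.05, 0.05]])
rationale_gpt41 = np.array([[0.2, 0.5, 0.2, 0.1], [0.6, 0.3, 0.05, 0.05]])
rationale_gpt5  = np.array([[0.1, 0.4, 0.4, 0.1], [0.8, 0.1, 0.05, 0.05]])
print(ensemble_scores([essay_model, rationale_gpt41, rationale_gpt5]))  # -> [1 0]
```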

Link: https://arxiv.org/abs/2510.27131

MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

Authors: Tengchao Yang, Sichen Guo, Mengzhao Jia, Jiaming Su, Yuanyang Liu, Zhihan Zhang, Meng Jiang

Abstract: Effective math tutoring requires not only solving problems but also diagnosing students’ difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room for improvement compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.

Link: https://arxiv.org/abs/2510.23477

The AI Tutor in Engineering Education: Design, Results, and Redesign of an Experience in Hydrology at an Argentine University

Authors: Hugo Roger Paz

Abstract: The emergence of Generative Artificial Intelligence (GenAI) has reshaped higher education, presenting both opportunities and ethical-pedagogical challenges. This article presents an empirical case study on the complete cycle (design, initial failure, redesign, and re-evaluation) of an intervention using an AI Tutor (ChatGPT) in the “Hydrology and Hydraulic Works” course (Civil Engineering, UTN-FRT, Argentina). The study documents two interventions in the same cohort (n=23). The first resulted in widespread failure (0% pass rate) due to superficial use and serious academic integrity issues (65% similarity, copies > 80%). This failure forced a comprehensive methodological redesign. The second intervention, based on a redesigned prompt (Prompt V2) with strict evidence controls (mandatory Appendix A with exported chat, minimum time ≥ 120 minutes, verifiable numerical exercise) and a refined rubric (Rubric V2), showed significantly better results: a median score of 88/100 and verifiable compliance with genuine interaction processes. Using a mixed-methods approach (reproducible document analysis and rubric analysis), the impact of the redesign on integrity and technical performance is evaluated. The results demonstrate that, without explicit process controls, students prioritize efficiency over deep learning, submitting documents without real traceability. A transferable assessment protocol for STEM courses is proposed, centered on “auditable personal zones,” to foster higher-order thinking. The study provides key empirical evidence from the context of a public Latin American university.

Link: https://arxiv.org/abs/2510.22279
