Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays

Authors: Haowei Hua(1), Hong Jiao(2), Xinyi Wang(3) ((1) Princeton University, (2) University of Maryland, College Park, (3) University of Maryland, College Park & Beijing Normal University)

Abstract: BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.

Link: https://arxiv.org/abs/2510.22830

CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning

Authors: Hamed Mahdavi(Pennsylvania State University), Pouria Mahdavinia(Pennsylvania State University), Alireza Farhadi(Amirkabir University of Technology), Pegah Mohammadipour(Pennsylvania State University), Samira Malek(Pennsylvania State University), Majid Daliri(New York University), Pedram Mohammadipour(Amirkabir University of Technology), Alireza Hashemi(City University of New York), Amir Khasahmadi(Autodesk), Vasant Honavar(Pennsylvania State University)

Abstract: State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.

Link: https://arxiv.org/abs/2510.27094

Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring

Authors: Hong Jiao, Hanna Choi, Haowei Hua

Abstract: This study explored the utilities of rationales generated by GPT-4.1 and GPT-5 in automated scoring using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. The study found in general essay-based scoring performed better than rationale-based scoring with higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring led to higher scoring accuracy in terms of F1 scores for score 0 which had less representation due to class imbalance issues. The ensemble modeling of essay-based scoring models increased the scoring accuracy at both specific score levels and across all score levels. The ensemble modeling of essay-based scoring and each of the rationale-based scoring performed about the same. Further ensemble of essay-based scoring and both rationale-based scoring yielded the best scoring accuracy with QWK of 0.870 compared with 0.848 reported in literature.

Link: https://arxiv.org/abs/2510.27131

Exploring the Applications of Generative AI in High School STEM Education

Authors: Ishaan Masilamony

Abstract: In recent years, ChatGPT \cite{openai_2023_gpt4} along with Microsoft Copilot have become subjects of great discourse, particularly in the field of education. Prior research has hypothesized on potential impacts these tools could have on student learning and performance. These have primarily relied on trends from prior applications of technology in education and an understanding of the limitations and strengths of Generative AI in other applications. This study utilizes an experimental approach to analyze the impacts of Generative AI on high school STEM education (physics in particular). In accordance with most findings, generative AI does have some positive impact on student performance. However, our findings have shown that the most significant impact is an increase in student engagement with the subject.

Link: https://arxiv.org/abs/2510.21718

Hybrid Instructor Ai Assessment In Academic Projects: Efficiency, Equity, And Methodological Lessons

Authors: Hugo Roger Paz

Abstract: In technical subjects characterized by high enrollment, such as Basic Hydraulics, the assessment of reports necessitates superior levels of objectivity, consistency, and formative feedback; goals often compromised by faculty workload. This study presents the implementation of a generative artificial intelligence (AI) assisted assessment system, supervised by instructors, to grade 33 hydraulics reports. The central objective was to quantify its impact on the efficiency, quality, and fairness of the process. The employed methodology included the calibration of the Large Language Model (LLM) with a detailed rubric, the batch processing of assignments, and a human-in-the-loop validation phase. The quantitative results revealed a noteworthy 88% reduction in grading time (from 50 to 6 minutes per report, including verification) and a 733% increase in productivity. The quality of feedback was substantially improved, evidenced by 100% rubric coverage and a 150% increase in the anchoring of comments to textual evidence. The system proved to be equitable, exhibiting no bias related to report length, and highly reliable post-calibration (r = 0.96 between scores). It is concluded that the hybrid AI-instructor model optimizes the assessment process, thereby liberating time for high-value pedagogical tasks and enhancing the fairness and quality of feedback, in alignment with UNESCO’s principles on the ethical use of AI in education.

Link: https://arxiv.org/abs/2510.22286

css.php