Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness

Authors: Haotian Deng, Chris Farber, Jiyoon Lee, David Tang

Abstract: Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM-judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model’s robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity. Our “Trust Curve” analysis demonstrates a clear trade-off where filtering low-confidence predictions improves accuracy on the remaining subset. Additionally, robustness experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions. Our work provides critical insights into the capabilities and limitations of rubric-conditioned LLM judges, highlighting the importance of uncertainty estimation and robustness testing for reliable deployment.

Link: https://arxiv.org/abs/2601.08843
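
The consensus-based deferral mechanism described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hypothetical grade_once() helper that makes a single rubric-conditioned call to the LLM judge, samples it several times, and defers the item to a human when agreement falls below a threshold, which is the coverage-vs-accuracy trade-off a "trust curve" plots.

from collections import Counter

def grade_once(question: str, answer: str, rubric: str) -> str:
    """Hypothetical single call to the LLM judge; returns a rubric label."""
    raise NotImplementedError  # stand-in for an actual model call

def grade_with_deferral(question, answer, rubric, n_samples=7, min_agreement=0.8):
    """Sample several independent grades and defer when consensus is weak."""
    votes = Counter(grade_once(question, answer, rubric) for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    agreement = count / n_samples
    if agreement < min_agreement:
        # Filtering out low-agreement items sacrifices coverage but should
        # raise accuracy on the retained subset.
        return {"decision": "defer_to_human", "agreement": agreement}
    return {"decision": label, "agreement": agreement}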

Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment

Authors: Manas Khatore, Sumana Sridharan, Kevork Sulahian, Benjamin J. Smith, Shi Feng

Abstract: Automated answer matching, which leverages LLMs to evaluate free-text responses by comparing them to a reference answer, shows substantial promise as a scalable and aligned alternative to human evaluation. However, its reliability requires robustness against strategic attacks such as guesswork or verbosity that may artificially inflate scores without improving actual correctness. In this work, we systematically investigate whether such tactics deceive answer matching models by prompting examinee models to: (1) generate verbose responses, (2) provide multiple answers when unconfident, and (3) embed conflicting answers with the correct answer near the start of their response. Our results show that these manipulations do not increase scores and often reduce them. Additionally, binary scoring (which requires a matcher to answer with a definitive “correct” or “incorrect”) is more robust to attacks than continuous scoring (which requires a matcher to determine partial correctness). These findings show that answer matching is generally robust to inexpensive text manipulation and is a viable alternative to traditional LLM-as-a-judge or human evaluation when reference answers are available.

Link: https://arxiv.org/abs/2601.08849
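
A minimal sketch of the binary answer-matching setup discussed above, for illustration only: the prompt wording and the judge_call() helper are assumptions, not the paper's code.

BINARY_MATCH_PROMPT = """You are grading an exam response against a reference answer.
Question: {question}
Reference answer: {reference}
Student response: {response}
Reply with exactly one word: correct or incorrect."""

def judge_call(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM serves as the matcher."""
    raise NotImplementedError

def binary_match(question: str, reference: str, response: str) -> bool:
    """Binary scoring: force a definitive correct/incorrect verdict.

    The abstract reports this formulation is more robust to verbosity,
    hedged multi-answer responses, and embedded conflicting answers than
    asking the matcher for a partial-credit (continuous) score.
    """
    prompt = BINARY_MATCH_PROMPT.format(
        question=question, reference=reference, response=response)
    verdict = judge_call(prompt).strip().lower()
    return verdict.startswith("correct")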

Exploring the Effects of Generative AI Assistance on Writing Self-Efficacy

Authors: Yejoon Song, Bandi Kim, Yeju Kwon, Sung Park

Abstract: Generative AI (GenAI) is increasingly used in academic writing, yet its effects on students’ writing self-efficacy remain contingent on how assistance is configured. This pilot study investigates how ideation-level, sentence-level, full-process, and no AI support differentially shape undergraduate writers’ self-efficacy, using a 2 by 2 experimental design with Korean undergraduates completing argumentative writing tasks. Results indicate that AI assistance does not uniformly enhance self-efficacy: full AI support produced high but stable self-efficacy alongside signs of reduced ownership, sentence-level AI support led to a consistent decline in self-efficacy, and ideation-level AI support was associated with both high self-efficacy and positive longitudinal change. These findings suggest that the locus of AI intervention, rather than the amount of assistance, is critical in fostering writing self-efficacy while preserving learner agency.

Link: https://arxiv.org/abs/2601.09033

PATS: Personality-Aware Teaching Strategies with Large Language Model Tutors

Authors: Donya Rooein, Sankalan Pal Chowdhury, Mariia Eremeeva, Yuan Qin, Debora Nozza, Mrinmaya Sachan, Dirk Hovy

Abstract: Recent advances in large language models (LLMs) demonstrate their potential as educational tutors. However, different tutoring strategies benefit different student personalities, and mismatches can be counterproductive to student outcomes. Despite this, current LLM tutoring systems do not take student personality traits into account. To address this problem, we first construct a taxonomy that links pedagogical methods to personality profiles, based on the pedagogical literature. We simulate student-teacher conversations and use our framework to let the LLM tutor adjust its strategy to the simulated student personality. We evaluate the scenario with human teachers and find that they consistently prefer our approach over two baselines. Our method also increases the use of less common, high-impact strategies such as role-playing, which both human and LLM annotators significantly prefer. Our findings pave the way for developing more personalized and effective LLM use in educational applications.

Link: https://arxiv.org/abs/2601.08402
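
One way to picture a personality-aware tutor is a lookup from the trait profile into a strategy taxonomy that then conditions the tutor's system prompt. The sketch below is illustrative only: the taxonomy entries are placeholders rather than the mapping from the paper, and select_strategies() and build_tutor_system_prompt() are hypothetical helpers.

STRATEGY_TAXONOMY = {
    # personality trait level -> tutoring strategies assumed suitable for it
    "extraversion_high": ["role_playing", "collaborative_dialogue"],
    "conscientiousness_low": ["scaffolding", "frequent_checkpoints"],
    "openness_high": ["open_ended_questions", "analogies"],
}

def select_strategies(student_profile: dict, threshold: float = 0.6) -> list:
    """Pick tutoring strategies whose associated trait levels match the profile."""
    chosen = []
    for trait, score in student_profile.items():
        key = f"{trait}_high" if score >= threshold else f"{trait}_low"
        chosen.extend(STRATEGY_TAXONOMY.get(key, []))
    return chosen or ["direct_instruction"]  # fallback when no trait stands out

def build_tutor_system_prompt(strategies: list) -> str:
    """Condition the LLM tutor on the selected strategies via its system prompt."""
    return ("You are a tutor. In this session, favour the following strategies: "
            + ", ".join(strategies) + ".")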

Can Large Language Models Differentiate Harmful from Argumentative Essays? Steps Toward Ethical Essay Scoring

Authors: Hongjin Kim, Jeonghyun Kang, Harksoo Kim

Abstract: This study addresses critical gaps in Automated Essay Scoring (AES) systems and Large Language Models (LLMs) with regard to their ability to effectively identify and score harmful essays. Despite advancements in AES technology, current models often overlook ethically and morally problematic elements within essays, erroneously assigning high scores to essays that may propagate harmful opinions. In this study, we introduce the Harmful Essay Detection (HED) benchmark, which includes essays integrating sensitive topics such as racism and gender bias, to test the efficacy of various LLMs in recognizing and scoring harmful content. Our findings reveal that: (1) LLMs require further enhancement to accurately distinguish between harmful and argumentative essays, and (2) both current AES models and LLMs fail to consider the ethical dimensions of content during scoring. The study underscores the need for developing more robust AES systems that are sensitive to the ethical implications of the content they are scoring.

Link: https://arxiv.org/abs/2601.05545
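
A simple way to make a scoring pipeline sensitive to the ethical dimension the abstract highlights is a two-stage check: flag harmful content first, then score. The sketch below is an assumption-laden illustration, not the HED benchmark implementation; the prompt texts and the llm() helper are placeholders.

HARM_CHECK_PROMPT = (
    "Does the following essay promote harmful opinions (e.g. racism or gender bias)? "
    "Answer 'harmful' or 'not harmful'.\n\nEssay:\n{essay}"
)
SCORING_PROMPT = "Score the following argumentative essay from 1 to 5.\n\nEssay:\n{essay}"

def llm(prompt: str) -> str:
    """Hypothetical call to whichever LLM performs the check and the scoring."""
    raise NotImplementedError

def ethically_aware_score(essay: str) -> dict:
    """Flag harmful essays before scoring instead of scoring rhetoric alone."""
    verdict = llm(HARM_CHECK_PROMPT.format(essay=essay)).strip().lower()
    if verdict.startswith("harmful"):
        return {"score": None, "flag": "harmful_content"}
    return {"score": llm(SCORING_PROMPT.format(essay=essay)).strip(), "flag": None}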
