
PERC 2025 Abstract Detail Page


Abstract Title: Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis
Abstract Type: Contributed Poster Presentation
Abstract: Qualitative analysis is typically limited to small datasets because it is time-intensive. Moreover, a second human rater is required to ensure reliable findings. Artificial intelligence tools may replace human raters if they demonstrate high reliability relative to human ratings. We investigated the inter-rater reliability of state-of-the-art Large Language Models (LLMs), ChatGPT-4o and ChatGPT-4.5-preview, in rating audio transcripts that had been coded manually. We explored prompts and hyperparameters to optimize model performance. The participants were 14 undergraduate student groups at a university in the midwestern U.S. who discussed problem-solving strategies for a project. We prompted the LLMs to replicate the manual coding and calculated Cohen's kappa for inter-rater reliability. After optimizing model hyperparameters and prompts, the results showed substantial agreement (κ > 0.6) for three themes and moderate agreement for one. Our findings demonstrate the potential of GPT-4o and GPT-4.5 for efficient, scalable qualitative analysis in physics education and identify their limitations in rating subjective constructs.
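
For illustration, here is a minimal Python sketch of the agreement statistic described in the abstract, assuming the human and LLM codes for a single theme are available as parallel label lists; the variable names and example values are hypothetical placeholders, not the authors' actual pipeline or data.

    # Minimal sketch: Cohen's kappa between a human rater and an LLM for one theme.
    # The codes below are hypothetical (1 = theme present, 0 = absent).
    from sklearn.metrics import cohen_kappa_score

    human_codes = [1, 0, 1, 1, 0, 0, 1, 0]  # manual codes per transcript segment
    llm_codes   = [1, 0, 1, 0, 0, 0, 1, 0]  # LLM codes for the same segments

    kappa = cohen_kappa_score(human_codes, llm_codes)
    print(f"Cohen's kappa: {kappa:.2f}")  # kappa > 0.6 indicates substantial agreement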
Footnote: This work is supported in part by U.S. National Science Foundation Grant 23000645. Opinions expressed are those of the authors and not of the Foundation.
Session Time: Poster Session A
Poster Number: A-4

Author/Organizer Information

Primary Contact: Nikhil Sanjay Borse
Purdue University
Co-Author(s) and Co-Presenter(s)
Ravishankar Chatta Subramaniam, Purdue University
N. Sanjay Rebello, Purdue University
