Conference Proceedings Detail Page
Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis
Qualitative analysis is typically limited to small datasets because it is time-intensive; moreover, a second human rater is required to ensure reliable findings. Artificial intelligence tools may substitute for a second human rater if they demonstrate high reliability with human ratings. We investigated the inter-rater reliability of state-of-the-art Large Language Models (LLMs), ChatGPT-4o and ChatGPT-4.5-preview, in rating audio transcripts that had been coded manually. We explored prompts and hyperparameters to optimize model performance. The participants were 14 undergraduate student groups from a university in the midwestern U.S. who discussed problem-solving strategies for a project. We prompted an LLM to replicate the manual coding and calculated Cohen's kappa for inter-rater reliability. After optimizing model hyperparameters and prompts, the results showed substantial agreement (κ > 0.6) for three themes and moderate agreement for one. Our findings demonstrate the potential of GPT-4o and GPT-4.5 for efficient, scalable qualitative analysis in physics education and identify their limitations in rating domain-general constructs.
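The agreement statistic reported in the abstract, Cohen's kappa, corrects observed rater agreement for agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch of the computation follows; the rater labels are hypothetical illustrations, not data from the study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters' categorical labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary codes (1 = theme present) from a human rater and an LLM:
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
llm   = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(round(cohens_kappa(human, llm), 2))  # → 0.58
```

On the conventional Landis–Koch scale, 0.41–0.60 is moderate agreement and 0.61–0.80 is substantial, which is the sense in which the abstract reports κ > 0.6 as substantial.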
Physics Education Research Conference 2025
Part of the PER Conference series. Washington, DC: August 6-7, 2025. Pages 92-98.
Borse, N., R. Chatta Subramaniam, and N. Rebello. "Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis." Paper presented at the Physics Education Research Conference 2025, Washington, DC, August 6-7, 2025.
N. Borse, R. Chatta Subramaniam, and N. Rebello, presented at the Physics Education Research Conference 2025, Washington, DC, 2025, WWW Document, (https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051).
N. Borse, R. Chatta Subramaniam, and N. Rebello, Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis, presented at the Physics Education Research Conference 2025, Washington, DC, 2025, <https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051>.
Borse, N., Chatta Subramaniam, R., & Rebello, N. (2025, August 6-7). Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis. Paper presented at Physics Education Research Conference 2025, Washington, DC. Retrieved December 12, 2025, from https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051
Borse, N., R. Chatta Subramaniam, and N. Rebello. "Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis." Paper presented at the Physics Education Research Conference 2025, Washington, DC, August 6-7, 2025. https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051 (accessed 12 December 2025).
Borse, Nikhil Sanjay, Ravishankar Chatta Subramaniam, and N. Sanjay Rebello. "Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis." Physics Education Research Conference 2025. Washington, DC: 2025. 92-98 of PER Conference. 12 Dec. 2025 <https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051>.
@inproceedings{Borse2025,
Author = {Nikhil Sanjay Borse and Ravishankar Chatta Subramaniam and N. Sanjay Rebello},
Title = {Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis},
BookTitle = {Physics Education Research Conference 2025},
Pages = {92-98},
Address = {Washington, DC},
Series = {PER Conference},
Month = {August 6-7},
Year = {2025}
}
%A Nikhil Sanjay Borse
%A Ravishankar Chatta Subramaniam
%A N. Sanjay Rebello
%T Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis
%S PER Conference
%D August 6-7 2025
%P 92-98
%C Washington, DC
%U https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051
%O Physics Education Research Conference 2025
%O August 6-7
%O application/pdf

%0 Conference Proceedings
%A Borse, Nikhil Sanjay
%A Chatta Subramaniam, Ravishankar
%A Rebello, N. Sanjay
%D August 6-7 2025
%T Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis
%B Physics Education Research Conference 2025
%C Washington, DC
%P 92-98
%S PER Conference
%8 August 6-7
%U https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051

Disclaimer: ComPADRE offers citation styles as a guide only. We cannot offer interpretations about citations, as this is an automated procedure. Please refer to the style manuals in the Citation Source Information area for clarifications.
Citation Source Information
The AIP Style presented is based on information from the AIP Style Manual. The AJP/PRST-PER Style presented is based on the AIP Style with the addition of journal article titles and conference proceeding article titles. The APA Style presented is based on information from APA Style.org: Electronic References. The Chicago Style presented is based on information from Examples of Chicago-Style Documentation. The MLA Style presented is based on information from the MLA FAQ.




