
Conference Proceedings Detail Page

Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis
written by Nikhil Sanjay Borse, Ravishankar Chatta Subramaniam, and N. Sanjay Rebello
Qualitative analysis is typically limited to small datasets because it is time-intensive. Moreover, a second human rater is required to ensure reliable findings. Artificial intelligence tools may replace human raters if we can demonstrate high reliability compared to human ratings. We investigated the inter-rater reliability of state-of-the-art Large Language Models (LLMs), ChatGPT-4o and ChatGPT-4.5-preview, in rating audio transcripts that had been coded manually. We explored prompts and hyperparameters to optimize model performance. The participants were 14 undergraduate student groups from a university in the midwestern U.S. who discussed problem-solving strategies for a project. We prompted an LLM to replicate the manual coding and calculated Cohen's kappa for inter-rater reliability. After optimizing model hyperparameters and prompts, the results showed substantial agreement (κ > 0.6) for three themes and moderate agreement on one. Our findings demonstrate the potential of GPT-4o and GPT-4.5 for efficient, scalable qualitative analysis in physics education and identify their limitations in rating domain-general constructs.
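The abstract's reliability measure, Cohen's kappa, compares observed agreement between two raters against the agreement expected by chance. As a rough illustration only (not the authors' actual analysis pipeline), the statistic can be computed from two raters' labels like this; the human/LLM codes below are invented for demonstration:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the chance-expected agreement
    derived from each rater's marginal label frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: product of each rater's marginal rates per label.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[lab] * counts_b[lab] for lab in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical presence/absence codes for one theme across 10 segments.
human = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
llm   = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(human, llm)  # ≈ 0.583, "moderate" on common benchmarks
```

Values above 0.6, as reported for three of the paper's themes, are conventionally read as "substantial" agreement.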
Physics Education Research Conference 2025
Part of the PER Conference series
Washington, DC: August 6-7, 2025
Pages 92-98
Subjects:
- Education - Applied Research
  - Technology
    = Computers
- Education - Basic Research
  - Research Design & Methodology
    = Evaluation
    = Statistics
- General Physics
  - Physics Education Research
Levels:
- Lower Undergraduate
Resource Types:
- Reference Material
  = Research study
PER-Central Type:
- PER Literature
Intended Users:
- Researchers
- Professional/Practitioners


Format:
application/pdf
Mirror:
https://doi.org/10.1119/perc.2025…
Access Rights:
Free access
License:
This material is released under a Creative Commons Attribution 4.0 license. Further distribution of this work must maintain attribution to the published article's author(s), title, proceedings citation, and DOI.
Rights Holder:
American Association of Physics Teachers
DOI:
10.1119/perc.2025.pr.Borse
NSF Number:
2300645
Keyword:
PERC 2025
Record Creator:
Metadata instance created October 20, 2025 by Lyle Barbato
Record Updated:
October 27, 2025 by Lyle Barbato
Last Update when Cataloged:
October 28, 2025

AIP Format
N. Borse, R. Chatta Subramaniam, and N. Rebello, presented at the Physics Education Research Conference 2025, Washington, DC, 2025, WWW Document, (https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051).
AJP/PRST-PER
N. Borse, R. Chatta Subramaniam, and N. Rebello, Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis, presented at the Physics Education Research Conference 2025, Washington, DC, 2025, <https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051>.
APA Format
Borse, N., Chatta Subramaniam, R., & Rebello, N. (2025, August 6-7). Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis. Paper presented at Physics Education Research Conference 2025, Washington, DC. Retrieved December 12, 2025, from https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051
Chicago Format
Borse, N, R. Chatta Subramaniam, and N. Rebello. "Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis." Paper presented at the Physics Education Research Conference 2025, Washington, DC, August 6-7, 2025. https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051 (accessed 12 December 2025).
MLA Format
Borse, Nikhil Sanjay, Ravishankar Chatta Subramaniam, and N. Sanjay Rebello. "Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis." Physics Education Research Conference 2025. Washington, DC: 2025. 92-98 of PER Conference. 12 Dec. 2025 <https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051>.
BibTeX Export Format
@inproceedings{ Author = "Nikhil Sanjay Borse and Ravishankar Chatta Subramaniam and N. Sanjay Rebello", Title = {Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis}, BookTitle = {Physics Education Research Conference 2025}, Pages = {92-98}, Address = {Washington, DC}, Series = {PER Conference}, Month = {August 6-7}, Year = {2025} }
Refer Export Format

%A Nikhil Sanjay Borse %A Ravishankar Chatta Subramaniam %A N. Sanjay Rebello %T Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis %S PER Conference %D August 6-7 2025 %P 92-98 %C Washington, DC %U https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051 %O Physics Education Research Conference 2025 %O August 6-7 %O application/pdf

EndNote Export Format

%0 Conference Proceedings %A Borse, Nikhil Sanjay %A Chatta Subramaniam, Ravishankar %A Rebello, N. Sanjay %D August 6-7 2025 %T Investigation of the Inter-Rater Reliability between Large Language Models and Human Raters in Qualitative Analysis %B Physics Education Research Conference 2025 %C Washington, DC %P 92-98 %S PER Conference %8 August 6-7 %U https://www.compadre.org/Repository/document/ServeFile.cfm?ID=17123&DocID=6051


Disclaimer: ComPADRE offers citation styles as a guide only. We cannot offer interpretations about citations as this is an automated procedure. Please refer to the style manuals in the Citation Source Information area for clarifications.

Citation Source Information

The AIP Style presented is based on information from the AIP Style Manual.

The AJP/PRST-PER presented is based on the AIP Style with the addition of journal article titles and conference proceeding article titles.

The APA Style presented is based on information from APA Style.org: Electronic References.

The Chicago Style presented is based on information from Examples of Chicago-Style Documentation.

The MLA Style presented is based on information from the MLA FAQ.
