TY - JOUR
AU - Mansoor, Masab
AU - Ibrahim, Andrew F
AU - Grindem, David
AU - Baig, Asad
PY - 2025
DA - 2025/3/19
TI - Large Language Models for Pediatric Differential Diagnoses in Rural Health Care: Multicenter Retrospective Cohort Study Comparing GPT-3 With Pediatrician Performance
JO - JMIRx Med
SP - e65263
VL - 6
KW - natural language processing
KW - NLP
KW - machine learning
KW - ML
KW - artificial intelligence
KW - language model
KW - large language model
KW - LLM
KW - generative pretrained transformer
KW - GPT
KW - pediatrics
AB - Background: Rural health care providers face unique challenges such as limited specialist access and high patient volumes, making accurate diagnostic support tools essential. Large language models like GPT-3 have demonstrated potential in clinical decision support but remain understudied in pediatric differential diagnosis. Objective: This study aims to evaluate the diagnostic accuracy and reliability of a fine-tuned GPT-3 model compared to board-certified pediatricians in rural health care settings. Methods: This multicenter retrospective cohort study analyzed 500 pediatric encounters (ages 0‐18 years; n=261, 52.2% female) from rural health care organizations in Central Louisiana between January 2020 and December 2021. The GPT-3 model (DaVinci version) was fine-tuned using the OpenAI application programming interface and trained on 350 encounters, with 150 reserved for testing. Five board-certified pediatricians (mean experience: 12, SD 5.8 years) provided reference standard diagnoses. Model performance was assessed using accuracy, sensitivity, specificity, and subgroup analyses. Results: The GPT-3 model achieved an accuracy of 87.3% (131/150 cases), sensitivity of 85% (95% CI 82%‐88%), and specificity of 90% (95% CI 87%‐93%), comparable to pediatricians’ accuracy of 91.3% (137/150 cases; P=.47). Performance was consistent across age groups (0‐5 years: 54/62, 87%; 6‐12 years: 47/53, 89%; 13‐18 years: 30/35, 86%) and common complaints (fever: 36/39, 92%; abdominal pain: 20/23, 87%). For rare diagnoses (n=20), accuracy was slightly lower (16/20, 80%) but comparable to pediatricians (17/20, 85%; P=.62). Conclusions: This study demonstrates that a fine-tuned GPT-3 model can provide diagnostic support comparable to pediatricians, particularly for common presentations, in rural health care. Further validation in diverse populations is necessary before clinical implementation.
SN - 2563-6316
UR - https://xmed.jmir.org/2025/1/e65263
UR - https://doi.org/10.2196/65263
DO - 10.2196/65263
ID - info:doi/10.2196/65263
ER -