
Currently submitted to: JMIR Medical Education

Date Submitted: Mar 27, 2025
Open Peer Review Period: Mar 28, 2025 - May 23, 2025
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Comparative Evaluation of Advanced AI Reasoning Models in the Korean National Licensing Examination: OpenAI vs DeepSeek

  • Jin-Gyu Lee; 
  • Gyeong Hoon Kim; 
  • Jiahn Chae; 
  • Jun Suh Lee; 
  • Hyun-Young Shin

ABSTRACT

Artificial intelligence (AI) has advanced rapidly in natural language processing and reasoning, and large language models (LLMs) are increasingly being assessed for medical education and licensing examinations. Given this growing use of AI, evaluating LLM performance on non-Western, region-specific tests such as the Korean Medical Licensing Examination (KMLE) is crucial for assessing real-world applicability. This study compared five LLMs (GPT-4o, o1, and o3-mini from OpenAI; DeepSeek-V3 and DeepSeek-R1 from DeepSeek) on the KMLE. A total of 150 multiple-choice questions from the 2024 KMLE were extracted and categorized into three domains: Local Health & Medical Law, Preventive Medicine, and Clinical Medicine; graph-based questions were excluded. Each model completed five independent runs via API, and accuracy was assessed against the official answer key. Differences in accuracy were analyzed using ANOVA, and run-to-run consistency was measured using Fleiss' kappa coefficient. o1 achieved the highest overall accuracy (94.3%), excelling in Clinical Medicine (97.5%) and Local Health & Medical Law (81.0%), while DeepSeek-R1 led in Preventive Medicine (92.6%). Despite domain-specific variation, all models surpassed the passing criteria. For consistency, o1 ranked highest overall (97.1%), with DeepSeek-V3 excelling in Local Health & Medical Law (97.5%). Performance declined across models in Local Health & Medical Law, likely due to legal complexity and limited Korean-language training data. This is the first study to compare OpenAI and DeepSeek models on a medical licensing examination; both performed strongly, with o1 and DeepSeek-R1 ranking within the top 10% of human candidates. While o1 was the most accurate, DeepSeek-R1 offered a cost-effective alternative. Future research should optimize LLMs for non-English examinations and develop Korea-specific AI models to improve accuracy in legal domains.
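The evaluation pipeline described above (five independent API runs per model, accuracy against the official key, ANOVA across models, and Fleiss' kappa for run-to-run consistency) can be sketched in code. The following Python is a minimal illustration, not the authors' actual scripts: the model names, the random placeholder answers, and the answer key are hypothetical stand-ins for the real API responses and the official 2024 KMLE key.

```python
# Hypothetical sketch of the evaluation described in the abstract:
# five runs per model, accuracy vs. the official key, one-way ANOVA
# across models, and Fleiss' kappa for run-to-run consistency.
# All data below are random placeholders, not study results.

from collections import Counter
import numpy as np
from scipy.stats import f_oneway

MODELS = ["gpt-4o", "o1", "o3-mini", "deepseek-v3", "deepseek-r1"]
N_RUNS, N_QUESTIONS, N_CHOICES = 5, 150, 5

rng = np.random.default_rng(0)
# answers[model] is a (runs x questions) array of chosen options (0-4);
# in the real study these would come from API calls to each model.
answers = {m: rng.integers(0, N_CHOICES, size=(N_RUNS, N_QUESTIONS))
           for m in MODELS}
official = rng.integers(0, N_CHOICES, size=N_QUESTIONS)  # placeholder key

def fleiss_kappa(ratings: np.ndarray, n_categories: int) -> float:
    """Fleiss' kappa for (raters x items) categorical ratings.
    Here the 'raters' are the five independent runs of one model."""
    n_raters, n_items = ratings.shape
    # counts[i, c] = how many runs chose option c on question i
    counts = np.zeros((n_items, n_categories))
    for i in range(n_items):
        for c, k in Counter(ratings[:, i]).items():
            counts[i, c] = k
    # per-item observed agreement, then its mean
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # chance agreement from overall category proportions
    p_e = ((counts.sum(axis=0) / (n_items * n_raters)) ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Per-run accuracy for each model, then ANOVA over the five-run samples.
per_run_acc = {m: (a == official).mean(axis=1) for m, a in answers.items()}
f_stat, p_value = f_oneway(*per_run_acc.values())

for m in MODELS:
    print(f"{m}: mean accuracy {per_run_acc[m].mean():.3f}, "
          f"kappa {fleiss_kappa(answers[m], N_CHOICES):.3f}")
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
```

With real model responses substituted for the placeholders, the same two functions yield the quantities the abstract reports: per-model accuracy, the ANOVA comparison across models, and per-model Fleiss' kappa.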


Citation

Please cite as:

Lee JG, Kim GH, Chae J, Lee JS, Shin HY

Comparative Evaluation of Advanced AI Reasoning Models in the Korean National Licensing Examination: OpenAI vs DeepSeek

JMIR Preprints. 27/03/2025:75032

DOI: 10.2196/preprints.75032

URL: https://preprints.jmir.org/preprint/75032


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.