
Currently submitted to: JMIR Medical Informatics

Date Submitted: Feb 26, 2025
Open Peer Review Period: Mar 26, 2025 - May 21, 2025 (currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Portraits of Large Language Models: Deciphering the Taxonomy of Medical LLMs

  • Radha Nagarajan; 
  • Vanessa Klotzman; 
  • Midori Kondo; 
  • Sandip Godambe; 
  • Adam Gold; 
  • John Henderson; 
  • Steven Martel

ABSTRACT

Background:

Large Language Models (LLMs) continue to see enterprise-wide adoption in healthcare while evolving in number, size, complexity, cost, and, more importantly, performance. Performance benchmarks play a critical role in their ranking on community leaderboards and in their subsequent adoption.

Objective:

Given the small operating margins of healthcare organizations and the growing interest in LLMs and conversational AI, there is an urgent need for objective approaches that can assist in identifying viable LLMs without compromising performance. The objective of the present study is to generate a taxonomy portrait of medical LLMs (N = 33) whose domain-specific and domain non-specific multivariate performance benchmarks were available from the Open-Medical LLM and Open LLM leaderboards on Hugging Face.

Methods:

Hierarchical clustering of multivariate performance benchmarks is used to generate taxonomy portraits that reveal the inherent partitioning of the medical LLMs across diverse tasks. The domain-specific taxonomy is generated using nine medicine-related performance benchmarks from the Hugging Face Open-Medical LLM initiative, while the domain non-specific taxonomy is presented in tandem to assess performance on a set of six generic-task benchmarks from the Hugging Face Open LLM initiative. Subsequently, the non-parametric Wilcoxon rank-sum test and linear correlation are used to assess differential changes in the performance benchmarks between two broad groups of LLMs and potential redundancies between the benchmarks (an illustrative sketch of this workflow is given below).
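The following minimal Python sketch illustrates the kind of workflow described above: agglomerative hierarchical clustering of multivariate benchmark profiles, a per-benchmark Wilcoxon rank-sum comparison between the two resulting clusters, and pairwise linear correlations between benchmarks. The benchmark names, random scores, Ward linkage, Euclidean distance, and two-cluster cut are illustrative assumptions only; the study's actual leaderboard data, distance metric, and linkage criterion are not specified in this abstract.

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ranksums

rng = np.random.default_rng(0)

# Rows = medical LLMs, columns = performance benchmarks (illustrative names and random scores).
benchmarks = ["MedQA", "MedMCQA", "PubMedQA", "MMLU-Anatomy"]
scores = pd.DataFrame(rng.uniform(0.3, 0.9, size=(33, len(benchmarks))),
                      columns=benchmarks,
                      index=[f"LLM_{i:02d}" for i in range(33)])

# Agglomerative hierarchical clustering of the multivariate benchmark profiles
# (Ward linkage on Euclidean distances, assumed here for illustration).
Z = linkage(scores.values, method="ward", metric="euclidean")
groups = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into two broad families

# Non-parametric Wilcoxon rank-sum test per benchmark between the two families.
for b in benchmarks:
    stat, p = ranksums(scores.loc[groups == 1, b], scores.loc[groups == 2, b])
    print(f"{b}: rank-sum statistic = {stat:.2f}, p = {p:.3f}")

# Pairwise linear correlations between benchmarks to flag potential redundancy.
print(scores.corr(method="pearson").round(2))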

Results:

Two broad families of LLMs with statistically significant differences (α = 0.05) in performance benchmarks are identified for each of the taxonomies. Consensus in their performance on the domain-specific and domain non-specific tasks reveals the inherent robustness of these LLMs across diverse tasks. Statistically significant correlations between performance benchmarks reveal inherent redundancies, indicating that a subset of these benchmarks may be sufficient for assessing the domain-specific performance of medical LLMs.

Conclusions:

Understanding medical LLM taxonomies is an important step toward identifying LLMs with similar performance that also align with the needs, economics, and other demands of healthcare organizations. While the present study focuses on a subset of medical LLMs from Hugging Face, enhanced transparency of performance benchmarks and economics across a larger family of medical LLMs is needed to generate more comprehensive taxonomy portraits, accelerating their strategic and equitable adoption in healthcare.
Clinical Trial: Not applicable


 Citation

Please cite as:

Nagarajan R, Klotzman V, Kondo M, Godambe S, Gold A, Henderson J, Martel S

Portraits of Large Language Models: Deciphering the Taxonomy of Medical LLMs

JMIR Preprints. 26/02/2025:72918

DOI: 10.2196/preprints.72918

URL: https://preprints.jmir.org/preprint/72918


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.