Challenges of LLMs in Medical Applications


  • View profile for Yauheni "Owen" Solad MD MBA

    Corporate VP of Clinical AI at HCA Healthcare

    6,734 followers

    Is AI Easing Clinician Workloads—or Adding More?

    Healthcare is rapidly embracing AI and Large Language Models (LLMs), hoping to reduce clinician workload. But early adoption reveals a more complicated reality: verifying AI outputs, dealing with errors, and struggling with workflow integration can actually increase clinicians’ cognitive load. Here are four key considerations:

    1. Verification Overload - LLMs might produce coherent summaries, but “coherent” doesn’t always mean correct. Manually double-checking AI-generated notes or recommendations becomes an extra task on an already packed schedule.
    2. Trust Erosion - Even a single AI-driven mistake—like the wrong dosage—can compromise patient safety. Errors that go unnoticed fracture clinicians’ trust and force them to re-verify every recommendation, negating AI’s efficiency.
    3. Burnout Concerns - AI is often touted as a remedy for burnout. Yet if it’s poorly integrated or frequently incorrect, clinicians end up verifying and correcting even more, adding mental strain instead of relieving it.
    4. Workflow Hurdles - LLMs excel in flexible, open-ended tasks, but healthcare requires precision, consistency, and structured data. This mismatch can lead to patchwork solutions and unpredictable performance.

    Moving Forward:
    - Tailored AI: Healthcare-specific designs that reduce “prompt engineering” and improve accuracy.
    - Transparent Validation: Clinicians need to understand how AI arrives at its conclusions.
    - Human-AI Collaboration: AI should empower, not replace, clinicians by streamlining verification.
    - Continuous Oversight: Monitoring, updates, and ongoing training are crucial for safe, effective adoption.

    If implemented thoughtfully, LLMs can move from novelty to genuine clinical asset. But we have to address these limitations head-on to ensure AI truly lightens the load. Want a deeper dive? Check out the full article where we explore each of these points in more detail—and share how we can build AI solutions that earn clinicians’ trust instead of eroding it.

  • Our paper got published in JAMA! 🎉 Earlier this year, Suhana Bedi, Yutong Liu, and I led a paper at Stanford University School of Medicine that highlights critical gaps in evaluating Large Language Models (LLMs) in healthcare. We categorized all 519 relevant studies from 1 Jan 2022 to 19 Feb 2024 by (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) task, (4) dimension of evaluation, and (5) medical specialty. In doing so, we revealed:

    - Only 5% used real patient care data in their testing and evaluation.
    - Key tasks like prescription writing and clinical summarization are underexplored.
    - The focus on accuracy dominates, while vital aspects like fairness, bias, and toxicity remain largely neglected.
    - Only 1 study assessed the financial impact of LLMs in healthcare.

    Why does this matter?

    - Real patient care data encompasses the complexities of clinical practice, so a thorough evaluation of LLM performance should mirror clinical performance as closely as possible to truly determine effectiveness.
    - Many high-value administrative tasks in health care are labor intensive, require manual input, and contribute to physician burnout, yet they remain chronically understudied.
    - Only 15.8% of studies conducted any evaluation of how factors such as race and ethnicity, gender, or age affect bias in the model’s output. Future research should place greater emphasis on fairness, bias, and toxicity evaluations if we want to stop LLMs from perpetuating bias.
    - Future evaluations must estimate total implementation costs, including model operation, monitoring, maintenance, and infrastructure adjustments, before reallocating resources from other health care initiatives.

    The paper calls for standardized evaluation metrics, broader coverage of healthcare applications, and real patient care data to ensure safe and equitable AI integration. This is essential for the responsible adoption of LLMs in healthcare to truly improve patient care. And I am delighted that I get to work on implementing the findings of this research at the Coalition for Health AI (CHAI).

    This paper could not have happened without Nigam Shah's constant support, leadership, and guidance, and that of our co-authors Dev Dash, Sanmi Koyejo, Alison Callahan, Jason Fries, Michael Wornow, Akshay Swaminathan, Lisa Lehmann, H. Christy Hong, MD MBA, Mehr Kashyap, Akash Chaurasia, Nirav R. Shah, Karandeep Singh, Troy Tazbaz, Arnold Milstein, and Michael Pfeffer. Thank you also to Nicholas Chedid, MD, MBA, Brian Anderson, MD, and Justin Norden, MD, MBA, MPhil for your guidance and mentorship. And of course, a huge shout out to my co-conspirators Yutong Liu and Suhana Bedi - you are the best team. This is the first paper I've ever written, and I'm eternally grateful to you all for showing me how it's done.

    Full article here: https://lnkd.in/eimh9BNV

  • View profile for James Barry, MD, MBA

    AI Critical Optimist | Experienced Physician Leader | Keynote Speaker | Co-Founder NeoMIND-AI and Clinical Leaders Group | Pediatric Advocate | Quality Improvement | Patient Safety

    4,466 followers

    I am sure you have heard by now... Microsoft’s MAI-DxO, a "medical super-intelligence agentic model with an orchestrator," achieved 80% diagnostic accuracy—four times higher than practicing physicians on 304 NEJM Group Clinicopathological Conference cases (https://lnkd.in/giCG-zZd).

    What the study (https://lnkd.in/gKph2SiT) shows:
    - MAI-DxO, an “orchestrator,” thinks like a multidisciplinary team.
    - Similar diagnostic gains appear across diverse model families (OpenAI, Gemini, Claude, Grok, DeepSeek AI, Llama), suggesting the orchestrator’s strategy, not any single frontier model, is the reason.

    Very impressive... but to me it seems many creating these models believe that being a clinician is mainly about being a diagnostician. That is quite far from reality.

    Other recent noteworthy studies:
    🟢 Stanford University’s new MedHELM benchmark (https://lnkd.in/gjE5BH_c) shows frontier LLMs shine in note-writing and patient communications, yet stumble on billing codes and prior auths.
    🟢 Hippocratic AI’s Real World Evaluation-LLM study (https://lnkd.in/gmHXJZem) needed 6,234 U.S. clinicians and >300,000 conversations to push a patient-facing “care agent” past 99% correct-advice rates—that's a lot of resources.
    🟢 Most studies of LLMs on healthcare tasks do not use real data. The study from Bedi et al. (https://lnkd.in/gfuv8-vp) showed that across 519 papers published between 2022 and early 2024, only 5% drew on data generated during routine patient care.
    🟢 Epistemic uncertainty is a big issue for LLM adoption in healthcare. Ethan Goh, MD (https://lnkd.in/gvEQS4YX) found that giving PCPs direct access to GPT-4 did not improve their diagnostic reasoning.

    Instead of LLM vs Physician, the real comparison should be Physician vs Physician ➕ LLM. Clinical work is a team sport that occurs over time, not a snapshot in which diagnosis is the primary focus: data gathering ➜ hypothesis generation ➜ negotiation of uncertainty ➜ patient-centered decisions ➜ nuances to support the patient’s treatment or care plan, performed iteratively, often over years.

    Can LLMs or agentic models help across the care continuum? As orchestrator of diagnostic tests and care plans, as a prevention specialist, or as a better, more patient, less time-constrained educator? The question to ask: “Did the team (human + model) make safer, more effective, ethical, and equitable choices?”

    Thoughts on:
    ▪️ How do we study and then teach clinicians to interrogate model epistemics—knowing when to trust, verify, or override?
    ▪️ Where will augmented workflows (ambient charting, preventative health) deliver the earliest ROI that benefits the patient and clinician?

    Let’s move the debate from “humans or algorithms” to “humans with algorithms, responsibly deployed.” Scott J. Campbell MD, MPH #UsingWhatWeHaveBetter

  • View profile for Yubin Park, PhD

    CEO at mimilabs | CTO at falcon | LinkedIn Top Voice | Ph.D., Machine Learning and Health Data

    18,101 followers

    The Context Problem: Why LLMs Aren't Ready for Claims Processing (Yet)

    Soon, every claim will be fed to some form of LLM. The challenge is whether the LLM has the right "context" to review medical claims. Many claims are already reviewed by LLMs, but soon every single claim may need to go through AI to determine coverage, amount, and compliance.

    The problem? If you've tried asking medical questions to LLMs, you know it takes a few iterations to get the right answer. You need to steer the AI to the right context so it's grounded in the right knowledge and scope. If we need to do this with every claim that comes to payers... oh man. It doesn't matter if LLMs are cheaper or not. With all the back and forth, it may not be that cheap, and think of the time.

    The practical challenge is whether we can provide the right tools to ground LLMs in the right context quickly. That's what falcon health has been working on. So far, I've been investigating various research angles with LLMs, and I've identified a somewhat abstract, repetitive task: looking up NCDs/LCDs, interpreting what the documents say, and working out how that translates to claims decisions.

    Here's how I packaged this repetitive process using function calling for our FWA investigation agent (a sketch of the pattern appears below). You provide the procedure and diagnosis codes from a claim and ask if the combination makes sense. The system can then reason over the National and Local Coverage Determination databases, as well as other literature, to provide you with an answer.

    Instead of hoping LLMs will magically understand complex coverage rules, we're building tools that rapidly surface the right regulatory context for each claim. This "context injection" eliminates the need for expensive back-and-forth, resulting in single, authoritative decisions.

    Questions for you all: How are you interpreting claims? How do you determine whether claims are compliant? Does it take a long time to review and apply NCDs/LCDs, and to translate them into the language of medical claims? Would love to hear your thoughts!

    #HealthcareAI #MedicalClaims #LLM #ClaimsProcessing https://lnkd.in/dnZ5Pv4W

    sentinel investigator in action

    https://www.loom.com
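
    To make the "context injection via function calling" idea concrete, here is a minimal Python sketch of that loop using an OpenAI-style tool-calling API. It is an illustration under stated assumptions, not falcon's actual implementation: the tool name, the stubbed NCD/LCD lookup, and the example CPT/ICD-10 codes are all hypothetical placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_coverage_determinations(procedure_code: str, diagnosis_code: str) -> str:
    """Hypothetical lookup tool: a real version would query the CMS NCD/LCD
    databases for the given code pair; here it returns a stubbed excerpt."""
    return json.dumps({
        "procedure_code": procedure_code,
        "diagnosis_code": diagnosis_code,
        "coverage_excerpts": [
            "Stub: relevant National/Local Coverage Determination text would go here."
        ],
    })

# Tool schema advertised to the model so it can request coverage context itself.
tools = [{
    "type": "function",
    "function": {
        "name": "search_coverage_determinations",
        "description": "Retrieve NCD/LCD coverage text for a procedure/diagnosis pair.",
        "parameters": {
            "type": "object",
            "properties": {
                "procedure_code": {"type": "string", "description": "CPT/HCPCS code from the claim"},
                "diagnosis_code": {"type": "string", "description": "ICD-10 code from the claim"},
            },
            "required": ["procedure_code", "diagnosis_code"],
        },
    },
}]

messages = [
    {"role": "system",
     "content": "You review medical claims. Ground every coverage decision in retrieved NCD/LCD text."},
    {"role": "user",
     "content": "Does procedure 93000 (ECG) with diagnosis I10 (hypertension) make sense for coverage?"},
]

# First pass: the model decides it needs coverage context and emits a tool call.
first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
reply = first.choices[0].message

if reply.tool_calls:
    messages.append(reply)  # keep the assistant's tool-call turn in the transcript
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_coverage_determinations(**args),  # inject the retrieved context
        })
    # Second pass: the model answers, grounded in the injected NCD/LCD excerpts.
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)
else:
    print(reply.content)
```

    In practice the lookup would hit a real NCD/LCD index or retrieval service, but the shape of the loop is the point: the model asks for coverage context once, the tool supplies it, and the second call returns a single grounded decision instead of many rounds of back-and-forth.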

  • View profile for Douglas Flora, MD, LSSBB

    Oncologist | Author, Rebooting Cancer Care | Executive Medical Director | Editor-in-Chief, AI in Precision Oncology | ACCC President-Elect | Founder, CEO, TensorBlack | Cancer Survivor

    14,723 followers

    🚨 AI in Healthcare: A Regulatory Wake-Up Call? 🚨

    Large language models (LLMs) like GPT-4 and Llama-3 are showing incredible promise in clinical decision support. But here’s the catch: they’re not regulated as medical devices, and yet they’re already generating recommendations that look a lot like regulated medical guidance.

    A recent study found that even when prompted to avoid device-like recommendations, these AI models often provided clinical decision support in ways that could meet FDA criteria for a medical device. In some cases, their responses aligned with established medical standards—but in others, they ventured into high-risk territory, making treatment recommendations that should only come from trained professionals.

    This raises a big question: Should AI-driven clinical decision support be regulated? And if so, how do we balance innovation with patient safety? Right now, there’s no clear framework for LLMs used by non-clinicians in critical situations.

    🔹 What does this mean for healthcare professionals? AI is advancing fast, and while it can be a powerful tool, it’s crucial to recognize its limitations.
    🔹 For regulators? There’s an urgent need to define new oversight models that account for generative AI’s unique capabilities.
    🔹 For AI developers? Transparency, accuracy, and adherence to safety standards will be key to building trust in medical AI applications.

    As AI continues to evolve, we’re entering uncharted territory. The conversation about regulation isn’t just theoretical—it’s becoming a necessity.

    What do you think? Should AI in clinical decision support be regulated like a medical device? Let’s discuss. 👇

  • View profile for Benjamin Schwartz, MD, MBA

    Chief Medical Officer @ Commons Clinic | MSK & Specialty Care Strategy | ASC & Value-Based Care Leadership | Healthcare Innovation

    36,405 followers

    Turns out patients prefer ChatGPT's medical advice over human nurses', but only when they don't know it's coming from AI. This fascinating study of 253 TKA patients investigated the use of LLMs to answer patients' questions following knee replacement surgery. Orthopedic nurses and ChatGPT were asked the same questions, and their answers were graded by surgeons. Grades were almost identical between the two. When the researchers surveyed patients about their preferences, an interesting paradox emerged: 54% of patients were more comfortable with ChatGPT's answers compared to 34% for nurses' responses. Yet when asked directly, 93% said they'd be uncertain about trusting AI for medical questions. Almost 66% responded that their comfort level in trusting the answer would change if they knew it was provided by ChatGPT. In conclusion, ChatGPT gave good answers that patients preferred, but patients still demonstrated a level of discomfort and distrust in AI. Is ignorance bliss? As AI becomes more ubiquitous in healthcare, will skepticism hurt adoption? How do we bridge that gap? The study highlights the potential of Generative AI and LLMs but also reveals the barriers that remain. As we move full steam ahead to introduce artificial intelligence tools in medicine, we might want to take a moment to consider the patient perspective.

  • View profile for Dipu Patel, DMSc, MPAS, ABAIM, PA-C

    📚🤖🌐 Educating the next generation of digital health clinicians and consumers | Digital Health + AI Thought Leader | Speaker | Strategist | Author | Innovator | Board Executive Leader | Mentor | Consultant | Advisor | TheAIPA

    5,247 followers

    This JMIR study introduces the first large-language-model (LLM)–assisted surgical consent forms used in Korean liver resection procedures, offering a unique blend of clarity and innovation for clinicians and digital health educators.

    Key Takeaways:
    - LLM edits significantly simplified sentence structures and vocabulary, reducing text complexity and enhancing accessibility.
    - Expert ratings dropped meaningfully for risk descriptions (from 2.29 to 1.92; β₁=–0.371; P=.01) and overall impression (from 2.21 to 1.71; β₁=–0.500; P=.03).
    - Qualitative feedback described the text as “overly simplified” and “less professional,” suggesting nuance was lost amid gains in clarity.
    - This is one of the first non-English studies, highlighting the challenge of applying LLMs across linguistic and cultural contexts.

    My thoughts... As a clinical educator deeply invested in AI, healthcare quality, and patient-centered communication, I see this study underscoring some vital lessons:
    - Enhanced readability ≠ sufficient informed consent.
    - Balance is key. We must safeguard medical and legal integrity while making content accessible, especially for multilingual, multicultural patient populations.
    - Clinicians and digital-health leaders must be trained to use LLM-generated content critically.

    https://lnkd.in/ej8UnXaJ

  • View profile for Eric Henry

    Advising Boards and Management in Medical Device & Digital Health Companies | Crisis Leadership & Regulatory Strategy | 35+ Years Guiding Companies Through FDA Compliance & High-Stakes Situations

    7,573 followers

    Interesting article regarding the risks of generative AI used for clinical decision support. The authors point to a hole in FDA's clinical decision support guidance, while also noting those who feel FDA's regulatory framework for AI/ML will stifle innovation.

    One huge regulatory hole I didn't see mentioned, but that definitely keeps me up at night, is the lack of regulatory oversight for these systems, or indeed any medical device or health IT system, that is developed purely within the walls of a hospital system. Any hospital system with internal R&D can design, develop, manufacture, and/or release devices or systems that would normally be regulated by FDA or ONC without any regulatory oversight, so long as they do not market or distribute them outside the hospital system.

    Hospital systems are already developing and using Large Language Models (LLMs), for example, to support clinical decision-making and even drive certain doctor-patient interactions, with huge impact on patient safety. Hospital R&D departments also routinely develop their own medical devices using a variety of technologies. None of these devices/systems are regulated by FDA, ONC, or any other government agency the way a commercial product would be. In other words, there are no design controls, no defined change management criteria, no certification schemes, no submissions for clearance/approval based on clinical and regulatory review, no production and process controls, no post-market surveillance, no risk management, etc., except as imposed internally by the hospital system itself.

    Especially with the implementation of advanced AI models, which make deployment of safety-critical systems within the walls of hospitals even easier than with electro-mechanical devices, industry, government, and the public should be looking for a way to close this gap and provide greater confidence in the safety and effectiveness of hospital-developed devices and systems. Joint Commission to the rescue? Not so far.

    Just some food for thought today. https://lnkd.in/gfzZTS3N

  • View profile for Maxim (Max) Topaz PhD, RN, MA, FAAN, FIAHSI, FACMI

    Health AI & Nursing Informatics Leader | 200+ Pubs (JAMA, Nature) | $25M+ NIH Funded | Global Keynote Speaker on AI | Columbia

    5,644 followers

    Critical new evidence on LLMs in healthcare: Shool et al.'s (2025) systematic review of 761 studies reveals a remarkable acceleration. Healthcare AI research grew from 1 study in 2019 to 557 in 2024.

    The research reveals clear patterns: ChatGPT and GPT-4 dominate evaluations (93.5%), with accuracy as the primary focus across studies. Effect measurement tells the story: while accuracy gets extensive attention, broader performance evaluation appears in only 4.5% of studies.

    Research priorities don't align perfectly with clinical needs. Surgery leads applications (28.2%), yet critical areas remain underrepresented: cardiology (1.9%), emergency medicine (2.7%), and notably, nursing applications (0.7%), despite nurses being healthcare's largest workforce.

    Particularly interesting: medical-domain LLMs account for only 6.45% of evaluations, suggesting general models may be meeting clinical demands effectively. However, most studies evaluated already-retired model versions, creating a natural lag in this rapidly evolving field.

    The evidence shows transformative potential emerging, but implementation gaps remain. Safety and bias evaluation still need focused attention alongside accuracy metrics.

    As healthcare leaders, we have emerging empirical guidance for AI adoption: not whether to integrate these tools, but how to deploy them strategically, ensuring comprehensive evaluation while addressing real clinical workflows. The future of healthcare isn't human or AI. It's human with AI, systematically validated.

    #HealthcareAI #LLM #DigitalHealth #MedicalAI #ClinicalDecisionSupport #HealthTech
