Challenges of Defining LLM Roles in Technical Analysis


Summary

Defining roles for large language models (LLMs) in technical analysis means deciding how these AI systems should be used, who manages them, and what skills are needed—especially as tasks become more complex and collaborative. The challenge comes from the fast-changing technology, overlapping responsibilities, and the need to measure whether these systems are truly solving the problems they're meant to address.

  • Clarify job titles: Make a clear distinction between those who build new LLMs and those who build applications using existing LLMs to avoid skill mismatches.
  • Set role boundaries: Spell out each agent’s responsibilities, decision-making authority, and expected outcomes to support teamwork and prevent confusion.
  • Develop measurement standards: Create simple, reliable ways to track how well LLM agents are performing over time, including how they collaborate and adapt to changes.
  • Philipp Schmid

    AI Developer Experience at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    164,740 followers

    Why do multi-agent LLM systems "still" fail? A new study explores why multi-agent systems are not significantly outperforming single-agent baselines and identifies 14 failure modes. A multi-agent system (MAS) is a set of agents that interact, communicate, and collaborate to achieve a shared goal that would be difficult or unreliable for a single agent to accomplish.

    Benchmark:
    - Selected five popular, open-source MAS frameworks (MetaGPT, ChatDev, HyperAgent, AppWorld, AG2)
    - Chose tasks representative of each MAS's intended capabilities (software development, SWE-Bench Lite, utility service tasks, GSM-Plus), 150 tasks in total
    - Recorded complete conversation logs, had human annotators review them (using Cohen's Kappa scores to ensure consistency and reliability), and added LLM-as-a-Judge validation

    Multi-agent failure modes:
    1. Disobey Task Spec: Ignores task rules and requirements, leading to wrong output.
    2. Disobey Role Spec: Agent acts outside its defined role and responsibilities.
    3. Step Repetition: Unnecessarily repeats steps already completed, causing delays.
    4. Loss of History: Forgets previous conversation context, causing incoherence.
    5. Unaware Stop: Fails to recognize task completion and continues unnecessarily.
    6. Conversation Reset: Dialogue unexpectedly restarts, losing context and progress.
    7. Fail to Clarify: Does not ask for needed information when requirements are unclear.
    8. Task Derailment: Gradually drifts away from the intended task objective.
    9. Withholding Info: Agent does not share important, relevant information.
    10. Ignore Input: Disregards or insufficiently considers input from others.
    11. Reasoning Mismatch: Actions do not logically follow from stated reasoning.
    12. Premature Stop: Ends the task too early, before completion or information exchange.
    13. No Verification: Lacks mechanisms to check or confirm task outcomes.
    14. Incorrect Verification: Verification process is flawed and misses critical errors.

    How to improve a multi-agent LLM system:
    📝 Define tasks and agent roles clearly and explicitly in prompts.
    🎯 Use examples in prompts to clarify expected task and role behavior.
    🗣️ Design structured conversation flows to guide agent interactions.
    ✅ Implement self-verification steps in prompts for agents to check their reasoning.
    🧩 Design modular agents with specific, well-defined roles for simpler debugging.
    🔄 Redesign topology to incorporate verification roles and iterative refinement processes.
    🤝 Implement cross-verification mechanisms for agents to validate each other.
    ❓ Design agents to proactively ask for clarification when needed.
    📜 Define structured conversation patterns and termination conditions.

    Github: https://lnkd.in/ebmCg28d Paper: https://lnkd.in/etgsH6BH
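
To make a few of these recommendations concrete, here is a minimal sketch of explicit role specs, a structured termination condition, and a self-verification pass in a two-agent loop. The `call_llm` helper, the role prompts, and the `MAX_TURNS` limit are illustrative assumptions, not taken from the paper; wire `call_llm` to whatever chat-completion API you use.

```python
# Minimal two-agent loop: explicit role specs, a termination condition,
# and a verification step. `call_llm` is a hypothetical wrapper around
# your chat-completion client of choice.

MAX_TURNS = 6  # illustrative hard stop to avoid "Unaware Stop" loops

SOLVER_ROLE = (
    "You are the Solver. Only propose a solution to the user's task. "
    "Do not review your own work; that is the Verifier's job. "
    "If the task is ambiguous, ask one clarifying question instead of guessing."
)

VERIFIER_ROLE = (
    "You are the Verifier. Only check the Solver's latest answer against the task. "
    "Reply 'APPROVED' if it fully satisfies the task, otherwise list concrete defects."
)

def call_llm(system: str, user: str) -> str:
    """Hypothetical helper: send one system + user message pair, return the reply."""
    raise NotImplementedError("wire this to your LLM client")

def solve_with_verification(task: str) -> str:
    answer, feedback = "", ""
    for _ in range(MAX_TURNS):
        prompt = task if not feedback else f"{task}\n\nVerifier feedback to address:\n{feedback}"
        answer = call_llm(SOLVER_ROLE, prompt)
        review = call_llm(VERIFIER_ROLE, f"Task:\n{task}\n\nSolver answer:\n{answer}")
        if review.strip().upper().startswith("APPROVED"):
            return answer  # explicit, structured termination condition
        feedback = review  # feed defects back instead of silently repeating steps
    return answer  # best effort after MAX_TURNS; caller can escalate to a human
```

The design choice worth noting is the explicit termination rule: the loop ends either on an "APPROVED" verdict or at a fixed turn budget, which directly targets the Unaware Stop and Step Repetition failure modes listed above.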

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    22,974 followers

    Evaluating LLMs is hard. Evaluating agents is even harder. This is one of the most common challenges I see when teams move from using LLMs in isolation to deploying agents that act over time, use tools, interact with APIs, and coordinate across roles. These systems make a series of decisions, not just a single prediction. As a result, success or failure depends on more than whether the final answer is correct.

    Despite this, many teams still rely on basic task success metrics or manual reviews. Some build internal evaluation dashboards, but most of these efforts are narrowly scoped and miss the bigger picture. Observability tools exist, but they are not enough on their own. Google's ADK telemetry provides traces of tool use and reasoning chains. LangSmith gives structured logging for LangChain-based workflows. Frameworks like CrewAI, AutoGen, and OpenAgents expose role-specific actions and memory updates. These are helpful for debugging, but they do not tell you how well the agent performed across dimensions like coordination, learning, or adaptability.

    Two recent research directions offer much-needed structure. One proposes breaking down agent evaluation into behavioral components like plan quality, adaptability, and inter-agent coordination. Another argues for longitudinal tracking, focusing on how agents evolve over time, whether they drift or stabilize, and whether they generalize or forget.

    If you are evaluating agents today, here are the most important criteria to measure:
    • 𝗧𝗮𝘀𝗸 𝘀𝘂𝗰𝗰𝗲𝘀𝘀: Did the agent complete the task, and was the outcome verifiable?
    • 𝗣𝗹𝗮𝗻 𝗾𝘂𝗮𝗹𝗶𝘁𝘆: Was the initial strategy reasonable and efficient?
    • 𝗔𝗱𝗮𝗽𝘁𝗮𝘁𝗶𝗼𝗻: Did the agent handle tool failures, retry intelligently, or escalate when needed?
    • 𝗠𝗲𝗺𝗼𝗿𝘆 𝘂𝘀𝗮𝗴𝗲: Was memory referenced meaningfully, or ignored?
    • 𝗖𝗼𝗼𝗿𝗱𝗶𝗻𝗮𝘁𝗶𝗼𝗻 (𝗳𝗼𝗿 𝗺𝘂𝗹𝘁𝗶-𝗮𝗴𝗲𝗻𝘁 𝘀𝘆𝘀𝘁𝗲𝗺𝘀): Did agents delegate, share information, and avoid redundancy?
    • 𝗦𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗼𝘃𝗲𝗿 𝘁𝗶𝗺𝗲: Did behavior remain consistent across runs or drift unpredictably?

    For adaptive agents or those in production, this becomes even more critical. Evaluation systems should be time-aware, tracking changes in behavior, error rates, and success patterns over time. Static accuracy alone will not explain why an agent performs well one day and fails the next.

    Structured evaluation is not just about dashboards. It is the foundation for improving agent design. Without clear signals, you cannot diagnose whether failure came from the LLM, the plan, the tool, or the orchestration logic. If your agents are planning, adapting, or coordinating across steps or roles, now is the time to move past simple correctness checks and build a robust, multi-dimensional evaluation framework. It is the only way to scale intelligent behavior with confidence.
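
As one possible shape for such a framework, here is a small sketch of a per-run evaluation record covering the dimensions above, plus a naive stability-over-time check across runs. The field names, weights, and drift threshold are illustrative assumptions, not drawn from any specific evaluation framework.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class AgentRunEval:
    """One evaluated agent run, scored on each dimension from 0.0 to 1.0."""
    run_id: str
    task_success: float   # was the outcome correct and verifiable?
    plan_quality: float   # was the initial strategy reasonable and efficient?
    adaptation: float     # retries, escalation, handling of tool failures
    memory_usage: float   # was memory referenced meaningfully?
    coordination: float   # delegation and info sharing in multi-agent settings
    notes: str = ""

def overall(e: AgentRunEval) -> float:
    """Unweighted average; in practice you would weight dimensions per use case."""
    return mean([e.task_success, e.plan_quality, e.adaptation,
                 e.memory_usage, e.coordination])

def drifted(history: list[AgentRunEval], window: int = 5, tolerance: float = 0.1) -> bool:
    """Naive drift check: compare the most recent runs against an earlier baseline."""
    if len(history) < 2 * window:
        return False  # not enough runs to say anything about stability
    baseline = mean(overall(e) for e in history[:window])
    recent = mean(overall(e) for e in history[-window:])
    return (baseline - recent) > tolerance
```

The point of the sketch is simply that scores are stored per dimension and per run, so a regression can be localized (for example, coordination dropped while task success held) rather than showing up only as an unexplained dip in overall accuracy.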

  • Diego Granados

    AI Product Manager @ Google | 🚀 Interested in AI Product Management? Check my profile!

    160,469 followers

    I saw a job posting for an AI PM at Figma yesterday, and it highlights why "vibe-launching" LLM products is not enough to become an AI PM. Anyone can build an LLM wrapper over the weekend, but that's not enough to be an AI PM at companies like Figma, Google, Microsoft, Anthropic, and so on. The reality is, this role was never just about prompting; it's about owning the machine learning lifecycle. I see a lot of aspiring AI PMs focus purely on the "creative" side of GenAI, but if you look closely at these job descriptions, they are asking for three very specific, very technical skills that define the role in 2026:

    1. Beyond the "Black Box" (LLMs & ML Fundamentals)
    Figma asks candidates to "prioritize model improvements." You can't do that if you don't understand what's happening under the hood. For example:
    🤖 LLMs (RAG vs. Fine-Tuning): If your chatbot fails, is it a retrieval (RAG) issue (it showed the wrong doc) or a fine-tuning issue (wrong tone)? If you don't know the difference, you can spend too much time "fixing" the wrong thing.
    📊 Traditional ML: Think about a Netflix recommendation system. If it recommends movies you hate, it's likely a data issue—maybe the model only trained on your weekend habits. You need to understand how data collection and training work so you can spot these bias issues before they ruin the user experience.

    2. Owning the "Definition of Good" (Evals & Metrics)
    In traditional software, a bug is a bug. In AI, "quality" is subjective—and that is terrifying for a roadmap. That's why you see requirements for "experience with evaluation and iteration."
    🥇 LLMs (Golden Datasets): You have to move beyond "it feels good." You need to learn how to build golden datasets—essentially a set of ground-truth examples that you define as the perfect answers. When engineering updates the model, you run it against this dataset. If the score drops, you don't launch.
    🎯 Traditional ML (Context): You need to understand why an 80% precision score might be great for a music recommendation, but 90% could be a total disaster for a fraud detection model.

    3. Scaling (Reliability & MLOps)
    Making a demo work for one person is easy. Scaling to 10,000 is hard. When companies ask for "scaling experience," they are talking about the unsexy stuff: latency, cost, and reliability. You need to get familiar with the MLOps landscape—tools like LangSmith or Arize for tracing errors, or Datadog for monitoring latency.

    The biggest hurdle isn't Python. It's moving from deterministic code (if A, then B) to probabilistic outcomes (if A, then probably B). It changes how you think about roadmaps and how you manage user expectations when you can't guarantee a specific output 100% of the time.

    👋 If you're trying to move into an AI PM role, what's the biggest challenge you are facing?

    💎 I've been an AI PM for 6+ years. If you want to dive deeper into AI Product Management, check my comment below for resources!
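
To illustrate the golden-dataset idea from the post above, here is a minimal sketch of a launch gate: run the updated model over a set of ground-truth examples and block the launch if the score drops below the previous baseline. The `generate_answer` function, the example data, and the exact-match grading rule are placeholder assumptions; real evals typically use rubric- or LLM-judge-based scoring.

```python
# Minimal golden-dataset regression gate. `generate_answer` is a placeholder
# for whatever model or pipeline is being evaluated; exact-match grading is a
# simplification of rubric-based or LLM-as-judge scoring.

GOLDEN_SET = [
    {"prompt": "What is our refund window?", "expected": "30 days"},
    {"prompt": "Which plan includes SSO?",   "expected": "Enterprise"},
]

def generate_answer(prompt: str) -> str:
    """Placeholder: call the model or pipeline under evaluation."""
    raise NotImplementedError("wire this to the system being evaluated")

def golden_score() -> float:
    """Fraction of golden examples answered correctly (exact match)."""
    correct = sum(
        generate_answer(ex["prompt"]).strip() == ex["expected"]
        for ex in GOLDEN_SET
    )
    return correct / len(GOLDEN_SET)

def launch_gate(baseline: float) -> bool:
    """If the score drops below the previous baseline, don't launch."""
    return golden_score() >= baseline
```

Usage is the same as the post describes: record the current model's score as the baseline, and after every model or prompt update call `launch_gate(baseline)` in CI before shipping.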

  • Steve Isley

    Head of AI and Knowledge Management | Full Stack Builder | Entrepreneur | Research Scientist | Ex-Amazon

    2,712 followers

    𝗪𝗲 𝗻𝗲𝗲𝗱 𝗯𝗲𝘁𝘁𝗲𝗿 𝘁𝗲𝗿𝗺𝘀 𝗳𝗼𝗿 𝘁𝘄𝗼 𝗱𝗶𝘀𝘁𝗶𝗻𝗰𝘁 𝗿𝗼𝗹𝗲𝘀 𝗶𝗻 𝗔𝗜: “𝗯𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗟𝗟𝗠𝘀” 𝘃𝘀. “𝗯𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗟𝗟𝗠𝘀.”

    Those “building LLMs” are creating new language models and have deep machine learning expertise. Those “building with LLMs” are embedding these models into larger systems to solve specific problems. We need better terms because the skill sets are wildly different. If you’re a hiring manager with an AI project based on existing LLMs, you don’t need an AI Engineer, you need a … what?

    I’ve been using “Applied AI Engineer,” but I recently ran into this blog from the people at Sierra (https://lnkd.in/djReMAYm) that uses the term “Agent Engineer”. I like this term because it’s shorter, more specific in job searches, and focused on the unique differentiator. Agent Engineers don’t need machine learning expertise. Instead, they need skills that aren’t well defined yet. Here’s my attempt at listing some of them:

    ⚫ Knowledge Engineering: How will your app use data, and how will you feed it into the LLM? This includes data modeling, handling free-form text, embedding, chunking, metadata selection, and strategies for retrieval using vectors or graphs.
    ⚫ Cognitive Architecting: How will your app split up the work amongst LLMs? Sometimes this is simple: you don’t split up the work and just use a single LLM. However, this is rarely the case these days. Instead, a solution likely involves orchestrating dozens of separate LLMs, some with access to custom tools, some using older models to get cheaper, faster results, some using self-reflection to iteratively improve outputs, and much more.
    ⚫ Prompt Engineering: This includes crafting instructions for the LLM and formatting inputs. But it’s more than that—it also involves testing and iterating prompts over time for better results.

    Feedback welcome. Any important skill sets you think I missed?
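
As a small illustration of the knowledge-engineering skill described above, here is a sketch of fixed-size chunking with overlap followed by vector retrieval via cosine similarity. The `embed` function is a placeholder for whatever embedding model you use, and the chunk size and overlap are illustrative defaults, not recommendations from the post.

```python
import math

def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunking with overlap; real pipelines often split on structure."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

def embed(text: str) -> list[float]:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError("wire this to an embedding model")

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Embed the query and each chunk, return the top_k most similar chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]
```

In practice the chunk embeddings would be computed once and stored in a vector index rather than re-embedded per query; the sketch keeps everything inline only to show the chunk-embed-retrieve flow in one place.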
