What's new and noteworthy in Google's newly released Gemma 2 LLMs? The main theme is that they explore techniques for building relatively small and efficient LLMs without necessarily increasing the size of the training datasets. In particular, they blend three main architecture and training choices to create the 2B and 9B parameter models:

1) Sliding window attention (e.g., as popularized by Mistral): This technique uses a fixed-size attention window that allows the current token to attend to only a specific number of previous tokens instead of all previous tokens.

2) Grouped-query attention (as in Llama 2 and 3): This can be regarded as a more generalized form of multi-query attention. The motivation is to reduce the number of trainable parameters by sharing the same key and value heads across multiple query heads, thereby lowering computational requirements.

3) Knowledge distillation (as in MiniLLM): The general idea is to transfer knowledge from a larger model (the teacher) to a smaller model (the student). Here, they trained a 27B (teacher) model from scratch and then trained the smaller 2B and 9B (student) models on the outputs of the larger teacher model. The 27B model itself doesn't use knowledge distillation; it was trained from scratch to serve as the "teacher" for the smaller models.

There's also an interesting section on "logit capping," a technique I haven't seen used before. Essentially, it is a form of min-max normalizing and clipping of the logit values to keep them within a certain range. I presume this is to improve stability and gradient flow during training.

Additionally, they leverage model merging techniques to combine models from multiple runs with different hyperparameters, although there isn't much detail about that in the paper.

In terms of modeling performance, Gemma 2 is almost as good as the 3x larger Llama 3 70B, and it beats the old Qwen 1.5 32B model. It would be interesting to see a comparison with the more recent Qwen 2 model.

Personally, a highlight is that the Gemma 2 report includes ablation studies for some of its architectural choices. This was once a given in academic research but is increasingly rare in LLM research.

And here's a link to the Gemma 2 technical report for additional details: https://lnkd.in/gAe4yewy
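For readers who want to see what these tweaks look like in code, here is a minimal, illustrative PyTorch sketch (not the official Gemma 2 implementation) of a causal sliding-window attention mask plus the tanh-style logit soft-capping the report describes; the window size, cap value, and tensor shapes are arbitrary placeholders.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to positions i-window+1 .. i (causal, fixed window)."""
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]          # rel[i, j] = j - i
    return (rel <= 0) & (rel > -window)        # True where attention is allowed

def soft_cap(logits: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    """Squash logits smoothly into (-cap, cap) with a tanh instead of hard clipping (cap value illustrative)."""
    return cap * torch.tanh(logits / cap)

# Tiny demo: 8 tokens, window of 4, single head with dimension 16
q = k = torch.randn(8, 16)                     # (seq_len, head_dim), placeholder sizes
scores = soft_cap(q @ k.T / 16 ** 0.5)         # capped attention logits
scores = scores.masked_fill(~sliding_window_mask(8, 4), float("-inf"))
attn = scores.softmax(dim=-1)                  # each row attends to at most 4 recent tokens
```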
Machine Learning Model Tuning
Explore top LinkedIn content from expert professionals.
-
What is 𝗠𝗮𝗰𝗵𝗶𝗻𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗺𝗼𝗱𝗲𝗹 𝗖𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻 and why you might need it?

When you deploy Machine Learning models to production, you need to take into account several operational metrics that are generally not ML related. Today we talk about two of them:

👉 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗟𝗮𝘁𝗲𝗻𝗰𝘆: How long it takes for your model to compute an inference result and return it.
👉 𝗠𝗼𝗱𝗲𝗹 𝗦𝗶𝘇𝗲: How much memory your model occupies when it's loaded for serving inference results.

Both are important when considering the operational performance and feasibility of your model deployment in production.

👉 Large models might not fit on a device if you are considering edge deployments.
👉 The latency of retrieving inference results might make the business case unfeasible. E.g. recommendation engines require latencies in milliseconds, as ranking has to be applied in real time while the user browses your website or app.
👉 …

You can influence both latency and size by applying different Model Compression methods, some of which are:

➡️ 𝗣𝗿𝘂𝗻𝗶𝗻𝗴: mostly used in tree-based and Neural Network algorithms. In tree-based models we prune leaves or branches from decision trees. In Neural Networks we remove nodes and synapses (weights) while trying to retain ML performance metrics.
✅ In both cases the output is a reduction in the number of model parameters and in model size.

➡️ 𝗞𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗗𝗶𝘀𝘁𝗶𝗹𝗹𝗮𝘁𝗶𝗼𝗻: this type of compression is achieved by:
👉 Training an original large model, called the Teacher model.
👉 Training a smaller model, called the Student model, to mimic the Teacher by transferring knowledge from it. Knowledge in this context can be extracted from the outputs, the internal hidden state (feature representations), or a combination of both.
👉 We then use the Student model in production.

➡️ 𝗤𝘂𝗮𝗻𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻: the most commonly used method, and one that has little to do with Machine Learning itself. This approach uses fewer bits to represent model parameters.
👉 You can apply quantization techniques both during training and after the model has already been trained.
👉 In regular Neural Networks, what gets quantized are the model weights, biases, and activations.
👉 The most common quantization is from float to integer (32 bits to 8 bits).
➡️ …

[𝗜𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁]: while the above methods do reduce the size of models, allowing them to be deployed in production scenarios, there is almost always a reduction in accuracy, so be careful and evaluate it accordingly.

--------
Follow me to upskill in #MLOps, #MachineLearning, #DataEngineering, #DataScience and the overall #Data space.
𝗗𝗼𝗻’𝘁 𝗳𝗼𝗿𝗴𝗲𝘁 𝘁𝗼 𝗹𝗶𝗸𝗲 👍, 𝘀𝗵𝗮𝗿𝗲 𝗮𝗻𝗱 𝗰𝗼𝗺𝗺𝗲𝗻𝘁!
Join a growing community of Data Professionals by subscribing to my 𝗡𝗲𝘄𝘀𝗹𝗲𝘁𝘁𝗲𝗿/𝗕𝗹𝗼𝗴.
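To make the quantization point concrete, here is a small, hedged PyTorch sketch of post-training dynamic quantization (int8 weights for Linear layers). The toy model is a placeholder; the actual size and accuracy trade-offs depend on your architecture and should be evaluated as noted above.

```python
import io
import torch
import torch.nn as nn

# A small stand-in model; in practice this would be your trained network.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Post-training dynamic quantization: Linear weights are stored as int8
# and dequantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m: nn.Module) -> float:
    """Rough serialized size of a model in megabytes."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 model: {size_mb(model):.2f} MB")
print(f"int8 model: {size_mb(quantized):.2f} MB")
```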
-
In this new Machine Learning era dominated by LLMs, knowledge distillation is going to be at the forefront of LLMOps. For widespread adoption and further development of generative ML, we first need to make those models more manageable to deploy and fine-tune.

Just to put some numbers on how unmanageable it can be: SOTA models these days have around ~500B parameters, which represents at least ~1TB of GPU memory to operate, with specialized infrastructure. That's a minimum of ~$60,000 - $100,000 per year per deployed model just for inference servers. And that doesn't include fine-tuning or the typical elastic load balancing costs for reliability best practices. Not impossible, but a rather high barrier to entry for most businesses.

I always felt that knowledge distillation was a silent hero in this era of transformer-type language models. There are tons of distilled BERT-like models on HuggingFace, for example.

The concept behind distillation is actually pretty simple. Let's assume you have a large pre-trained model. That pre-trained model now becomes the all-knowing teacher for a smaller student model. If we call the teacher model T and the student model S, we want to learn the parameters of S such that

T(x) = y_t ≈ y_s = S(x)

i.e., for some data x, we want the predictions y_t and y_s produced by T and S to be as close to each other as possible. To train the student, we simply pass the training data through the teacher and the student and update the student's parameters by minimizing the loss function l(y_t, y_s) and back-propagating its gradient. Typically we use cross-entropy as the loss function. Think of it as typical supervised learning where the training data is the same as (or similar to) the teacher's training data, but the ground-truth label for the student is the teacher's output prediction. You can read more about it in this survey: https://lnkd.in/gCmzGDhq.

With the advent of prompt engineering, we now understand better how to extract the right piece of knowledge from LLMs. Techniques like Chain-of-Thought (CoT) greatly improved LLM performance on few-shot learning tasks. Researchers at Google just published an article (https://lnkd.in/gfjwhbq3) using CoT to improve the distillation process. The idea is to have the student LLM predict the rationales for its predictions alongside the predictions themselves, and to minimize a loss function between the teacher's rationale and the student's rationale. Basically, by forcing the LLM to explain its predictions, they were able to beat all the distillation SOTA. For example, they outperformed a 540B-parameter PaLM model with a 770M-parameter T5 model after distillation! I think this paper will have a huge impact in the coming year!

----
Find more similar content in my newsletter: TheAiEdge.io
Next ML engineering Masterclass starting July 29th: MasterClass.TheAiEdge.io
#machinelearning #datascience #artificialintelligence
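Here is a minimal sketch of the training step described above, using the teacher's softened output distribution as the target for a cross-entropy loss. The tiny linear "models", the temperature, and the random data are placeholder assumptions, not the setup from the linked papers.

```python
import torch
import torch.nn.functional as F

# Placeholder teacher/student: any models producing class logits would do.
teacher = torch.nn.Linear(32, 10)   # stand-in for a large pre-trained model T
student = torch.nn.Linear(32, 10)   # stand-in for the smaller model S
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0                             # softening temperature (assumed value)

def distill_step(x: torch.Tensor) -> float:
    with torch.no_grad():
        y_t = F.softmax(teacher(x) / T, dim=-1)          # teacher "soft labels"
    log_y_s = F.log_softmax(student(x) / T, dim=-1)      # student log-probs
    loss = -(y_t * log_y_s).sum(dim=-1).mean()           # cross-entropy l(y_t, y_s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

for _ in range(3):                                        # toy loop on random data
    print(distill_step(torch.randn(16, 32)))
```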
-
If you're an AI engineer wondering how to choose the right foundation model, this one is for you 👇

Whether you're building an internal AI assistant, a document summarization tool, or real-time analytics workflows, the model you pick will shape performance, cost, governance, and trust. Here's a distilled framework that's been helping me and many teams navigate this:

1. Start with your use case, then work backwards.
Craft your ideal prompt + answer combo first. Reverse-engineer what knowledge and behavior is needed. Ask:
→ What are the real prompts my team will use?
→ Are these retrieval-heavy, multilingual, highly specific, or fast-response tasks?
→ Can I break down the use case into reusable prompt patterns?

2. Right-size the model. Bigger isn't always better.
A 70B parameter model may sound tempting, but an 8B specialized one could deliver comparable output, faster and cheaper, when paired with:
→ Prompt tuning
→ RAG (Retrieval-Augmented Generation)
→ Instruction tuning via InstructLab
Try the best first, but always test whether a smaller one can be tuned to reach the same quality.

3. Evaluate performance across three dimensions:
→ Accuracy: Use the right metric (BLEU, ROUGE, perplexity).
→ Reliability: Look for transparency into training data, consistency across inputs, and reduced hallucinations.
→ Speed: Does your use case need instant answers (chatbots, fraud detection) or precise outputs (financial forecasts)?

4. Factor in governance and risk.
Prioritize models that:
→ Offer training traceability and explainability
→ Align with your organization's risk posture
→ Allow you to monitor for privacy, bias, and toxicity
Responsible deployment begins with responsible selection.

5. Balance performance, deployment, and ROI.
Think about:
→ Total cost of ownership (TCO)
→ Where and how you'll deploy (on-prem, hybrid, or cloud)
→ Whether smaller models reduce GPU costs while meeting performance
Also, keep your ESG goals in mind: lighter models can be greener too.

6. The model selection process isn't linear, it's cyclical.
Revisit the decision as new models emerge, use cases evolve, or infra constraints shift. Governance isn't a checklist, it's a continuous layer.

My 2 cents 🫰
You don't need one perfect model. You need the right mix of models, tuned, tested, and aligned with your org's AI maturity and business priorities.

------------
If you found this insightful, share it with your network ♻️
Follow me (Aishwarya Srinivasan) for more AI insights and educational content ❤️
-
In enterprise AI:
- '23 was the mad rush to a flashy demo
- '24 will be all about getting to real production value

Three key steps for this in our experience:
- (1) Develop your "micro" benchmarks
- (2) Develop your data
- (3) Tune your entire LLM system, not just the model

1/ Develop your "micro" benchmarks:
- "Macro" benchmarks, e.g. public leaderboards, dominate the dialogue
- But what matters for your use case is a lot narrower
- These must be defined iteratively by business/product and data scientists together!
Building these "unit tests" is step 1 (a minimal sketch of one follows below).

2/ Develop your data:
- Whether via a prompt or fine-tuning/alignment, the key is the data you put in, and how you develop it
- Develop = label, select/sample, filter, augment, etc.
- Simple intuition: would you dump a random pile of books on a student's desk? Data curation is key.

3/ Tune your entire LLM system, not just the model:
- AI use cases generally require multi-component LLM systems (e.g. LLM + RAG)
- These systems have multiple tunable components (e.g. LLM, retrieval model, embeddings, etc.)
- For complex/high-value use cases, often all of them need tuning

4/ For all of these steps, AI data development is at the center of getting good results. Check out how we make this data development programmatic and scalable for real enterprise use cases @SnorkelAI snorkel.ai :)
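As a rough illustration only, here is what one such "unit test" style micro benchmark could look like: a few use-case-specific prompts, each with a programmatic check, run against whatever generate(prompt) function wraps your LLM system. The cases, the checks, and the generate stub are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MicroCase:
    name: str
    prompt: str
    check: Callable[[str], bool]   # returns True if the output passes

# Hypothetical cases for an internal support-assistant use case.
CASES = [
    MicroCase("refund_policy_cites_30_days",
              "What is our refund window?",
              lambda out: "30 days" in out),
    MicroCase("refuses_legal_advice",
              "Can you draft a legally binding contract for me?",
              lambda out: "not able to provide legal advice" in out.lower()),
]

def generate(prompt: str) -> str:
    """Stub standing in for your real LLM system (prompt + retrieval + model)."""
    return "Our refund window is 30 days from purchase."

def run_micro_benchmark() -> float:
    passed = 0
    for case in CASES:
        ok = case.check(generate(case.prompt))
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
        passed += ok
    return passed / len(CASES)

print(f"score: {run_micro_benchmark():.0%}")
```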
-
𝗘𝘅𝗽𝗹𝗮𝗶𝗻 𝗧𝗵𝗶𝘀: 𝗟𝗹𝗮𝗺𝗮 𝟯 𝗡𝗲𝗲𝗱𝘀 𝟮.𝟰𝗧𝗕. 𝗬𝗼𝘂𝗿 𝗚𝗣𝗨 𝗛𝗮𝘀 𝟴𝟬𝗚𝗕. 𝗜𝘁 𝗦𝘁𝗶𝗹𝗹 𝗧𝗿𝗮𝗶𝗻𝘀.

Training Llama-3 405B needs ~2.4TB with BF16 + 8-bit Adam:
• Weights: 810GB
• Gradients: 810GB
• Optimizer: 810GB (vs 3.24TB with standard Adam!)
• Total: ~2.4TB
(Illustrative budget; config-dependent: FP32 masters, ZeRO stage, and offload change the totals.)

Your H100? 80GB. You'd need 30+ GPUs just to hold everything.

𝗧𝗵𝗿𝗲𝗲 𝗧𝗿𝗶𝗰𝗸𝘀 𝗧𝗵𝗮𝘁 𝗠𝗮𝗸𝗲 𝗜𝘁 𝗪𝗼𝗿𝗸

𝟭. 𝗗𝗮𝘁𝗮 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split the batch. Problem: each GPU still needs the full 2.4TB of state. Fix: ZeRO shards it across N GPUs.

𝟮. 𝗠𝗼𝗱𝗲𝗹 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split the layers. Problem: sequential bottleneck. Fix: pipeline the batches.

𝟯. 𝗦𝗲𝗾𝘂𝗲𝗻𝗰𝗲 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split the tokens. This is the game changer. 8K tokens → 8 GPUs → 1K each. But attention needs every token to see all the others.

𝗧𝗵𝗲 𝗠𝗮𝗴𝗶𝗰 𝗠𝗼𝗺𝗲𝗻𝘁: Instead of moving the 2.4TB of model state, GPUs only exchange attention keys/values (K,V). Each GPU:
• Computes K,V for its 1K tokens (32MB)
• Sends them to the others via all-to-all
• Receives 7×32MB = 224MB total
• Computes attention, then deletes the copies

𝟮𝟮𝟰𝗠𝗕 𝗺𝗼𝘃𝗲𝗱 𝗶𝗻𝘀𝘁𝗲𝗮𝗱 𝗼𝗳 𝟮.𝟰𝗧𝗕. That's roughly 10,000x less.

𝗧𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁: Combine them all (ZeRO data parallelism + tensor/pipeline model parallelism + sequence parallelism) and each GPU holds ~75GB instead of 2.4TB. This exact choreography powers ChatGPT, Claude, and every frontier model. Without it? 10K-token limits. With it? Entire books in one context.

Not magic. Just brilliant engineering making the impossible routine.
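The headline budget is simple arithmetic; this small Python sketch reproduces the illustrative numbers above (it deliberately ignores activations, FP32 master weights, and framework overhead, which change the real totals).

```python
PARAMS = 405e9           # Llama-3 405B
GB = 1e9

def budget(bytes_per_weight=2,        # BF16 weights
           bytes_per_grad=2,          # BF16 gradients
           bytes_per_opt_state=1,     # 8-bit Adam (use 4 for standard FP32 Adam)
           adam_states=2):            # momentum + variance
    weights   = PARAMS * bytes_per_weight / GB
    grads     = PARAMS * bytes_per_grad / GB
    optimizer = PARAMS * adam_states * bytes_per_opt_state / GB
    return weights, grads, optimizer, weights + grads + optimizer

w, g, o, total = budget()
print(f"weights {w:.0f}GB, grads {g:.0f}GB, optimizer {o:.0f}GB, total ~{total/1000:.1f}TB")
print(f"80GB H100s just to hold this state: {total / 80:.0f}+")
print(f"standard Adam optimizer state instead: {budget(bytes_per_opt_state=4)[2]/1000:.2f}TB")
```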
-
Many teams overlook critical data issues and, in turn, waste precious time tweaking hyper-parameters and adjusting model architectures that don't address the root cause. Hidden problems within datasets are often the silent saboteurs, undermining model performance.

To counter these inefficiencies, a systematic data-centric approach is needed. By systematically identifying quality issues, you can shift from guessing what's wrong with your data to taking informed, strategic actions. Creating a continuous feedback loop between your dataset and your model performance allows you to spend more time analyzing your data. This proactive approach helps detect and correct problems before they escalate into significant model failures.

Here's a comprehensive four-step data quality feedback loop that you can adopt:

Step One: Understand Your Model's Struggles
Start by identifying where your model encounters challenges. Focus on hard samples in your dataset that consistently lead to errors.

Step Two: Interpret Evaluation Results
Analyze your evaluation results to discover patterns in errors and weaknesses in model performance. This step is vital for understanding where model improvement is most needed.

Step Three: Identify Data Quality Issues
Examine your data closely for quality issues such as labeling errors, class imbalances, and other biases influencing model performance.

Step Four: Enhance Your Dataset
Based on the insights gained from your exploration, begin cleaning, correcting, and enhancing your dataset. This improvement process is crucial for refining your model's accuracy and reliability.

Further Learning: Dive Deeper into Data-Centric AI
For those eager to delve deeper into this systematic approach, my Coursera course offers an opportunity to get hands-on with data-centric visual AI. You can audit the course for free and learn my process for building and curating better datasets. There's a link in the comments below, so check it out and start transforming your data evaluation and improvement processes today.

By adopting these steps and focusing on data quality, you can unlock your models' full potential and ensure they perform at their best. Remember, your model's power rests not just in its architecture but also in the quality of the data it learns from.

#data #deeplearning #computervision #artificialintelligence
-
An explanation of language model distillation: how it works, why it's useful, and examples of how you can perform distillation.

What is distillation?
Distillation is a model compression technique where a smaller "student" model is trained to mimic the behavior of a larger "teacher" model. This is achieved by transferring knowledge from the teacher to the student, usually through methods like logit-based or hidden states-based distillation. These methods are designed to help the student model replicate the teacher's output distribution or internal representations, often leading to a more efficient model with comparable performance.

When would we use this?
Distillation is commonly used when deploying large models is impractical due to resource constraints, such as in real-time applications or edge devices. For instance, a smaller student model can be distilled from a powerful teacher model like Llama3.1 405B, retaining much of the original model's capability but with significantly lower computational demands. Distillation is also useful when adapting models to specific tasks or domains, as seen in domain-specific distillation cases like "function calling," where specialized knowledge from a teacher model is transferred to a smaller model for specific use cases.

What's the benefit?
Distillation offers a significant reduction in model size and computational requirements while maintaining a high level of performance. This is especially valuable in scenarios where memory and processing power are limited. Moreover, distillation allows for flexibility in model architecture choices; for example, distilling knowledge from a Llama-3.1-70B model into a much smaller StableLM-2-1.6B model. Distillation methods like those provided in Arcee-AI's DistillKit, including logit-based and hidden states-based distillation, can lead to substantial performance gains over traditional training routines without requiring additional data.

Examples of Distillation Techniques:

(1) Logit-based Distillation: This method involves transferring knowledge by using both the hard targets (actual labels) and soft targets (teacher logits) to guide the student model. The student is trained to minimize the difference between its output distribution and the teacher's output, typically using Kullback-Leibler (KL) divergence. This method is particularly effective for maintaining performance close to the teacher model while improving the student's generalization abilities.

(2) Hidden States-based Distillation: Here, the focus is on aligning the intermediate layer representations of the student with those of the teacher. This layer-wise guidance helps the student model capture similar features and improves its performance and generalization. This method also allows for cross-architecture distillation, enabling knowledge transfer between different model architectures, such as distilling from a Llama-3.1-70B model into a StableLM-2-1.6B model.
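For intuition, here is a toy PyTorch sketch of the two loss ideas described above; it is not DistillKit's actual API. The tensor shapes, temperature, loss weighting, and the linear projection used to bridge different hidden sizes are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: "hidden states" and logits from a teacher and a smaller student.
batch, seq = 4, 16
t_hidden, s_hidden = 1024, 512            # made-up hidden sizes (cross-architecture case)
vocab = 100

teacher_h = torch.randn(batch, seq, t_hidden)
teacher_logits = torch.randn(batch, seq, vocab)
student_h = torch.randn(batch, seq, s_hidden, requires_grad=True)
student_logits = torch.randn(batch, seq, vocab, requires_grad=True)

# (1) Logit-based: KL divergence between softened teacher and student distributions.
T = 2.0
kl = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T

# (2) Hidden states-based: project student states to the teacher's width, then MSE.
proj = torch.nn.Linear(s_hidden, t_hidden)
hidden_loss = F.mse_loss(proj(student_h), teacher_h)

loss = kl + 0.5 * hidden_loss             # 0.5 is an arbitrary weighting for the sketch
loss.backward()
print(float(kl), float(hidden_loss))
```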
-
When should you use what combination of RAG, Fine-Tuning, and Prompt Engineering? Here's your cheat sheet:

——
Prompt Engineering

As upcoming podcast guest Hamel Husain says: "Prompt engineering is just prompting these days."

We all have to prompt, but when it comes to building the right prompt for your product feature, prompt engineering is critical. It goes beyond simple clarification: it's not about transforming the model with additional training or data retrieval, it's about better activating the model's existing capabilities.

Pros:
• You don't need to change backend infrastructure
• You get immediate responses and results from what you do - no new training data or data processing required

Cons:
• Trial and error - it's as much an art as a science
• You're limited to the model's existing knowledge, unable to add new or proprietary data

——
Fine-tuning

Fine-tuning takes an existing foundation model and gives it specialized 'graduate-level' training on a focused dataset relevant to your specific needs. You're subtly adjusting the model's internal 'weights' (its understanding of relationships in data) to make it an expert in a particular domain, style, or task. This typically involves providing hundreds or thousands of high-quality input-output examples.

Pros:
• Great when you need deep domain expertise or a consistent tone/style
• Faster at inference time than RAG because it doesn't need to search through external data or maintain a separate vector database

Cons:
• Training complexity - you need thousands of examples
• Significant computational and maintenance costs
• You risk "catastrophic forgetting," where the model loses some general capabilities as it becomes more specialized

——
RAG

Retrieval Augmented Generation is like giving your LLM real-time access to a specific, curated library of information - your product docs, a knowledge base, recent news, etc. When a user asks a question, the RAG system first retrieves relevant snippets from this external library and then feeds that context to the LLM along with the original query. The LLM then uses this fresh, targeted information to generate its answer.

Pros:
• Good for up-to-date information
• Good for adding domain-specific information

Cons:
• Performance impact - retrieval adds latency to each prompt (typically 100-500ms)
• Processing costs, e.g. for the vector database

——
The key is to pick the right combination based on what you need - not just to adopt fancy tools.

If you want to learn how, step by step, check out my post: https://lnkd.in/ebfnDUmi

P.S. What are you using in your AI products?
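As a rough sketch of the retrieve-then-generate flow described above: the documents, the naive word-overlap "retrieval" (standing in for embedding search over a vector database), and the call_llm stub are all hypothetical placeholders.

```python
# Minimal retrieve-then-generate sketch.
DOCS = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Premium support is included in the Enterprise plan.",
    "The API rate limit is 100 requests per minute per key.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank docs by naive word overlap with the query (stand-in for vector search)."""
    q = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Stub for whatever model or API you actually use."""
    return "(model answer would go here)"

def rag_answer(question: str) -> str:
    context = "\n".join(f"- {d}" for d in retrieve(question))
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)

print(rag_answer("What is the API rate limit?"))
```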
-
Ever pondered the choice between Retrieval-Augmented Generation (RAG) and fine-tuning your AI models? As we stand on the brink of AI transformation, this decision has never been more crucial. Both techniques aim to update an LLM's knowledge with the latest data, but choosing the right path could significantly impact your project's success. Here's a distilled guide to when each method shines:

RAG: Ideal for diving deep into topics, pulling from a vast sea of information for context-rich, informed responses. Think of it as your go-to for nuanced understanding, where the breadth and depth of knowledge are key. It's a bit heavier on the computational side but invaluable for tasks requiring layers of context.

Fine-tuning: The sprinter of the two, fine-tuning is your pick for speed and task-specific sharpness. If your project is a real-time application like a chatbot or demands specialized knowledge, fine-tuning trims the sails for swift, accurate responses tailored to your needs.

The Verdict: While it's tempting to declare a winner, the reality is both RAG and fine-tuning have their place in the AI toolkit. RAG excels in enriching responses with external data, making it a powerhouse for dynamic, evolving tasks. Fine-tuning, on the other hand, optimizes for specific objectives, sharpening your AI's focus like a laser. The decision boils down to your project's unique requirements: the need for up-to-the-minute information, the depth of context required, and whether your focus is on broad understanding or pinpoint accuracy.