pytorch
diff --git a/‎_posts/2024-10-28-unleashing-ai-mobile.md
Lines changed: 1 addition & 0 deletions b/‎_posts/2024-10-28-unleashing-ai-mobile.md
Lines changed: 1 addition & 0 deletions
diff --git a/‎_posts/2024-11-01-cutlass-ping-pong-gemm-kernel.md
Lines changed: 1 addition & 0 deletions b/‎_posts/2024-11-01-cutlass-ping-pong-gemm-kernel.md
Lines changed: 1 addition & 0 deletions
diff --git a/‎_posts/2024-11-21-rebellions.md
Lines changed: 1 addition & 0 deletions b/‎_posts/2024-11-21-rebellions.md
Lines changed: 1 addition & 0 deletions
diff --git a/‎_posts/2024-11-25-training-using-float8-fsdp2.md
Lines changed: 1 addition & 0 deletions b/‎_posts/2024-11-25-training-using-float8-fsdp2.md
Lines changed: 1 addition & 0 deletions
diff --git a/‎_posts/2024-12-02-hadacore.md
Lines changed: 207 additions & 0 deletions b/‎_posts/2024-12-02-hadacore.md
Lines changed: 207 additions & 0 deletions
diff --git a/‎assets/images/hadacore/fg1.png
247 KB b/‎assets/images/hadacore/fg1.png
247 KB
diff --git a/‎assets/images/hadacore/fg2.png
59.7 KB b/‎assets/images/hadacore/fg2.png
59.7 KB
diff --git a/‎assets/images/hadacore/fg3.png
22.1 KB b/‎assets/images/hadacore/fg3.png
22.1 KB
diff --git a/‎assets/images/hadacore/fg4.png
247 KB b/‎assets/images/hadacore/fg4.png
247 KB
diff --git a/‎assets/images/hadacore/fg5.png
81.4 KB b/‎assets/images/hadacore/fg5.png
81.4 KB
@@ -2,6 +2,7 @@
 layout: blog_detail
 title: "Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch and KleidiAI"
 author: Gian Marco Iodice, Arm and Digant Desai, Meta
+excerpt: "At the recent PyTorch Conference, Arm highlighted the widespread impact of its technology, spanning from cloud to edge, emphasizing its commitment to delivering its advanced AI computing capabilities seamlessly to millions of developers worldwide."
 ---
 
 ## Introduction
 
@@ -2,6 +2,7 @@
 layout: blog_detail
 title: "Deep Dive on CUTLASS Ping-Pong GEMM Kernel"
 author: Less Wright, Adnan Hoque
+excerpt: "In this post, we provide an overview, with relevant FP8 inference kernel benchmarking, of the CUTLASS Ping-Pong GEMM kernel."
 ---
 
 ![Figure 1. FP8 GEMM Throughput Comparison CUTLASS vs Triton](/assets/images/cutlass-ping-pong-gemm-kernel/fg1.png){:style="width:100%"}
 
@@ -1,6 +1,7 @@
 ---
 layout: blog_detail
 title: "Rebellions Joins the PyTorch Foundation as a General Member"
+excerpt: "The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Rebellions has joined as a general member."
 ---
 
 ![Rebellions logo](/assets/images/rebellions-logo.svg){:style="max-width:350px;width:100%;float:right;margin: 20px;"}
 
@@ -2,6 +2,7 @@
 layout: blog_detail
 title: "Supercharging Training using float8 and FSDP2"
 author: "IBM and Meta"
+excerpt: "In this blog, we will demonstrate how we achieve up to 50% throughput speedup while achieving loss and evaluation benchmark parity in training over FSDP1 bf16 training"
 ---
 
 **IBM**: Tuan Hoang Trong, Alexei Karve, Yan Koyfman, Linsong Chu, Divya Kumari, Shweta Salaria, Robert Walkup, Praneet Adusumilli, Nirmit Desai, Raghu Ganti, Seetharami Seelam  
 
@@ -0,0 +1,207 @@
+---
+layout: blog_detail
+title: "HadaCore: Tensor Core Accelerated Hadamard Transform Kernel"
+author: "IBM and Meta"
+excerpt: "Quantization is a method for improving model inference speeds by compressing model weights and performing (faster) computation in lower precision data types. However, quantization can result in accuracy loss due to the presence of outliers."
+---
+
+**IBM**: Krish Agarwal, Rishi Astra, Adnan Hoque, Mudhakar Srivatsa, Raghu Ganti  
+**Meta**: Less Wright, Sijia Chen
+
+Quantization is a method for improving model inference speeds by compressing model weights and performing (faster) computation in lower precision data types. However, quantization can result in accuracy loss due to the presence of outliers. Recent works like [QuaRot](https://arxiv.org/abs/2404.00456), [SpinQuant](https://arxiv.org/abs/2405.16406), and [FlashAttention-3](https://arxiv.org/pdf/2407.08608) introduce methods to increase the numerical accuracy of INT4, INT8 and FP8 quantization in LLMs. These methods rely on [Hadamard Transforms](https://en.wikipedia.org/wiki/Hadamard_transform). In this blog, we present HadaCore, a Hadamard Transform CUDA kernel that achieves state-of-the-art performance on NVIDIA A100 and H100 GPUs. Our kernel achieves speedups of **1.1–1.4x** and **1.0–1.3x**, with a peak gain of **3.5x** and **3.6x** respectively, over Dao AI Lab’s [Fast Hadamard Transform Kernel](https://github.com/Dao-AILab/fast-hadamard-transform). We leverage a hardware-aware work decomposition that benefits from Tensor Core acceleration while maintaining quantization error reduction.
+
+
+
+![Figure 1: Speedup of HadaCore vs Dao AI Hadamard CUDA kernel. A peak gain of 3.46x on the A100 is achieved using 128 rotation by 8.4M elements.](/assets/images/hadacore/fg1.png){:style="width:100%"}
+
+*Figure 1: Speedup of HadaCore vs Dao AI Hadamard CUDA kernel. A peak gain of 3.46x on the A100 is achieved using 128 rotation by 8.4M elements.*
+
+The [HadaCore Kernel is publicly available](https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/inference/hadamard_transform).
+
+## Background
+
+[QuaRot](https://arxiv.org/abs/2404.00456) and [SpinQuant](https://arxiv.org/abs/2405.16406) both propose methods to increase the numerical accuracy of INT4 and INT8 quantization in LLMs. Both methods rotate model activations since rotations are statistically likely to reduce the magnitude of outliers, as it “distributes” extreme values among other (less extreme) dimensions, and rotation is also an easily invertible operation using the inverse of the rotation matrix. These methods can also improve FP8 inference accuracy, such as in [FlashAttention-3](https://arxiv.org/pdf/2407.08608).
+
+
+![Figure 2. Transformer block showing online (red) and offline rotations (blue) in QuaRot](/assets/images/hadacore/fg2.png){:style="width:100%"}
+
+
+*Figure 2. Transformer block showing online (red) and offline rotations (blue) in QuaRot*
+
+Applying these rotation matrices introduces model runtime overhead due to the online operations shown in Figure 2. These rotations can be applied through matrix multiplication, but the added overhead would diminish the benefits from quantization. Therefore, QuaRot and SpinQuant opt to use Walsh-Hadamard matrices, a special type of rotation matrix that can be applied faster than matrix multiplication using the [Fast Walsh-Hadamard Transform](https://en.wikipedia.org/wiki/Fast_Walsh%E2%80%93Hadamard_transform) algorithm. HadaCore is an optimized implementation of this algorithm for NVIDIA GPUs that support Tensor Cores.
+
+## Tensor Core Accelerated Hadamard Transform
+
+HadaCore leverages [NVIDIA Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/), which are specialized compute units on NVIDIA GPUs optimized for matrix multiplication. To achieve this, our kernel performs a hardware-aware work decomposition of the Fast Walsh-Hadamard algorithm. This work decomposition ensures that we can utilize the [MMA PTX instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=mma#multiply-and-accumulate-instruction-mma) that execute on the Tensor Core chip. HadaCore applies a 16×16 Hadamard transform to chunks of the input data. The computation can then be offloaded to the FP16 Tensor Core with usage of the [mma.m16n8k16](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=mma#matrix-fragments-for-mma-m16n8k16-with-floating-point-type) instruction. The warp-level parallelism for HadaCore is shown below.
+
+
+![Figure 3: HadaCore Parallelization, 1x256 vectors (rows) being rotated by a size 256 Hadamard.](/assets/images/hadacore/fg3.png){:style="width:100%"}
+
+
+*Figure 3: HadaCore Parallelization, 1x256 vectors (rows) being rotated by a size 256 Hadamard.*
+
+We process fragments of 256 elements in parallel using warp-level Tensor Core operations to achieve up to a 256-size Hadamard transform. For further sizes, we shuffle data between warps and repeat.
+
+## Microbenchmarks
+
+We benchmark HadaCore against the[ Dao AI Lab Hadamard Kernel](https://github.com/Dao-AILab) on both NVIDIA H100 and A100 GPUs across varying Hadamard and input tensor sizes.
+
+![Figure 4:  HadaCore Kernel Speedup on NVIDIA A100 over Dao AI Lab Fast Hadamard Kernel](/assets/images/hadacore/fg4.png){:style="width:100%"}
+
+
+
+*Figure 4:  HadaCore Kernel Speedup on NVIDIA A100 over Dao AI Lab Fast Hadamard Kernel*
+
+
+![Color coded Speedup Table for NVIDIA A100, Green = Speedup over Baseline](/assets/images/hadacore/fg5.png){:style="width:100%; margin-top: 35px;"}
+
+
+*Color coded Speedup Table for NVIDIA A100, Green = Speedup over Baseline*
+
+
+![Figure 5:  HadaCore Kernel Speedup on NVIDIA H100 over Dao AI Lab Fast Hadamard Kernel](/assets/images/hadacore/fg6.png){:style="width:100%; margin-top: 35px;"}
+
+
+*Figure 5:  HadaCore Kernel Speedup on NVIDIA H100 over Dao AI Lab Fast Hadamard Kernel*
+
+
+![Color coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline](/assets/images/hadacore/fg7.png){:style="width:100%; margin-top: 35px;"}
+
+
+*Color coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline*
+
+We showcase our speedup as the input tensor size (labeled element count) in our charts increase. Element count is the number of elements in the target matrix we are rotating. For example, in multi-head attention: 
+
+
+The queries (Q), keys (K) and values (V) tensors are 4D tensors of size: 
+
+`(batch_size, seq_len, n_heads, head_dim)`
+
+A Hadamard matrix of size `head_dim` is applied to these activation tensors, so we refer to this as using a Hadamard size of `head_dim` with an element count of:
+
+`batch_size*seq_len*n_heads*head_dim.`
+
+Common element counts for query rotations in an attention block:
+
+
+<table class="table table-bordered">
+  <tr>
+   <td><strong>Model \ Tokens</strong>
+   </td>
+   <td><strong>Prefill</strong>
+   </td>
+   <td><strong>Decoding</strong>
+   </td>
+  </tr>
+  <tr>
+   <td><strong>Llama-2 70b</strong>
+   </td>
+   <td>33,554,432 elements
+<br>
+128 Hadamard size
+<br>
+
+(1 batch * 64 heads * 4096 tokens * 128 dimensional embeddings per head per token)
+   </td>
+   <td>8192 elements
+<br>
+128 Hadamard size
+<br>
+(1 batch * 64 heads * 1 token * 128 dimensional embeddings per head per token)
+   </td>
+  </tr>
+  <tr>
+   <td><strong>Llama-3 8b</strong>
+   </td>
+   <td>33,554,432 elements
+<br>
+128 Hadamard size
+<br>
+(1 batch * 32 heads * 8192 tokens * 128 dimensional embeddings per head per token)
+   </td>
+   <td>4,096 elements
+<br>
+128 Hadamard size
+<br>
+(1 batch * 32 heads * 1 token * 128 dimensional embeddings per head per token)
+   </td>
+  </tr>
+</table>
+
+
+HadaCore achieves **1.1–1.4x** speedup on A100 and **1.0–1.3x** speedup on H100 over Dao AI Lab’s Fast Hadamard kernel, with a peak gain of **3.5x and 3.6x**, respectively. For smaller sizes on H100, HadaCore’s gain decreases. For future work, we plan to incorporate usage of Hopper specific features like TMA and WGMMA for improved H100 performance.
+
+## MMLU Benchmarks
+
+We evaluated MMLU scores on a [Llama 3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) inference workload where the FlashAttention computation was performed in FP8. Newer generation [NVIDIA Hopper GPUs ](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/)come equipped with FP8 Tensor Cores that deliver substantial compute gain over FP16. 
+
+Our results show the benefit of using HadaCore for accuracy preservation when combined with optimizations such as FP8 FlashAttention.
+
+
+<table class="table table-bordered">
+  <tr>
+   <td><strong>Format</strong>
+   </td>
+   <td><strong>Method</strong>
+   </td>
+   <td><strong>Llama3.1-8B</strong>
+<br>
+<strong>Avg. 5-Shot MMLU Accuracy</strong>
+   </td>
+  </tr>
+  <tr>
+   <td><strong>Q, K, V: FP16</strong>
+<br>
+<strong>FlashAttention: FP16</strong>
+   </td>
+   <td>N/A
+   </td>
+   <td>65.38
+   </td>
+  </tr>
+  <tr>
+   <td><strong>Q, K, V: FP16</strong>
+<br>
+<strong>FlashAttention: FP8</strong>
+   </td>
+   <td>No Hadamard
+   </td>
+   <td>64.40
+   </td>
+  </tr>
+  <tr>
+   <td><strong>Q, K, V: FP8</strong>
+<br>
+<strong>FlashAttention: FP8</strong>
+   </td>
+   <td>HadaCore
+   </td>
+   <td>65.09
+   </td>
+  </tr>
+  <tr>
+   <td><strong>Q, K, V: FP8</strong>
+<br>
+<strong>FlashAttention: FP8</strong>
+   </td>
+   <td>Dao AI Fast Hadamard Kernel
+   </td>
+   <td>65.45
+   </td>
+  </tr>
+</table>
+
+
+*Table 1: MMLU scores for Llama3.1 8B with FP16 baseline and FP8 attention using Hadamard transforms, comparing an implementation with explicit Hadamard matrix multiplications vs. HadaCore (**higher is better**)*
+
+From the above MMLU scores, we note that for Llama3.1-8B inference with FP8 attention, HadaCore improves the quantization error introduced from computing attention in a lower precision.
+
+## Conclusion
+
+We showcased our speedups achieved by moving the Fast-Walsh Hadamard algorithm into a CUDA kernel that leverages Tensor Core acceleration and achieves a peak speedup of **3.5x** and **3.6x** over the Dao AI Fast-Hadamard kernel on NVIDIA A100 and H100, respectively. 
+
+Further, we showed on the MMLU benchmark that rotating with HadaCore maintains similar quantization error reduction to the Fast-Hadamard kernel, while providing computational acceleration.
+
+## Future Work
+
+We plan to implement a Triton version of our kernel and experiment with more advanced techniques such as kernel fusion to support fused Hadamard transform and quantization. Further, we plan to extend our kernel to support BF16 Tensor Core compute.