From 4397459ab00c71653003371bbf67476a86298859 Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Thu, 26 Sep 2024 15:55:19 -0500
Subject: [PATCH 1/4] space addition

---
 _posts/2024-09-26-pytorch-native-architecture-optimization.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/_posts/2024-09-26-pytorch-native-architecture-optimization.md b/_posts/2024-09-26-pytorch-native-architecture-optimization.md
index fcf5122e970e..4ded3d1d32d9 100644
--- a/_posts/2024-09-26-pytorch-native-architecture-optimization.md
+++ b/_posts/2024-09-26-pytorch-native-architecture-optimization.md
@@ -4,6 +4,7 @@ title: "PyTorch Native Architecture Optimization: torchao"
 author: Team PyTorch
 ---
+
 We’re happy to officially launch torchao, a PyTorch native library that makes models faster and smaller by leveraging low bit dtypes, quantization and sparsity. [torchao](https://github.com/pytorch/ao) is an accessible toolkit of techniques written (mostly) in easy to read PyTorch code spanning both inference and training. This blog will help you pick which techniques matter for your workloads. We benchmarked our techniques on popular GenAI models like LLama 3 and Diffusion models and saw minimal drops in accuracy. Unless otherwise noted the baselines are bf16 run on A100 80GB GPU.

From bafccee45474ae567d08ea5523d7ba002a5005a8 Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Thu, 26 Sep 2024 22:56:58 -0500
Subject: [PATCH 2/4] add image 3, code color test

---
 .../2024-09-26-pytorch-native-architecture-optimization.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/_posts/2024-09-26-pytorch-native-architecture-optimization.md b/_posts/2024-09-26-pytorch-native-architecture-optimization.md
index 4ded3d1d32d9..6d0a7d698f62 100644
--- a/_posts/2024-09-26-pytorch-native-architecture-optimization.md
+++ b/_posts/2024-09-26-pytorch-native-architecture-optimization.md
@@ -44,6 +44,7 @@ model \= torchao.autoquant(torch.compile(model, mode='max-autotune'))
 quantize\_ API has a few different options depending on whether your model is compute bound or memory bound.
+```py
 from torchao.quantization import (
     \# Memory bound models
     int4\_weight\_only,
@@ -57,7 +58,7 @@ from torchao.quantization import (
     float8\_weight\_only,
     float8\_dynamic\_activation\_float8\_weight,
 )
-
+```
 We also have extensive benchmarks on diffusion models in collaboration with the HuggingFace diffusers team in [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao) where we demonstrated 53.88% speedup on Flux.1-Dev and 27.33% speedup on CogVideoX-5b
@@ -73,7 +74,7 @@ But also can do things like quantize weights to int4 and the kv cache to int8 to
 Post training quantization, especially at less than 4 bit can suffer from serious accuracy degradations. Using [Quantization Aware Training](https://pytorch.org/blog/quantization-aware-training/) (QAT) we’ve managed to recover up to 96% of the accuracy degradation on hellaswag. We’ve integrated this as an end to end recipe in torchtune with a minimal [tutorial](https://github.com/pytorch/ao/tree/main/torchao/quantization/prototype/qat)
-![](/assets/images/Figure_3.png){:style="width:100%"}
+![](/assets/images/Figure_3.jpg){:style="width:100%"}
 # Training
@@ -116,8 +117,6 @@ We’ve been actively working on making sure torchao works well in some of the m
 5. In [torchchat](https://github.com/pytorch/torchchat) for post training quantization
 6. In SGLang for for [int4 and int8 post training quantization](https://github.com/sgl-project/sglang/pull/1341)
-#
-
 ## Conclusion
 If you’re interested in making your models faster and smaller for training or inference, we hope you’ll find torchao useful and easy to integrate.

From 5f86ce9da01b68a722e0c782dd327ba9547bc63d Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Thu, 26 Sep 2024 23:01:56 -0500
Subject: [PATCH 3/4] finished code coloring

---
 _posts/2024-09-26-pytorch-native-architecture-optimization.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/_posts/2024-09-26-pytorch-native-architecture-optimization.md b/_posts/2024-09-26-pytorch-native-architecture-optimization.md
index 6d0a7d698f62..3a831f5d867c 100644
--- a/_posts/2024-09-26-pytorch-native-architecture-optimization.md
+++ b/_posts/2024-09-26-pytorch-native-architecture-optimization.md
@@ -32,15 +32,19 @@ Below we'll walk through some of the techniques available in torchao you can app
 [Our inference quantization algorithms](https://github.com/pytorch/ao/tree/main/torchao/quantization) work over arbitrary PyTorch models that contain nn.Linear layers. Weight only and dynamic activation quantization for various dtypes and sparse layouts can be chosen using our top level quantize\_ api
+```py
 from torchao.quantization import (
     quantize\_,
     int4\_weight\_only,
 )
 quantize\_(model, int4\_weight\_only())
+```
 Sometimes quantizing a layer can make it slower because of overhead so if you’d rather we just pick how to quantize each layer in a model for you then you can instead run
+```py
 model \= torchao.autoquant(torch.compile(model, mode='max-autotune'))
+```
 quantize\_ API has a few different options depending on whether your model is compute bound or memory bound.

From 542980df018a09ef35dd697a5e326cd22322b463 Mon Sep 17 00:00:00 2001
From: Andrew Bringaze Linux Foundation
Date: Fri, 27 Sep 2024 11:23:55 -0500
Subject: [PATCH 4/4] code comment fix

---
 .../2024-09-26-pytorch-native-architecture-optimization.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/_posts/2024-09-26-pytorch-native-architecture-optimization.md b/_posts/2024-09-26-pytorch-native-architecture-optimization.md
index 3a831f5d867c..dd901a5a2517 100644
--- a/_posts/2024-09-26-pytorch-native-architecture-optimization.md
+++ b/_posts/2024-09-26-pytorch-native-architecture-optimization.md
@@ -50,15 +50,15 @@ quantize\_ API has a few different options depending on whether your model is co
 ```py
 from torchao.quantization import (
-    \# Memory bound models
+    # Memory bound models
     int4\_weight\_only,
     int8\_weight\_only,
-    \# Compute bound models
+    # Compute bound models
     int8\_dynamic\_activation\_int8\_semi\_sparse\_weight,
     int8\_dynamic\_activation\_int8\_weight,
-    \# Device capability 8.9+
+    # Device capability 8.9+
     float8\_weight\_only,
     float8\_dynamic\_activation\_float8\_weight,
 )
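
The import list touched by this last patch groups `int4_weight_only` and `int8_weight_only` under "Memory bound models". For intuition about what a weight-only scheme does, here is a minimal pure-Python sketch of symmetric per-row int8 weight-only quantization. It is an illustration under stated assumptions, not torchao's implementation: the helper names `quantize_row_int8`, `dequantize_row_int8` and `linear_weight_only` are hypothetical, the per-row symmetric scale is one common choice among several, and real kernels operate on tensors rather than Python lists.

```python
# Illustrative sketch only. These helpers are hypothetical and are NOT
# part of torchao; they show the idea behind int8 weight-only quantization.


def quantize_row_int8(row):
    """Map a list of float weights to int8 codes plus one float scale."""
    max_abs = max(abs(w) for w in row)
    # Symmetric scheme: the largest magnitude maps to +/-127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row_int8(codes, scale):
    """Recover approximate float weights from int8 codes and the row scale."""
    return [c * scale for c in codes]


def linear_weight_only(x, q_rows):
    """Compute y = W x where W is stored as (codes, scale) rows.

    Only the weights are quantized; the activations x stay in float.
    """
    return [
        scale * sum(c * xi for c, xi in zip(codes, x))
        for codes, scale in q_rows
    ]


# Toy 2x3 weight matrix and a float activation vector.
weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
q_rows = [quantize_row_int8(row) for row in weights]
x = [1.0, 2.0, 3.0]

y_quant = linear_weight_only(x, q_rows)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Report how far the quantized result drifts from the float reference.
max_err = max(abs(a - b) for a, b in zip(y_quant, y_ref))
print("max abs error vs float reference:", max_err)
```

Each weight is stored in 8 bits plus one scale per row, so less weight data has to be read from memory, which is why such schemes are typically suggested for memory bound models. The cost is a rounding error of at most half a scale step per weight.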