
Commit 2936d2e

fix image source (#1730)
1 parent fd7e1a3 commit 2936d2e

1 file changed (+5 -5):

_posts/2024-09-25-pytorch-native-architecture-optimizaion.md
@@ -61,20 +61,20 @@ from torchao.quantization import (
float8_dynamic_activation_float8_weight,
)

-![](/assets/images/hopper-tma-unit/Figure_1.png){:style="width:100%"}
+![](/assets/images/Figure_1.png){:style="width:100%"}

We also have extensive benchmarks on diffusion models in collaboration with the HuggingFace diffusers team in [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao.) where we demonstrated 53.88% speedup on Flux.1-Dev and 27.33% speedup on CogVideoX-5b

Our APIs are composable so we’ve for example composed sparsity and quantization to bring 5% [speedup for ViT-H inference](https://github.com/pytorch/ao/tree/main/torchao/sparsity)

But also can do things like quantize weights to int4 and the kv cache to int8 to support [Llama 3.1 8B at the full 128K context length running in under 18.9GB of VRAM](https://github.com/pytorch/ao/pull/738).
-![](/assets/images/hopper-tma-unit/Figure_2.png){:style="width:100%"}
+![](/assets/images/Figure_2.png){:style="width:100%"}

## QAT

Post training quantization, especially at less than 4 bit can suffer from serious accuracy degradations. Using [Quantization Aware Training](https://pytorch.org/blog/quantization-aware-training/) (QAT) we’ve managed to recover up to 96% of the accuracy degradation on hellaswag. We’ve integrated this as an end to end recipe in torchtune with a minimal [tutorial](https://github.com/pytorch/ao/tree/main/torchao/quantization/prototype/qat)

-![](/assets/images/hopper-tma-unit/Figure_3.png){:style="width:100%"}
+![](/assets/assets/Figure_3.png){:style="width:100%"}

# Training

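For context, a minimal sketch of how the post-training quantization configs referenced in the hunk above are typically applied via torchao's quantize_ entry point; the toy linear model below is an assumption for illustration, not part of the post or this commit:

import torch
from torchao.quantization import quantize_, int4_weight_only, float8_dynamic_activation_float8_weight

# Toy model purely for illustration; the post applies these configs to Llama 3.1 8B and ViT-H.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# One call swaps eligible nn.Linear layers in place for quantized variants.
quantize_(model, float8_dynamic_activation_float8_weight())

# Weight-only int4, as used for the 128K-context Llama 3.1 example, would instead be:
# quantize_(model, int4_weight_only())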

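Similarly, the QAT flow mentioned in the hunk above follows a prepare/train/convert pattern; a rough sketch based on the linked QAT recipe, where the quantizer class name and the toy model are assumptions drawn from that tutorial rather than from this commit:

import torch
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Toy stand-in for the model being fine-tuned.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))

qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# prepare() inserts fake-quantize ops so training sees the int8-activation / int4-weight error.
model = qat_quantizer.prepare(model)

# ... fine-tune as usual (the post points to the torchtune QAT recipe) ...

# convert() swaps the fake-quantized layers for actually quantized ones.
model = qat_quantizer.convert(model)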
@@ -89,7 +89,7 @@ For an e2e example of how to speed up LLaMa 3 70B pretraining by up to **1.5x**

### Performance and accuracy of float8 pretraining of LLaMa 3 70B, vs bfloat16

-![](/assets/images/hopper-tma-unit/Figure_4.png){:style="width:100%"}
+![](/assets/images/Figure_4.png){:style="width:100%"}
(source: [https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359))

We are expanding our training workflows to more dtypes and layouts
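
As a rough sketch of the float8 pretraining flow benchmarked in the hunk above: the conversion is applied before compiling and wrapping the model. The helper name convert_to_float8_training and the toy model below are assumptions based on torchao's float8 training API, not part of this commit:

import torch
from torchao.float8 import convert_to_float8_training

# Toy stand-in for the transformer being pretrained; the benchmark above is LLaMa 3 70B under FSDP2.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.Linear(4096, 4096)).cuda().to(torch.bfloat16)

# Swap nn.Linear layers for float8 training variants so matmuls run in float8 with dynamic scaling.
convert_to_float8_training(model)

# torch.compile picks up the float8 kernels; FSDP2 wrapping would follow for multi-GPU pretraining.
model = torch.compile(model)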
@@ -104,7 +104,7 @@ Inspired by Bits and Bytes we’ve also added prototype support for 8 and 4 bit

from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit
optim = AdamW8bit(model.parameters())
-![](/assets/images/hopper-tma-unit/Figure_5.png){:style="width:100%"}
+![](/assets/images/Figure_5.png){:style="width:100%"}

# Integrations

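To round out the two-line optimizer snippet in the hunk above, a minimal training-step sketch showing where the low-bit optimizer slots in; the toy model, data, and loss are assumptions for illustration only:

import torch
from torchao.prototype.low_bit_optim import AdamW8bit

# Toy model and batch purely for illustration.
model = torch.nn.Linear(1024, 1024).cuda()
optim = AdamW8bit(model.parameters())  # optimizer state is kept in 8 bits instead of 32

x = torch.randn(16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optim.step()       # drop-in replacement for torch.optim.AdamW
optim.zero_grad()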
