> Currently only the basic types + Q4/Q5/Q8 are implemented. K-quants are **not** supported.
NOTE: The other backends may have different support.

| Quant / Type | CUDA | Vulkan |
|--------------|------|--------|
| F32          | ✔️   | ✔️     |
| F16          | ✔️   | ✔️     |
| BF16         | ✔️   | ✔️     |
| I32          | ✔️   | ❌     |
| Q4_0         | ✔️   | ✔️     |
| Q4_1         | ✔️   | ✔️     |
| Q5_0         | ✔️   | ✔️     |
| Q5_1         | ✔️   | ✔️     |
| Q8_0         | ✔️   | ✔️     |
| Q2_K         | ❌   | ❌     |
| Q3_K         | ❌   | ❌     |
| Q4_K         | ❌   | ❌     |
| Q5_K         | ❌   | ❌     |
| Q6_K         | ❌   | ❌     |
| Q8_K         | ❌   | ❌     |
| IQ1_S        | ❌   | ✔️     |
| IQ1_M        | ❌   | ✔️     |
| IQ2_XXS      | ❌   | ✔️     |
| IQ2_XS       | ❌   | ✔️     |
| IQ2_S        | ❌   | ✔️     |
| IQ3_XXS      | ❌   | ✔️     |
| IQ3_S        | ❌   | ✔️     |
| IQ4_XS       | ❌   | ✔️     |
| IQ4_NL       | ❌   | ✔️     |
| MXFP4        | ❌   | ✔️     |
# LoRA Apply Mode
There are two ways to apply LoRA: **immediately** and **at_runtime**. You can select the mode with the `--lora-apply-mode` parameter.
By default, the mode is selected automatically:
* If the model weights contain any quantized parameters, the **at_runtime** mode is used;
* Otherwise, the **immediately** mode is used.
The **immediately** mode may have precision and compatibility issues with quantized parameters, but it usually offers faster inference and, in some cases, lower memory usage. In contrast, the **at_runtime** mode provides better compatibility and higher precision, but inference may be slower and memory usage may be higher in some cases.
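For example, here is a minimal sketch of forcing each mode from the command line. The binary name, model path, and prompt are illustrative placeholders; only `--lora-apply-mode` and its two values come from this page:

```sh
# Force LoRA to be merged into the weights up front (immediately mode):
# usually faster inference, but may lose precision with quantized weights.
./sd -m ./models/model.gguf -p "a lovely cat" --lora-apply-mode immediately

# Force LoRA to be applied during inference (at_runtime mode):
# better compatibility and precision, possibly slower and more memory use.
./sd -m ./models/model.gguf -p "a lovely cat" --lora-apply-mode at_runtime

# Omit the flag entirely to let the mode be selected automatically,
# following the default rule described above.
```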