Description
When using a GGUF-format model, the number of tokens generated per second drops rapidly as the context length grows; MLX-format models do not show this behavior. My machine is an M3 Ultra. In practice, generation slows down steadily over the course of inference and eventually becomes unusably slow.
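For reference, a minimal sketch of how the slowdown can be observed, assuming a llama-cpp-python backend and a local GGUF model (the model path, prompt, and parameters below are placeholders, not values from this report):

```python
# Hypothetical repro sketch: stream a long generation from a local GGUF model
# and print the rolling tokens/second every 100 tokens, so any slowdown over
# the course of the run becomes visible.
import time
from llama_cpp import Llama

MODEL_PATH = "model.gguf"  # placeholder: any local GGUF model

llm = Llama(model_path=MODEL_PATH, n_ctx=8192, verbose=False)

start = time.time()
count = 0
for chunk in llm("Write a very long story about the ocean.",
                 max_tokens=4096, stream=True):
    count += 1
    if count % 100 == 0:
        elapsed = time.time() - start
        # On the setup described above, this average drops steadily as the
        # context fills; an MLX-format model stays roughly flat.
        print(f"{count} tokens generated, average {count / elapsed:.2f} tok/s")
```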