Releases · huggingface/text-generation-inference
v3.3.6
What's Changed
- Add missing backslash by @philsupertramp in #3311
- Revert "feat: bump flake including transformers and huggingface_hub versions" by @drbh in #3323
- fix: remove azure by @drbh in #3325
- Fix mask passed to flashinfer by @danieldk in #3324
- Update iframe sources for streaming demo by @coyotte508 in #3327
- Revert "Revert "feat: bump flake including transformers and huggingfa… by @drbh in #3326
- Revert "feat: bump flake including transformers and huggingface_hub versions" by @drbh in #3330
- Patch version 3.3.6 by @tengomucho in #3329
New Contributors
- @philsupertramp made their first contribution in #3311
- @coyotte508 made their first contribution in #3327
Full Changelog: v3.3.5...v3.3.6
v3.3.5
What's Changed
- [gaudi] Refine rope memory, do not need to keep sin/cos cache per layer by @sywangyi in #3274
- Gaudi: add CI by @baptistecolle in #3160
- [gaudi] Gemma3 sliding window support by @sywangyi in #3280
- xpu lora support by @sywangyi in #3232
- Optimum neuron 0.2.2 by @dacorvo in #3281
- [gaudi] Remove unnecessary reinitialize to HeterogeneousNextTokenChooser to m… by @sywangyi in #3284
- [gaudi] Deepseek v2 mla and add ep to unquantized moe by @sywangyi in #3287
- [gaudi] Fix the CI test errors by @yuanwu2017 in #3286
- Hpu gptq gidx support by @sywangyi in #3297
- Migrate to V2 Pydantic interface by @emmanuel-ferdman in #3262
- Xccl by @sywangyi in #3252
- Multi modality fix by @sywangyi in #3283
- some gptq case could not be handled by ipex. but could be handle by t… by @sywangyi in #3298
- fix outline import issue by @sywangyi in #3282
- HuggingFaceM4/Idefics3-8B-Llama3 crash fix by @sywangyi in #3267
- Optimum neuron 0.3.0 by @tengomucho in #3308
- Disable Cachix pushes by @danieldk in #3312
- chore: prepare version 3.3.5 by @tengomucho in #3314
- feat: bump flake including transformers and huggingface_hub versions by @drbh in #3313
Full Changelog: v3.3.4...v3.3.5
v3.3.4
v3.3.3
Neuron backend update.
What's Changed
- Remove useless packages by @yuanwu2017 in #3253
- Bump neuron SDK version by @dacorvo in #3260
- Perf opt by @sywangyi in #3256
- [gaudi] Vlm rebase and issue fix in benchmark test by @sywangyi in #3263
- Move the _update_cos_sin_cache into get_cos_sin by @yuanwu2017 in #3254
- [Gaudi] Remove optimum-habana by @yuanwu2017 in #3261
- [gaudi] HuggingFaceM4/idefics2-8b issue fix by @sywangyi in #3264
- [Gaudi] Enable Qwen3_moe model by @yuanwu2017 in #3244
- [Gaudi]Fix the integration-test issues by @yuanwu2017 in #3265
- [Gaudi] use pad_token_id to pad input id by @sywangyi in #3268
- chore: prepare release 3.3.3 by @dacorvo in #3269
- [gaudi] Refine logging for Gaudi warmup by @regisss in #3222
- doc: fix README by @dacorvo in #3271
Full Changelog: v3.3.2...v3.3.3
v3.3.2
Gaudi improvements.
What's Changed
- upgrade to new vllm extension ops(fix issue in exponential bucketing) by @sywangyi in #3239
- Nix: switch to hf-nix by @danieldk in #3240
- Add Qwen3 by @yuanwu2017 in #3229
- fp8 compressed_tensors w8a8 support by @sywangyi in #3242
- [Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct by @yuanwu2017 in #3245
- Fix the Llama-4-Maverick-17B-128E crash issue by @yuanwu2017 in #3246
- Prepare for 3.3.2 by @danieldk in #3249
Full Changelog: v3.3.1...v3.3.2
v3.3.1
This release updates TGI to Torch 2.7 and CUDA 12.8.
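A quick way to verify the upgrade from inside the release's container is to query Torch directly (a minimal check, not TGI-specific; the exact patch versions shipped in the image may differ):

```python
import torch

print(torch.__version__)          # expected to report a 2.7.x build
print(torch.version.cuda)         # expected to report 12.8
print(torch.cuda.is_available())  # True when a GPU is visible
```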
What's Changed
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #3217
- adjust the `round_up_seq` logic to align with prefill warmup phase on… by @kaixuanliu in #3224
- Update to Torch 2.7.0 by @danieldk in #3221
- Enable Llama4 for gaudi backend by @yuanwu2017 in #3223
- fix: count gpu uuids if NVIDIA_VISIBLE_DEVICES env set to all by @drbh in #3230
- Deepseek r1 by @sywangyi in #3211
- Refine warmup and upgrade to synapse AI 1.21.0 by @sywangyi in #3234
- fix the crash in default ATTENTION path by @sywangyi in #3235
- Switch to punica-sgmv kernel from the Hub by @danieldk in #3236
- move input_ids to hpu and remove disposal of adapter_meta by @sywangyi in #3237
- Prepare for 3.3.1 by @danieldk in #3238
New Contributors
- @kaixuanliu made their first contribution in #3217
Full Changelog: v3.3.0...v3.3.1
v3.3.0
Notable changes
- Prefill chunking for VLMs.
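Prefill chunking splits a long prompt into fixed-size pieces and runs the prefill pass chunk by chunk, reusing the KV cache built so far, which bounds peak activation memory for large multimodal prompts. A minimal sketch of the idea (illustrative only; `model.prefill` is a hypothetical stand-in, not TGI's internal API):

```python
def chunked_prefill(model, input_ids, chunk_size=512):
    """Prefill a long prompt in fixed-size chunks instead of one pass.

    `model.prefill` is a hypothetical method that consumes a token chunk
    and returns the updated KV cache; TGI's real implementation lives in
    its server internals and also handles image inputs for VLMs.
    """
    past_key_values = None
    for start in range(0, len(input_ids), chunk_size):
        chunk = input_ids[start:start + chunk_size]
        past_key_values = model.prefill(chunk, past_key_values=past_key_values)
    return past_key_values
```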
What's Changed
- Fixing Qwen 2.5 VL (32B). by @Narsil in #3157
- Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in #3156
- Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in #3113
- L4 fixes by @mht-sharma in #3161
- setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in #3171
- transformers flash llm/vlm enabling in ipex by @sywangyi in #3152
- Upgrading the dependencies in Gaudi backend. by @Narsil in #3170
- Hotfixing gaudi deps. by @Narsil in #3174
- Hotfix gaudi2 with newer transformers. by @Narsil in #3176
- Support flashinfer for Gemma3 prefill by @danieldk in #3167
- Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in #2648
- Bump `sccache` to 0.10.0 by @alvarobartt in #3179
- Fixing CI by @Narsil in #3184
- Add option to configure prometheus port by @mht-sharma in #3187 (see the note after this list)
- Warmup gaudi backend by @sywangyi in #3172
- Put more wiggle room. by @Narsil in #3189
- Fixing the router + template for Qwen3. by @Narsil in #3200
- Skip `{% generation %}` and `{% endgeneration %}` template handling by @alvarobartt in #3204
- doc typo by @julien-c in #3206
- Pr 2982 ci branch by @drbh in #3046
- fix: bump snaps for mllama by @drbh in #3202
- Update client SDK snippets by @julien-c in #3207
- Fix `HF_HUB_OFFLINE=1` for Gaudi backend by @regisss in #3193
- IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in #3144
- forward and tokenize chooser use the same shape by @sywangyi in #3196
- Chunked Prefill VLM by @mht-sharma in #3188
- Prepare for 3.3.0 by @danieldk in #3220
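One change above, #3187, makes the Prometheus metrics port configurable rather than fixed. Once the server is running, the metrics endpoint can be scraped like any Prometheus target (a minimal sketch; the port value is an assumption, and the exact launcher flag spelling should be checked with `text-generation-launcher --help`):

```python
import urllib.request

# Assumes TGI was launched with its metrics exposed on port 9090
# (port configured via the option added in #3187).
with urllib.request.urlopen("http://localhost:9090/metrics") as resp:
    metrics = resp.read().decode()

print("\n".join(metrics.splitlines()[:10]))  # first few Prometheus samples
```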
Full Changelog: v3.2.3...v3.3.0
v3.2.3
Main changes
- Patching Llama 4
What's Changed
- Use ROCM 6.3.1 by @mht-sharma in #3141
- Update transformers to 4.51 by @mht-sharma in #3148
- Gaudi: Add Integration Test for Gaudi Backend by @baptistecolle in #3142
- fix: compute type typo by @oOraph in #3150
- 3.2.3 by @Narsil in #3151
Full Changelog: v3.2.2...v3.2.3
v3.2.2
What's Changed
- Minor fixes. by @Narsil in #3125
- configurable termination timeout by @ErikKaum in #3126
- CI: enable server tests for backends by @baptistecolle in #3128
- Torch 2.6 by @Narsil in #3134
- Gaudi: Fix llava-next and mllama crash issue by @yuanwu2017 in #3127
- nix-v3.2.1 -> v3.2.1-nix by @co42 in #3129
- Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE by @yuanwu2017 in #3131
- Add llama4 by @mht-sharma in #3145
- Preparing for release. by @Narsil in #3147
Full Changelog: v3.2.1...v3.2.2
v3.2.1
What's Changed
- Update to `kernels` 0.2.1 by @danieldk in #3084
- Router: add `gemma3-text` model type by @danieldk in #3107
- We need gcc during runtime to enable triton to compile kernels. by @Narsil in #3103
- Release of Gaudi Backend for TGI by @baptistecolle in #3091
- Fixing the docker build. by @Narsil in #3108
- Make the Nix-based Docker container work on non-NixOS by @danieldk in #3109
- xpu 2.6 update by @sywangyi in #3051
- launcher: correctly get the head dimension for VLMs by @danieldk in #3116
- Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork by @baptistecolle in #3117
- Bug Fix: Sliding Window Attention by @mht-sharma in #3112 (see the sketch after the changelog link below)
- Publish nix docker image. by @Narsil in #3122
- Prepare for patch release. by @Narsil in #3124
- Intel docker. by @Narsil in #3121
Full Changelog: v3.2.0...v3.2.1
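For context on the sliding-window fix above (#3112): sliding-window attention lets each query attend only to the most recent `window` key positions rather than the full causal prefix. A minimal mask construction (illustrative only, not TGI's kernel code):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: causal (key <= query) and the key
    # lies within the last `window` positions before the query.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).int())  # lower-triangular band of width 3
```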