@@ -34,6 +34,12 @@ models with Triton Inference Server. The [inflight_batcher_llm](./inflight_batch
 directory contains the C++ implementation of the backend supporting inflight
 batching, paged attention and more.

+> [!NOTE]
+>
+> Please note that the Triton backend source code and tests have been moved
+> to [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) under the
+> `triton_backend` directory.
+
 Where can I ask general questions about Triton and Triton backends?
 Be sure to read all the information below as well as the [general
 Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server)
@@ -156,14 +162,14 @@ more details on the parameters.
 Next, create the
 [model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md)
 that will be used by the Triton server. The models can be found in the
-[all_models](./all_models) folder. The folder contains two groups of models:
-- [`gpt`](./all_models/gpt): Using TensorRT-LLM pure Python runtime.
-- [`inflight_batcher_llm`](./all_models/inflight_batcher_llm/): Using the C++
+[all_models](./tensorrt_llm/triton_backend/all_models) folder. The folder contains two groups of models:
+- [`gpt`](./tensorrt_llm/triton_backend/all_models/gpt): Using TensorRT-LLM pure Python runtime.
+- [`inflight_batcher_llm`](./tensorrt_llm/triton_backend/all_models/inflight_batcher_llm/): Using the C++
 TensorRT-LLM backend with the executor API, which includes the latest features
 including inflight batching.

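The chosen template group can be copied into a working model repository before the configuration parameters are filled in. A minimal sketch, assuming a repository directory named `triton_model_repo` (the directory name is an example, not something required by the backend):

```bash
# Copy the inflight_batcher_llm templates into a fresh model repository;
# pass this directory as the model repository when launching Triton.
mkdir -p triton_model_repo
cp -r ./tensorrt_llm/triton_backend/all_models/inflight_batcher_llm/* triton_model_repo/
```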
 There are five models in
-[all_models/inflight_batcher_llm](./all_models/inflight_batcher_llm) that will
+[all_models/inflight_batcher_llm](./tensorrt_llm/triton_backend/all_models/inflight_batcher_llm) that will
 be used in this example:

 | Model | Description |
@@ -291,11 +297,11 @@ Which should return a result similar to (formatted for readability):
 ##### Using the client scripts

 You can refer to the client scripts in the
-[inflight_batcher_llm/client](./inflight_batcher_llm/client) to see how to send
+[inflight_batcher_llm/client](./tensorrt_llm/triton_backend/inflight_batcher_llm/client) to see how to send
 requests via Python scripts.

 Below is an example of using
-[inflight_batcher_llm_client](./inflight_batcher_llm/client/inflight_batcher_llm_client.py)
+[inflight_batcher_llm_client](./tensorrt_llm/triton_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py)
 to send requests to the `tensorrt_llm` model.

 ```bash
@@ -356,9 +362,9 @@
 After launching the server, you could get the output of logits by passing the
 corresponding parameters `--return-context-logits` and/or
 `--return-generation-logits` in the client scripts
-([end_to_end_grpc_client.py](./inflight_batcher_llm/client/end_to_end_grpc_client.py)
+([end_to_end_grpc_client.py](./tensorrt_llm/triton_backend/inflight_batcher_llm/client/end_to_end_grpc_client.py)
 and
-[inflight_batcher_llm_client.py](./inflight_batcher_llm/client/inflight_batcher_llm_client.py)).
+[inflight_batcher_llm_client.py](./tensorrt_llm/triton_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py)).

 For example:

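A minimal sketch of such an invocation, assuming the script takes a prompt via `-p` and an output length via `-o` (those two flags are assumptions; only the logits flags are documented above):

```bash
# Request context and generation logits alongside the generated text.
# -p (prompt) and -o (output length) are assumed flags; check the script's
# --help output for the exact names.
python3 ./tensorrt_llm/triton_backend/inflight_batcher_llm/client/end_to_end_grpc_client.py \
    -p "The capital of France is" \
    -o 5 \
    --return-context-logits \
    --return-generation-logits
```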
@@ -413,7 +419,7 @@ with a given batch index. An output tensor named `batch_index` is associated
 with each response to indicate which batch index this response corresponds to.

 The client script
-[end_to_end_grpc_client.py](./inflight_batcher_llm/client/end_to_end_grpc_client.py)
+[end_to_end_grpc_client.py](./tensorrt_llm/triton_backend/inflight_batcher_llm/client/end_to_end_grpc_client.py)
 demonstrates how a client can send requests with batch size > 1 and consume the
 responses returned from Triton. When passing `--batch-inputs` to the client
 script, the client will create a request with multiple prompts, and use the