
Commit 6a749f5

Remove backend files and update documentation (#745)
* Remove backend files
* Update documentation
* Update documentation
* Add a note about build.sh not working

Signed-off-by: Iman Tabrizian <[email protected]>
1 parent c2e65b8 commit 6a749f5

File tree

109 files changed: 163 additions & 27922 deletions


README.md

Lines changed: 15 additions & 9 deletions
@@ -34,6 +34,12 @@ models with Triton Inference Server. The [inflight_batcher_llm](./inflight_batch
 directory contains the C++ implementation of the backend supporting inflight
 batching, paged attention and more.
 
+> [!NOTE]
+>
+> Please note that the Triton backend source code and test have been moved
+> to [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) under the
+> `triton_backend` directory.
+
 Where can I ask general questions about Triton and Triton backends?
 Be sure to read all the information below as well as the [general
 Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server)
@@ -156,14 +162,14 @@ more details on the parameters.
 Next, create the
 [model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md)
 that will be used by the Triton server. The models can be found in the
-[all_models](./all_models) folder. The folder contains two groups of models:
-- [`gpt`](./all_models/gpt): Using TensorRT-LLM pure Python runtime.
-- [`inflight_batcher_llm`](./all_models/inflight_batcher_llm/)`: Using the C++
+[all_models](./tensorrt_llm/triton_backend/all_models) folder. The folder contains two groups of models:
+- [`gpt`](./tensorrt_llm/triton_backend/all_models/gpt): Using TensorRT-LLM pure Python runtime.
+- [`inflight_batcher_llm`](./tensorrt_llm/triton_backend/all_models/inflight_batcher_llm/)`: Using the C++
 TensorRT-LLM backend with the executor API, which includes the latest features
 including inflight batching.
 
 There are five models in
-[all_models/inflight_batcher_llm](./all_models/inflight_batcher_llm) that will
+[all_models/inflight_batcher_llm](./tensorrt_llm/triton_backend/all_models/inflight_batcher_llm) that will
 be used in this example:
 
 | Model | Description |
@@ -291,11 +297,11 @@ Which should return a result similar to (formatted for readability):
 ##### Using the client scripts
 
 You can refer to the client scripts in the
-[inflight_batcher_llm/client](./inflight_batcher_llm/client) to see how to send
+[inflight_batcher_llm/client](./tensorrt_llm/triton_backend/inflight_batcher_llm/client) to see how to send
 requests via Python scripts.
 
 Below is an example of using
-[inflight_batcher_llm_client](./inflight_batcher_llm/client/inflight_batcher_llm_client.py)
+[inflight_batcher_llm_client](./tensorrt_llm/triton_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py)
 to send requests to the `tensorrt_llm` model.
 
 ```bash
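
Conceptually, those client scripts are thin wrappers around the `tritonclient` gRPC API. The snippet below is a minimal sketch of that flow, not code from the repository; the server address and the `text_input`/`max_tokens`/`text_output` tensor names are assumptions that depend on the model repository actually deployed.

```python
# Minimal sketch of a Triton gRPC request, in the spirit of the client
# scripts referenced above. Assumptions: server at localhost:8001 and an
# "ensemble"-style model exposing text_input/max_tokens/text_output tensors.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# One prompt and a token budget, shaped [batch, 1].
text = np.array([["What is machine learning?"]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    grpcclient.InferInput("text_input", list(text.shape), "BYTES"),
    grpcclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

outputs = [grpcclient.InferRequestedOutput("text_output")]

# Synchronous request; the generated text comes back as a BYTES tensor.
result = client.infer(model_name="ensemble", inputs=inputs, outputs=outputs)
print(result.as_numpy("text_output"))
```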
@@ -356,9 +362,9 @@ or
 After launching the server, you could get the output of logits by passing the
 corresponding parameters `--return-context-logits` and/or
 `--return-generation-logits` in the client scripts
-([end_to_end_grpc_client.py](./inflight_batcher_llm/client/end_to_end_grpc_client.py)
+([end_to_end_grpc_client.py](./tensorrt_llm/triton_backend/inflight_batcher_llm/client/end_to_end_grpc_client.py)
 and
-[inflight_batcher_llm_client.py](./inflight_batcher_llm/client/inflight_batcher_llm_client.py)).
+[inflight_batcher_llm_client.py](./tensorrt_llm/triton_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py)).
 
 For example:
 
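
For illustration, here is a hedged Python sketch of the same request path, assuming the deployed model exposes a boolean `return_generation_logits` input and a `generation_logits` output mirroring the client flag above (the linked client scripts are the authoritative reference):

```python
# Sketch: asking for generation logits alongside the generated text.
# The return_generation_logits / generation_logits tensor names are
# assumptions mirroring the CLI flags; check the deployed model's config.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

text = np.array([["What is machine learning?"]], dtype=object)
max_tokens = np.array([[16]], dtype=np.int32)
want_logits = np.array([[True]], dtype=bool)

inputs = [
    grpcclient.InferInput("text_input", list(text.shape), "BYTES"),
    grpcclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
    grpcclient.InferInput("return_generation_logits", list(want_logits.shape), "BOOL"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)
inputs[2].set_data_from_numpy(want_logits)

result = client.infer(model_name="ensemble", inputs=inputs)
# Logits for each generated token, if the model is configured to return them.
print(result.as_numpy("generation_logits"))
```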
@@ -413,7 +419,7 @@ with a given batch index. An output tensor named `batch_index` is associated
 with each response to indicate which batch index this response corresponds to.
 
 The client script
-[end_to_end_grpc_client.py](./inflight_batcher_llm/client/end_to_end_grpc_client.py)
+[end_to_end_grpc_client.py](./tensorrt_llm/triton_backend/inflight_batcher_llm/client/end_to_end_grpc_client.py)
 demonstrates how a client can send requests with batch size > 1 and consume the
 responses returned from Triton. When passing `--batch-inputs` to the client
 script, the client will create a request with multiple prompts, and use the
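
The batched flow described in this last hunk can be sketched with the streaming `tritonclient` API. Again, this is an illustrative sketch rather than the repository's client: the `ensemble` model name and the `text_input`/`max_tokens`/`batch_index`/`text_output` tensor names are assumptions, and a decoupled model may stream several responses per prompt.

```python
# Sketch: one request carrying two prompts; each streamed response carries a
# batch_index tensor saying which prompt it answers. Names are assumptions.
import queue
import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def on_response(result, error):
    # tritonclient invokes this callback once per streamed response.
    responses.put(error if error else result)

prompts = ["What is ML?", "What is TensorRT?"]
text = np.array([[p] for p in prompts], dtype=object)        # shape [2, 1]
max_tokens = np.full((len(prompts), 1), 32, dtype=np.int32)  # shape [2, 1]

inputs = [
    grpcclient.InferInput("text_input", list(text.shape), "BYTES"),
    grpcclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

with grpcclient.InferenceServerClient(url="localhost:8001") as client:
    client.start_stream(callback=on_response)
    client.async_stream_infer(model_name="ensemble", inputs=inputs)
    client.stop_stream()  # close the stream and wait for the handler to drain

while not responses.empty():
    r = responses.get()
    if isinstance(r, Exception):
        raise r
    idx = int(r.as_numpy("batch_index").flatten()[0])
    print(idx, r.as_numpy("text_output"))
```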

all_models/disaggregated_serving/README.md

Lines changed: 0 additions & 123 deletions
This file was deleted.

all_models/disaggregated_serving/disaggregated_serving_bls/1/model.py

Lines changed: 0 additions & 138 deletions
This file was deleted.
