The `execute` function is called whenever an inference request is made. Every Python
model must implement the `execute` function. In the `execute` function you are given
a list of `InferenceRequest` objects. There are two modes of implementing this
function, and the mode you choose should depend on your use case: whether or not
you want to return decoupled responses from this model.
#### Default Mode

This is the most generic way to implement your model and requires the
`execute` function to return exactly one response per request. In this mode,
your `execute` function must return a list of `InferenceResponse` objects that
has the same length as `requests`. The workflow in this mode is:
* The `execute` function receives a batch of pb_utils.InferenceRequest objects as a
  length N array.

* Perform inference on the pb_utils.InferenceRequest objects and append the
  corresponding pb_utils.InferenceResponse objects to a response list.

* Return the response list.

* The length of the returned response list must be N.

* Each element in the list should be the response for the corresponding
  element in the request array.

* Each element must contain a response (a response can be either output
  tensors or an error); an element cannot be None.
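As an illustration, a minimal sketch of a default-mode `execute` that simply
echoes its input; the tensor names `IN` and `OUT` are assumptions about the
model configuration, not part of the API:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def execute(self, requests):
        responses = []
        for request in requests:
            # "IN" and "OUT" are assumed tensor names from the model config.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "IN")
            out_tensor = pb_utils.Tensor("OUT", in_tensor.as_numpy())
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        # Exactly one response per request, in the same order as `requests`.
        return responses
```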
Triton checks that the above requirements on the response list are satisfied
and, if they are not, returns an error response for all inference requests.
Upon return from the `execute` function, all tensor data associated with the
InferenceRequest objects passed to the function are deleted, so
InferenceRequest objects should not be retained by the Python model.

In case one of the requests has an error, you can use the `TritonError` object
to set the error message for that specific request. Below is an example of
setting errors for an `InferenceResponse` object:
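The sketch below assumes a hypothetical `_run_inference` helper standing in
for the model's own inference code:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def execute(self, requests):
        responses = []
        for request in requests:
            try:
                # _run_inference is a hypothetical helper; it returns the
                # output pb_utils.Tensor objects for one request.
                output_tensors = self._run_inference(request)
                responses.append(
                    pb_utils.InferenceResponse(output_tensors=output_tensors))
            except Exception as error:
                # Attach a per-request error instead of failing the whole batch.
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError(str(error))))
        return responses
```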
#### Decoupled mode \[Beta\]

This mode allows the user to send multiple responses for a request, or to send
no responses for a request at all. A model may also send responses out-of-order
relative to the order in which the request batches are executed. Such models
are called *decoupled* models. In order to use this mode, the
[transaction policy](https://github.com/triton-inference-server/server/docs/model_configuration.md#model-transaction-policy)
in the model configuration must be set to decoupled.
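For reference, a sketch of the relevant fragment of the model's `config.pbtxt`
(the rest of the configuration is omitted):

```
model_transaction_policy {
  decoupled: true
}
```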
In decoupled mode, the model must use one `InferenceResponseSender` object per
request to create and send any number of responses for that request. The
workflow in this mode may look like:
* The `execute` function receives a batch of pb_utils.InferenceRequest objects as a
  length N array.

* Iterate through each pb_utils.InferenceRequest and perform the following
  steps for each pb_utils.InferenceRequest object:

  1. Get the `InferenceResponseSender` object for the InferenceRequest using
     InferenceRequest.get_response_sender().

  2. Create and populate the pb_utils.InferenceResponse to be sent back.

  3. Use InferenceResponseSender.send() to send the above response. When
     sending the final response for a request, pass
     pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL as a flag with
     InferenceResponseSender.send(). Otherwise, continue with Step 2 to create
     and send further responses for the same request.

* The return value for the `execute` function in this mode should be None.
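For example, a minimal sketch of a decoupled `execute` that sends a single
response per request; the tensor names `IN` and `OUT` are assumptions, and a
real decoupled model may send any number of responses before the final flag:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def execute(self, requests):
        for request in requests:
            response_sender = request.get_response_sender()
            # "IN" and "OUT" are assumed tensor names from the model config.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "IN")
            out_tensor = pb_utils.Tensor("OUT", in_tensor.as_numpy())
            response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
            # Mark this as the final response for this request.
            response_sender.send(
                response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models must not return responses from execute.
        return None
```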
Similar to the default mode, in case one of the requests has an error, you can
use the `TritonError` object to set the error message for that specific
request. After setting errors for a pb_utils.InferenceResponse object, use
InferenceResponseSender.send() to send the response with the error back to the
user.
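For instance, assuming `response_sender` was obtained from
`request.get_response_sender()`, a sketch of reporting an error for that
request:

```python
error = pb_utils.TritonError("An error occurred while handling this request")
response_sender.send(
    pb_utils.InferenceResponse(output_tensors=[], error=error),
    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```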
##### Use Cases

The decoupled mode is powerful and supports various other use cases:
* If the model should not send any response for a request, call
  InferenceResponseSender.send() with no response but with the flag parameter
  set to pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL.

* The model can also send responses out-of-order relative to the order in which
  the requests were received.

* The request data and the `InferenceResponseSender` object can be passed to a
  separate thread in the model. This means the main caller thread can exit from
  the `execute` function and the model can still continue generating responses
  as long as it holds the `InferenceResponseSender` object, as sketched below.
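As a rough sketch of the last point, the sender can be handed to a worker
thread that keeps producing responses after `execute` returns; the `IN`/`OUT`
tensor names and the per-row chunking are illustrative assumptions:

```python
import threading

import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def execute(self, requests):
        for request in requests:
            response_sender = request.get_response_sender()
            data = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()
            # Hand off to a worker thread; execute can return before any
            # responses for this request have been sent.
            threading.Thread(
                target=self._generate, args=(response_sender, data)).start()
        return None

    def _generate(self, response_sender, data):
        # Send one response per row of the input as they become available.
        for row in data:
            out_tensor = pb_utils.Tensor("OUT", row.reshape(1, -1))
            response_sender.send(
                pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        # No more responses for this request: send only the final flag.
        response_sender.send(
            flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```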
The [decoupled examples](examples/decoupled/README.md) demonstrate the full
power of what can be achieved with the decoupled API. Read
[Decoupled Backends and Models](https://github.com/triton-inference-server/server/blob/main/docs/decoupled_models.md)
for more details on how to host a decoupled model.
##### Known Issues

The support for decoupled models is still in beta and has the following known
issues:
* The decoupled mode doesn't support the
  [FORCE_CPU_ONLY_INPUT_TENSORS](#input-tensor-device-placement) parameter
  being turned off. This means that the input tensors will always be in CPU
  memory.

* Currently, the InferenceResponseSender.send method only supports
  inference_response objects that contain only CPU tensors.

* The metrics collection may be incomplete.
### `finalize`

Implementing `finalize` is optional. This function allows you to do any clean
ups necessary before the model is unloaded from Triton server.
do not create a circular dependency. For example, if model A performs an
inference request on itself and there are no more model instances ready to
execute the inference request, the model will block on the inference execution
forever.

- Currently, BLS cannot run inference on a decoupled model.
# Interoperability and GPU Support
Starting from 21.09 release, Python backend supports
You can find the complete example instructions in [examples/bls](examples/bls/RE
The Preprocessing example shows how to use Python Backend to do model preprocessing.
You can find the complete example instructions in [examples/preprocessing](examples/preprocessing/README.md).
## Decoupled Models

The examples of decoupled models show how to develop and serve
[decoupled models](../../README.md#decoupled-mode-beta) in Triton using the Python backend.
You can find the complete example instructions in [examples/decoupled](examples/decoupled/README.md).
# Running with Inferentia
Please see the [README.md](https://github.com/triton-inference-server/python_backend/tree/main/inferentia/README.md) located in the python_backend/inferentia sub folder.