Commit 841316b

Add async BLS documentation (triton-inference-server#78)
* Add async BLS documentation
* Review edits
1 parent 71f2828 commit 841316b

8 files changed: +411 −30 lines changed

README.md

Lines changed: 58 additions & 5 deletions
@@ -363,8 +363,9 @@ above. However, it is important to see `libpython3.6m.so.1.0` in the list of
 linked shared libraries. If you use a different Python version, you should see
 that version instead. You need to copy the `triton_python_backend_stub` to the
 model directory of the models that want to use the custom Python backend
-stub. For example, if you have `model_a` in your [model repository](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md), the folder
-structure should look like below:
+stub. For example, if you have `model_a` in your
+[model repository](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md),
+the folder structure should look like below:
 
 ```
 models
@@ -537,10 +538,62 @@ class TritonPythonModel:
 
     # Decide the next steps for model execution based on the received output
     # tensors. It is possible to use the same output tensors to for the final
-    # inference resposne too.
+    # inference response too.
 ```
 
-A complete example for BLS in Python backend is included in the
+
+In addition to the `inference_request.exec` function that allows you to
+execute blocking inference requests, `inference_request.async_exec` allows
+you to perform async inference requests. This can be useful when you do not
+need the result of the inference immediately. Using the `async_exec` function,
+it is possible to have multiple in-flight inference requests and wait for the
+responses only when needed. The example below shows how to use `async_exec`:
+
+```python
+import triton_python_backend_utils as pb_utils
+import asyncio
+
+
+class TritonPythonModel:
+    ...
+
+    # You must add the Python 'async' keyword to the beginning of the
+    # `execute` function if you want to use the `async_exec` function.
+    async def execute(self, requests):
+        ...
+        # Create an InferenceRequest object. `model_name`,
+        # `requested_output_names`, and `inputs` are the required arguments
+        # and must be provided when constructing an InferenceRequest object.
+        # Make sure to replace the `inputs` argument with a list of
+        # `pb_utils.Tensor` objects.
+        inference_request = pb_utils.InferenceRequest(
+            model_name='model_name',
+            requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'],
+            inputs=[<pb_utils.Tensor object>])
+
+        inference_response_awaits = []
+        for i in range(4):
+            # The async_exec function returns an
+            # [Awaitable](https://docs.python.org/3/library/asyncio-task.html#awaitables)
+            # object.
+            inference_response_awaits.append(inference_request.async_exec())
+
+        # Wait for all of the inference requests to complete.
+        inference_responses = await asyncio.gather(*inference_response_awaits)
+
+        for inference_response in inference_responses:
+            # Check if the inference response has an error.
+            if inference_response.has_error():
+                raise pb_utils.TritonModelException(
+                    inference_response.error().message())
+            else:
+                # Extract the output tensors from the inference response.
+                output1 = pb_utils.get_output_tensor_by_name(
+                    inference_response, 'REQUESTED_OUTPUT_1')
+                output2 = pb_utils.get_output_tensor_by_name(
+                    inference_response, 'REQUESTED_OUTPUT_2')
+
+        # Decide the next steps for model execution based on the received
+        # output tensors.
+```
+
+A complete example for sync and async BLS in Python backend is included in the
 [Examples](#examples) section.
 
 ## Limitations
@@ -561,7 +614,7 @@ For using the Triton Python client in these examples you need to install
 the [Triton Python Client Library](https://github.com/triton-inference-server/client#getting-the-client-libraries-and-examples).
 The Python client for each of the examples is in the `client.py` file.
 
-## AddSub in Numpy
+## AddSub in NumPy
 
 There is no dependencies required for the AddSub numpy example. Instructions
 on how to use this model is explained in the quick start section. You can
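
Note: the hunk above documents `async_exec` alongside the blocking `exec` call that the earlier part of the README covers. For comparison, here is a minimal sketch of the blocking flow; it is an illustration only, not part of this commit, and the model name and output name are placeholders borrowed from the documentation snippet above.

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def execute(self, requests):
        responses = []
        for request in requests:
            # Build a BLS request against another model in the repository.
            # 'model_name' and 'REQUESTED_OUTPUT_1' are placeholders, not a
            # real deployment.
            inference_request = pb_utils.InferenceRequest(
                model_name='model_name',
                requested_output_names=['REQUESTED_OUTPUT_1'],
                inputs=request.inputs())

            # exec() blocks until the response is ready; async_exec() would
            # return an awaitable instead.
            inference_response = inference_request.exec()
            if inference_response.has_error():
                raise pb_utils.TritonModelException(
                    inference_response.error().message())

            # Pass the downstream output through as the final response.
            output1 = pb_utils.get_output_tensor_by_name(
                inference_response, 'REQUESTED_OUTPUT_1')
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output1]))
        return responses
```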

examples/bls/README.md

Lines changed: 74 additions & 21 deletions
@@ -28,32 +28,38 @@
 
 # BLS Example
 
-In this example we demonstrate an end-to-end example for
+In this section we demonstrate an end-to-end example for
 [BLS](../../README.md#business-logic-scripting-beta) in Python backend. The
 [model repository](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md)
-should contain [PyTorch](../pytorch), [AddSub](../add_sub), and [BLS](../bls) models.
-The [PyTorch](../pytorch) and [AddSub](../add_sub) models
-calculate the sum and difference of the `INPUT0` and `INPUT1` and put the
-results in `OUTPUT0` and `OUTPUT1` respectively. The goal of the BLS model is
-the same as [PyTorch](../pytorch) and [AddSub](../add_sub) models but the
-difference is that the BLS model will not calculate the sum and difference by
-itself. The BLS model will pass the input tensors to the [PyTorch](../pytorch)
-or [AddSub](../add_sub) models and return the responses of that model as the
-final response. The additional parameter `MODEL_NAME` determines which model
-will be used for calculating the final outputs.
+should contain the [pytorch](../pytorch) and [addsub](../add_sub) models. The
+[pytorch](../pytorch) and [addsub](../add_sub) models calculate the sum and
+difference of the `INPUT0` and `INPUT1` and put the results in `OUTPUT0` and
+`OUTPUT1` respectively. This example is broken into two sections. The first
+section demonstrates how to perform synchronous BLS requests and the second
+section shows how to execute asynchronous BLS requests.
+
+## Synchronous BLS Requests
+
+The goal of the sync BLS model is the same as the [pytorch](../pytorch) and
+[addsub](../add_sub) models, but the difference is that the BLS model will not
+calculate the sum and difference by itself. The sync BLS model will pass the
+input tensors to the [pytorch](../pytorch) or [addsub](../add_sub) models and
+return the responses of that model as the final response. The additional
+parameter `MODEL_NAME` determines which model will be used for calculating the
+final outputs.
 
 1. Create the model repository:
 
 ```console
 $ mkdir -p models/add_sub/1
-$ mkdir -p models/bls/1
+$ mkdir -p models/bls_sync/1
 $ mkdir -p models/pytorch/1
 
 # Copy the Python models
 $ cp examples/add_sub/model.py models/add_sub/1/
-$ cp examples/add_sub/config.pbtxt models/add_sub/
-$ cp examples/bls/model.py models/bls/1/
-$ cp examples/bls/config.pbtxt models/bls/
+$ cp examples/add_sub/config.pbtxt models/add_sub/config.pbtxt
+$ cp examples/bls/sync_model.py models/bls_sync/1/model.py
+$ cp examples/bls/sync_config.pbtxt models/bls_sync/config.pbtxt
 $ cp examples/pytorch/model.py models/pytorch/1/
 $ cp examples/pytorch/config.pbtxt models/pytorch/
 ```
@@ -67,7 +73,7 @@ tritonserver --model-repository `pwd`/models
 3. Send inference requests to server:
 
 ```
-python3 examples/bls/client.py
+python3 examples/bls/sync_client.py
 ```
 
 You should see an output similar to the output below:
@@ -90,15 +96,62 @@ At:
 /tmp/python_backend/models/bls/1/model.py(110): execute
 ```
 
-The [bls](./model.py) model file is heavily commented with explanations about
-each of the function calls.
+The [sync_model.py](./sync_model.py) model file is heavily commented with
+explanations about each of the function calls.
 
-## Explanation of the Client Output
+### Explanation of the Client Output
 
-The [client.py](./client.py) sends three inference requests to the 'bls'
+The [sync_client.py](./sync_client.py) sends three inference requests to the 'bls_sync'
 model with different values for the "MODEL_NAME" input. As explained earlier,
 "MODEL_NAME" determines the model name that the "bls" model will use for
 calculating the final outputs. In the first request, it will use the "add_sub"
-model and in the seceond request it will use the "pytorch" model. The third
+model and in the second request it will use the "pytorch" model. The third
 request uses an incorrect model name to demonstrate error handling during
 the inference request execution.
+
+## Asynchronous BLS Requests
+
+In this section we explain how to send multiple BLS requests without waiting
+for their responses. Asynchronous execution of BLS requests will not block
+your model execution and can lead to speedups under certain conditions.
+
+The `bls_async` model will perform two async BLS requests on the
+[pytorch](../pytorch) and [addsub](../add_sub) models. Then, it will wait
+until the inference requests on these models are completed. It will extract
+`OUTPUT0` from the [pytorch](../pytorch) model and `OUTPUT1` from the
+[addsub](../add_sub) model to construct the final inference response object
+using these tensors.
+
+1. Create the model repository:
+
+```console
+$ mkdir -p models/add_sub/1
+$ mkdir -p models/bls_async/1
+$ mkdir -p models/pytorch/1
+
+# Copy the Python models
+$ cp examples/add_sub/model.py models/add_sub/1/
+$ cp examples/add_sub/config.pbtxt models/add_sub/
+$ cp examples/bls/async_model.py models/bls_async/1/model.py
+$ cp examples/bls/async_config.pbtxt models/bls_async/config.pbtxt
+$ cp examples/pytorch/model.py models/pytorch/1/
+$ cp examples/pytorch/config.pbtxt models/pytorch/
+```
+
+2. Start the tritonserver:
+
+```
+tritonserver --model-repository `pwd`/models
+```
+
+3. Send inference requests to server:
+
+```
+python3 examples/bls/async_client.py
+```
+
+You should see an output similar to the output below:
+
+```
+INPUT0 ([0.72394824 0.45873794 0.4307444 0.07681174]) + INPUT1 ([0.34224355 0.8271524 0.5831284 0.904624 ]) = OUTPUT0 ([1.0661918 1.2858903 1.0138729 0.9814357])
+INPUT0 ([0.72394824 0.45873794 0.4307444 0.07681174]) - INPUT1 ([0.34224355 0.8271524 0.5831284 0.904624 ]) = OUTPUT1 ([ 0.3817047 -0.36841443 -0.15238398 -0.82781225])
+```
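
The `sync_model.py` file referenced in the steps above is part of the commit but its body is not shown in this page's hunks. Purely as a hedged sketch of the `MODEL_NAME` dispatch described in the Synchronous BLS Requests section — assuming the standard `pb_utils` BLS API, with the exact committed code possibly differing — the core of its `execute` might look like this:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read which downstream model to call ("add_sub" or "pytorch")
            # from the MODEL_NAME input tensor. Note: a TYPE_STRING element
            # may arrive as bytes rather than str.
            model_name = pb_utils.get_input_tensor_by_name(
                request, 'MODEL_NAME').as_numpy()[0]

            # Forward INPUT0/INPUT1 unchanged to the selected model.
            infer_request = pb_utils.InferenceRequest(
                model_name=model_name,
                requested_output_names=['OUTPUT0', 'OUTPUT1'],
                inputs=[
                    pb_utils.get_input_tensor_by_name(request, 'INPUT0'),
                    pb_utils.get_input_tensor_by_name(request, 'INPUT1')
                ])
            infer_response = infer_request.exec()

            # Surface downstream errors (e.g. the third client request,
            # which uses an incorrect model name on purpose).
            if infer_response.has_error():
                raise pb_utils.TritonModelException(
                    infer_response.error().message())

            # Return the downstream model's outputs as the final response.
            responses.append(pb_utils.InferenceResponse(
                output_tensors=infer_response.output_tensors()))
        return responses
```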

examples/bls/async_client.py

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+from tritonclient.utils import *
+import tritonclient.http as httpclient
+import numpy as np
+
+model_name = "bls_async"
+shape = [4]
+
+with httpclient.InferenceServerClient("localhost:8000") as client:
+    input0_data = np.random.rand(*shape).astype(np.float32)
+    input1_data = np.random.rand(*shape).astype(np.float32)
+    inputs = [
+        httpclient.InferInput("INPUT0", input0_data.shape,
+                              np_to_triton_dtype(input0_data.dtype)),
+        httpclient.InferInput("INPUT1", input1_data.shape,
+                              np_to_triton_dtype(input1_data.dtype)),
+    ]
+    inputs[0].set_data_from_numpy(input0_data)
+    inputs[1].set_data_from_numpy(input1_data)
+
+    outputs = [
+        httpclient.InferRequestedOutput("OUTPUT0"),
+        httpclient.InferRequestedOutput("OUTPUT1"),
+    ]
+
+    response = client.infer(model_name,
+                            inputs,
+                            request_id=str(1),
+                            outputs=outputs)
+
+    result = response.get_response()
+    print("INPUT0 ({}) + INPUT1 ({}) = OUTPUT0 ({})".format(
+        input0_data, input1_data, response.as_numpy("OUTPUT0")))
+    print("INPUT0 ({}) - INPUT1 ({}) = OUTPUT1 ({})".format(
+        input0_data, input1_data, response.as_numpy("OUTPUT1")))
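
The `async_model.py` that this client exercises is likewise part of the commit but not shown on this page. As a rough sketch only — based on the behavior described in the README above (`OUTPUT0` from pytorch, `OUTPUT1` from addsub) and the documented `async_exec` API, not the committed code — its `execute` might be structured like this:

```python
import triton_python_backend_utils as pb_utils
import asyncio


class TritonPythonModel:

    async def execute(self, requests):
        responses = []
        for request in requests:
            inputs = [
                pb_utils.get_input_tensor_by_name(request, 'INPUT0'),
                pb_utils.get_input_tensor_by_name(request, 'INPUT1')
            ]

            # Launch both BLS requests without waiting on either one.
            pytorch_await = pb_utils.InferenceRequest(
                model_name='pytorch',
                requested_output_names=['OUTPUT0'],
                inputs=inputs).async_exec()
            addsub_await = pb_utils.InferenceRequest(
                model_name='add_sub',
                requested_output_names=['OUTPUT1'],
                inputs=inputs).async_exec()

            # Wait until both in-flight requests are complete.
            pytorch_response, addsub_response = await asyncio.gather(
                pytorch_await, addsub_await)
            for infer_response in (pytorch_response, addsub_response):
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message())

            # Combine OUTPUT0 from pytorch with OUTPUT1 from add_sub into
            # the final inference response.
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.get_output_tensor_by_name(pytorch_response, 'OUTPUT0'),
                pb_utils.get_output_tensor_by_name(addsub_response, 'OUTPUT1')
            ]))
        return responses
```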

examples/bls/async_config.pbtxt

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+name: "bls_async"
+backend: "python"
+
+input [
+  {
+    name: "INPUT0"
+    data_type: TYPE_FP32
+    dims: [ 4 ]
+  }
+]
+input [
+  {
+    name: "INPUT1"
+    data_type: TYPE_FP32
+    dims: [ 4 ]
+  }
+]
+output [
+  {
+    name: "OUTPUT0"
+    data_type: TYPE_FP32
+    dims: [ 4 ]
+  }
+]
+output [
+  {
+    name: "OUTPUT1"
+    data_type: TYPE_FP32
+    dims: [ 4 ]
+  }
+]
+
+instance_group [{ kind: KIND_CPU }]
