Conversation

@krishung5 (Contributor) commented Jan 11, 2023

In this PR, two APIs, InferenceRequest.stream_exec() and InferenceRequest.async_stream_exec(), are added for BLS decoupled support, and two arguments, decoupled and execution_timeout, are added to the original exec() and async_exec() functions. Here is the design doc for reference.
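
As a rough usage sketch based on the description above (not taken from the PR diff itself), a Python model might use the new decoupled argument like this; the model name and tensor names are placeholders:

```python
import numpy as np
import triton_python_backend_utils as pb_utils

# Hypothetical BLS call against a decoupled model; "decoupled_model",
# "INPUT0", and "OUTPUT0" are placeholder names for illustration.
input_tensor = pb_utils.Tensor('INPUT0', np.array([1, 2, 3], dtype=np.int32))
infer_request = pb_utils.InferenceRequest(
    model_name='decoupled_model',
    requested_output_names=['OUTPUT0'],
    inputs=[input_tensor])

# With decoupled=True, exec() returns an iterable of responses instead of a
# single InferenceResponse, since a decoupled model may send any number of
# responses (including zero) for one request.
responses = infer_request.exec(decoupled=True)
for response in responses:
    if response.has_error():
        raise pb_utils.TritonModelException(response.error().message())
    output = pb_utils.get_output_tensor_by_name(response, 'OUTPUT0')
```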

Under the hood, chained futures are used for retrieving responses from decoupled models (please refer to the design here). Hence, instead of yielding each response as it arrives, the futures gather all the responses into a list, and a generator that contains all the responses is returned.

Update: Currently, all the responses are retrieved first and then transferred to the user's Python model. For the next release, we will fix this issue and send the responses as they are being received.

For the timeout parameter mentioned in the design doc, a timeout between two consecutive responses from the generator is not needed, since the chained-futures implementation handles the case where a missing TRITONSERVER_RESPONSE_COMPLETE_FINAL flag could otherwise lead to an infinite loop.
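
As a conceptual sketch of the chained-futures pattern described above (the real implementation lives in the backend and differs in detail), each future resolves to one response, its flags, and the next future in the chain; the walk ends when the FINAL flag is observed, which is why no inter-response timeout is required:

```python
from concurrent.futures import Future

# Stand-in for the TRITONSERVER_RESPONSE_COMPLETE_FINAL bit in the C API.
FINAL_FLAG = 1

def make_chain(payloads):
    """Build a chain of pre-resolved futures for demonstration. In the real
    backend, each future would instead be resolved by a response callback."""
    head = Future()
    future = head
    for i, payload in enumerate(payloads):
        next_future = Future()
        flags = FINAL_FLAG if i == len(payloads) - 1 else 0
        future.set_result((payload, flags, next_future))
        future = next_future
    return head

def gather_decoupled_responses(first_future):
    """Walk the chain and gather every response until the FINAL flag is
    observed. Because the flag terminates the walk, no timeout between
    consecutive responses is needed to avoid an infinite loop."""
    responses = []
    future = first_future
    while True:
        response, flags, next_future = future.result()
        responses.append(response)
        if flags & FINAL_FLAG:
            break
        future = next_future
    # Matching the current behavior noted in the update above: gather
    # everything first, then hand the responses back as a generator.
    return (r for r in responses)

print(list(gather_decoupled_responses(make_chain(['r0', 'r1', 'r2']))))
# -> ['r0', 'r1', 'r2']
```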

Testing: triton-inference-server/server#5245

@krishung5 force-pushed the krish-bls-decoupled branch from b2c3ac9 to 89bc459 on February 1, 2023 16:58
@krishung5 (Contributor, Author)

@Tabrizian The above comments are addressed; please review. For the shared-memory leak that I mentioned before, I'm still looking into it and will update here once it's fixed. Thanks!

@krishung5 requested a review from Tabrizian on February 1, 2023 21:10
README.md Outdated
forever.

- Currently, BLS can not run inference on a decoupled model.
- BLS can not run inference on a decoupled model using functions
@Tabrizian (Member)

Let's remove these two since they are not limitations of BLS.

@krishung5 (Contributor, Author)

Just to clarify: the limitations that should be removed do not include the one stating that BLS cannot run inference on a decoupled model in *async* mode, is that correct?

@Tabrizian (Member) Feb 7, 2023

Sorry, we just need to remove the current bullet point. Perhaps we can reword the second bullet point to "Async BLS is not supported when running a Python model in decoupled mode".

@krishung5 (Contributor, Author)

Updated the limitation.

@krishung5 requested a review from Tabrizian on February 6, 2023 20:08
@krishung5 requested a review from Tabrizian on February 7, 2023 20:07
Tabrizian previously approved these changes Feb 7, 2023
@krishung5 merged commit 0aef2c4 into main on Feb 11, 2023
@krishung5 deleted the krish-bls-decoupled branch on February 11, 2023 05:30