Add decoupled support for BLS #203
Conversation
Force-pushed 79dda74 to b2c3ac9.
Force-pushed b2c3ac9 to 89bc459 (… async_stream_exec()).
@Tabrizian The above comments are addressed; please review. For the shm memory leak that I mentioned before, I'm still looking into it and will update here once it's fixed. Thanks!
README.md
Outdated

    forever.

    - Currently, BLS can not run inference on a decoupled model.
    - BLS can not run inference on a decoupled model using functions
Let's remove these two since they are not limitations of BLS.
Just to clarify: the limitations that should be removed do not include the one stating that BLS can not run inference on a decoupled model in *async* mode, is this correct?
Sorry, we just need to remove the current bullet point. Perhaps we can reword the second bullet point to "Async BLS is not supported when running a Python model in decoupled mode".
Updated the limitation.
In this PR, ~~two APIs, `InferenceRequest.stream_exec()` and `InferenceRequest.async_stream_exec()`, are added for BLS decoupled support~~ two arguments, `decoupled` and `execution_timeout`, are added to the original `exec()` and `async_exec()` functions for BLS decoupled support. Here is the design doc for reference.

Under the hood, chained futures are used for retrieving responses from decoupled models (please refer to the design here). ~~Hence, instead of using a generator, the futures will gather the responses and return them as a list.~~ A generator that contains all the responses will be returned.

Update: currently, all the responses are retrieved first and then transferred to the user's Python model. In the next release, we plan to fix this and send the responses as they are being received.
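For illustration, here is a minimal sketch of how a BLS call into a decoupled model might look with the new `decoupled` argument. The model name, tensor names, and error handling below are placeholders rather than anything taken from this PR, and `execution_timeout` is left out since its exact units and semantics are not spelled out here:

```python
import triton_python_backend_utils as pb_utils


def call_decoupled_model(input_array):
    # Build a BLS request against a (placeholder) decoupled model.
    infer_request = pb_utils.InferenceRequest(
        model_name="decoupled_model",          # placeholder model name
        requested_output_names=["OUTPUT"],     # placeholder output name
        inputs=[pb_utils.Tensor("INPUT", input_array)])

    # With decoupled=True, exec() is expected to return a generator of
    # responses rather than a single response (sketch based on this PR's
    # description, not a verified signature).
    responses = infer_request.exec(decoupled=True)

    outputs = []
    for response in responses:
        if response.has_error():
            raise pb_utils.TritonModelException(response.error().message())
        out = pb_utils.get_output_tensor_by_name(response, "OUTPUT")
        # The final response from a decoupled model may carry no output
        # tensor, so skip empty responses.
        if out is not None:
            outputs.append(out.as_numpy())
    return outputs
```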
For the timeout parameter mentioned in the design doc, a timeout between two consecutive responses from the generator will not be needed, since the chained-futures implementation handles the case where a missing `TRITONSERVER_RESPONSE_COMPLETE_FINAL` flag would otherwise lead to an infinite loop.

Testing: triton-inference-server/server#5245
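For the async path, here is a hedged sketch of what calling `async_exec(decoupled=True)` from a (non-decoupled) Python model might look like; per the reworded limitation above, async BLS is not supported when the calling model itself runs in decoupled mode, and all names below are placeholders:

```python
import triton_python_backend_utils as pb_utils


# Assumes the calling model's execute() is declared as a coroutine
# (async def execute) so that BLS requests can be awaited.
async def call_decoupled_model_async(input_array):
    infer_request = pb_utils.InferenceRequest(
        model_name="decoupled_model",        # placeholder model name
        requested_output_names=["OUTPUT"],   # placeholder output name
        inputs=[pb_utils.Tensor("INPUT", input_array)])

    # Sketch: async_exec(decoupled=True) is assumed to resolve to the same
    # generator of responses that exec(decoupled=True) returns.
    responses = await infer_request.async_exec(decoupled=True)

    outputs = []
    for response in responses:
        if response.has_error():
            raise pb_utils.TritonModelException(response.error().message())
        out = pb_utils.get_output_tensor_by_name(response, "OUTPUT")
        if out is not None:
            outputs.append(out.as_numpy())
    return outputs
```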