Consume the `Responses` in `Statement::drop`
#8
Merged
Instead of dropping the `Responses` in `Statement::drop` and closing the channel stored in there, poll it for the response messages.

This appears to fix a weird failure that causes us to never read a response to the query back from the stream if we write a close packet for a prepared statement and a packet for the portal created by `query_typed` in quick succession, and then start polling the read part of the stream. Somehow we receive the close success response but not the response to the query. However, if we poll the channel through which the `Connection` sends the response back to the caller in `Drop`, everything works fine. Note that this should not be necessary, and it raises more questions than answers.
Reading subsequent messages is not blocked on the response to the close packet going unread. The `Connection` always reads through everything, and it sends the parsed responses to the interested parties if the channel for that specific response is still alive; otherwise it just skips over it, but it always reads everything to completion. That part works well and there are no issues with it. Moreover, if I just `std::mem::forget` the `Responses` in `Statement::drop` to keep it alive and force the `Connection` to write the response to the channel's buffer without ever reading it, the bug is still reproducible; it makes no difference. What actually does make a difference, judging from the traces, is that polling `Responses::next` notifies the `Connection` task's waker: this has a rather subtle effect and influences the timing of the operations, making us poll the socket for reading (which happens concurrently with us writing the query) a little bit earlier. Similarly, not doing anything in `Statement::drop`, and instead making the `Connection` artificially wake itself and busy-loop on CPU by calling `cx.waker().wake_by_ref()`, also completely resolves the issue.

The connection task is not deadlocked, and it is periodically woken up on the socket. If we leave the seemingly frozen application running, after some time we actually start receiving unrelated asynchronous notices from the server, and we successfully read them! There's just genuinely no response to the query ever received from the server. This means one of two things: either the write is never actually finished (contrary to the fact that we receive `Poll::Ready` in `poll_flush`), or the bug is actually on the server side. That said, something somewhere else does seem to be blocked, as Ctrl+C doesn't work when the issue occurs, which means the signal handler task is not being polled for some reason. It was a big red herring that turned out to be a separate problem, possibly on our side.
The possibility of a server-side bug dependent on the order or timing of operations would also answer the question of "what does any of this have to do with pgbouncer or Supavisor?", to which I don't have a good answer otherwise. The fact that the issue is not reproducible when using plain Postgres is certainly interesting.
This is where I ended up yesterday before I ran out of the timebox allocated for the issue. There's more debugging to do to figure out the root cause, but it's not worth pursuing further at the moment.
This patch is probably not upstreamable (for one, the introduced dependency on the `rt` feature is somewhat of a breaking change, as it tightly couples the library with Tokio; without it, I believe, it's technically possible to use `tokio_postgres` with other runtimes or even WASM, despite the name), and it treats the symptom rather than the root cause. That said, it's safe, it doesn't hurt (for us), and it reliably and consistently fixes #26537/ORM-796, at least on my machine.

I didn't test, though, whether this still fixes the issue (or whether it occurs in the first place) when fewer than 3 threads are available (needed to run the tasks for the mobc connection, the tokio_postgres connection, and the background task spawned in `Statement::drop` in parallel), and I'd rather not block on the future in `Statement::drop` and degrade latency for everyone else for the sake of fixing an exotic issue.