
Consume the Responses in Statement::drop #8


Merged 1 commit into pgbouncer-mode on Apr 29, 2025

Conversation

@aqrln (Member) commented on Apr 29, 2025

Instead of dropping the Responses in Statement::drop and closing the channel stored in there, poll it for the response messages.
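
In broad strokes, the fix boils down to the pattern below. This is an illustrative, self-contained sketch with stand-in types (`Responses`, `Message`, and `StatementInner` here are simplified placeholders, not the real tokio_postgres internals): instead of letting the receiver drop and close its channel, it is moved into a spawned background task that drains it.

```rust
use tokio::sync::mpsc;

struct Message; // stand-in for the backend message type

// Simplified stand-in for the internal `Responses` handle: the receiving end
// of the channel the Connection writes parsed responses into.
struct Responses {
    receiver: mpsc::Receiver<Message>,
}

impl Responses {
    async fn next(&mut self) -> Option<Message> {
        self.receiver.recv().await
    }
}

struct StatementInner {
    responses: Option<Responses>,
}

impl Drop for StatementInner {
    fn drop(&mut self) {
        // The Close packet for the prepared statement is written as before.
        // But instead of dropping `Responses` here (which would close the
        // channel), move it into a background task and poll it until the
        // sending side hangs up.
        if let Some(mut responses) = self.responses.take() {
            // Spawning is what introduces the dependency on Tokio's runtime
            // feature discussed further down.
            tokio::spawn(async move {
                while let Some(_message) = responses.next().await {
                    // Draining keeps the channel alive and, importantly,
                    // wakes the Connection task as messages arrive.
                }
            });
        }
    }
}
```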

This appears to fix a weird failure that causes us to never read the response to the query back from the stream when we write a Close packet for a prepared statement and a packet for the portal created by query_typed in quick succession, and then start polling the read half of the stream. Somehow we receive the close success response but not the response to the query. However, if we poll, in Drop, the channel through which the Connection sends the response back to the caller, everything works fine.

Note that this should not be necessary, and it raises more questions than it answers.

Reading subsequent messages is not blocked by the response to the Close packet going unread. The Connection always reads through everything: it sends the parsed responses to the interested parties if the channel for that specific response is still alive, and otherwise it just skips over them, but it always reads everything to completion. That part works well and there are no issues with it. Moreover, if I just std::mem::forget the Responses in Statement::drop to keep it alive and force the Connection to write the response into the channel's buffer without it ever being read, the bug is still reproducible; it makes no difference. What actually does make a difference, judging from the traces, is that polling Responses::next notifies the Connection task's waker: this has a rather subtle effect and influences the timing of the operations, making us poll the socket for reading (which happens concurrently with us writing the query) a little bit earlier. Similarly, not doing anything in Statement::drop, and instead making the Connection artificially wake itself and busy-loop on CPU by calling cx.waker().wake_by_ref(), also completely resolves the issue.
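
For reference, the artificial self-wake experiment mentioned above amounts to the generic pattern below. This is not the actual Connection code, just a sketch of a future wrapper that re-schedules itself on every poll and therefore busy-loops instead of waiting for an I/O wakeup.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Wraps any future and requests another poll immediately whenever the inner
// future returns Pending, so the task keeps getting polled (and keeps
// checking the socket) even when no readiness event fires. Useful only as a
// timing experiment, never in production code.
struct BusyPoll<F>(F);

impl<F: Future + Unpin> Future for BusyPoll<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match Pin::new(&mut self.0).poll(cx) {
            Poll::Ready(output) => Poll::Ready(output),
            Poll::Pending => {
                // Re-schedule ourselves right away: this is the
                // `cx.waker().wake_by_ref()` trick from the paragraph above.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }
}
```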

The connection task is not deadlocked, and it is periodically woken up on the socket. If we leave the seemingly frozen application running, after some time we actually start receiving unrelated asynchronous notices from the server, and we successfully read them! There's just genuinely no response to the query ever received from the server. This means one of two things: either the write never actually finishes (despite the fact that we receive Poll::Ready from poll_flush), or the bug is actually on the server side.

That said, something somewhere else does seem to be blocked, as Ctrl+C doesn't work when the issue occurs, which means the signal handler task is not being polled for some reason. It was a big red herring that turned out to be a separate problem, possibly on our side.

The possibility of a server-side bug dependent on the order or timing of operations would also answer the question of "what does any of this have to do with pgbouncer or Supavisor?", to which I otherwise don't have a good answer. The fact that the issue is not reproducible when using normal Postgres is certainly interesting.

This is where I ended up yesterday before I ran out of the timebox allocated for the issue. There's more debugging to do to figure out the root cause, but it's not worth it to pursue this further at the moment.

This patch is probably not upstreamable (for one, the introduced dependency on the rt feature is somewhat of a breaking change, as it tightly couples the library with Tokio; without it, I believe, it's technically possible to use tokio_postgres with other runtimes or even WASM, despite the name), and it treats the symptom rather than the root cause. That said, it's safe, it doesn't hurt (for us), and it reliably and consistently fixes #26537/ORM-796, at least on my machine.
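
To illustrate the coupling (a hypothetical free function; the feature name is illustrative and `Responses` is the stand-in from the first sketch): the drain has to be spawned onto some executor, and the only one the patch can reach for is Tokio's.

```rust
// Hypothetical illustration of the feature coupling, reusing the stand-in
// `Responses` type from the sketch above.
#[cfg(feature = "rt")]
fn drain_in_background(mut responses: Responses) {
    // With a Tokio runtime available we can spawn the drain onto it.
    tokio::spawn(async move {
        while let Some(_message) = responses.next().await {}
    });
}

#[cfg(not(feature = "rt"))]
fn drain_in_background(_responses: Responses) {
    // Without a runtime there is no portable way to drive the drain from
    // `Drop`, which is why the patch is hard to upstream: it ties the
    // library to Tokio even for users on other executors or WASM.
}
```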

I didn't test, though, whether this still fixes the issue (or whether it occurs in the first place) when fewer than 3 threads are available (enough to run the tasks for the mobc connection, the tokio_postgres connection, and the background task spawned in Statement::drop in parallel), and I'd rather not block on the future in Statement::drop and degrade latency for everyone else for the sake of fixing an exotic issue.
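
For completeness, one way the rejected blocking variant could look, again with the stand-in types from the first sketch: it avoids relying on spare worker threads, but stalls whichever thread happens to drop the statement.

```rust
// Sketch of the rejected blocking variant (an alternative to the spawning
// `Drop` impl above, not in addition to it).
impl Drop for StatementInner {
    fn drop(&mut self) {
        if let Some(mut responses) = self.responses.take() {
            // Blocking the dropping thread until the drain finishes avoids
            // needing extra runtime threads, but it pauses whoever is
            // dropping the statement and adds latency to every drop, which
            // is a poor trade for such an exotic issue.
            futures::executor::block_on(async move {
                while let Some(_message) = responses.next().await {}
            });
        }
    }
}
```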

Instead of dropping the `Responses` in `Statement::drop` and closing the
channel stored in there, poll the response messages from it.
Base automatically changed from push-sztwulwxzrnp to pgbouncer-mode April 29, 2025 10:52
@aqrln aqrln merged commit 278641f into pgbouncer-mode Apr 29, 2025
@aqrln aqrln deleted the push-qkpqrxvrrvsu branch April 29, 2025 10:54
aqrln added commits to prisma/prisma-engines that referenced this pull request on Apr 29, 2025
Development

Successfully merging this pull request may close these issues.

Regression in 6.1.0+ causing queries to hang