@code-asher code-asher commented Dec 8, 2022

Previously we would spawn the daemon via -Dm, then attach with -x once it
was ready. This had a flaw: the daemon starts with a hardcoded 24x80 size,
and when the attach comes in it resizes, leaving a bunch of confusing
whitespace above your prompt.

Now we skip the daemon spawn and go straight for the attach, but with the
addition of -RR, which lets screen spawn the daemon for us if it does not
yet exist.
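
For illustration, a minimal sketch of the two invocations in Go. The sessionID and shell values are hypothetical placeholders, and any arguments beyond -Dm, -x, and -RR are my own assumptions rather than the PR's exact flags:

```go
package screendemo

import (
	"context"
	"os/exec"
)

// buildScreenCommands sketches the before/after invocations. Everything here
// except the -Dm, -x, and -RR flags is an illustrative assumption.
func buildScreenCommands(ctx context.Context, sessionID, shell string) (spawn, oldAttach, newAttach *exec.Cmd) {
	// Before: start the daemon detached via -Dm, then attach with -x once it
	// is ready. The daemon comes up at screen's hardcoded 24x80 size, so the
	// later attach forces a resize and leaves whitespace above the prompt.
	spawn = exec.CommandContext(ctx, "screen", "-S", sessionID, "-Dm", shell)
	oldAttach = exec.CommandContext(ctx, "screen", "-S", sessionID, "-x")

	// After: skip the explicit spawn and attach directly. -RR tells screen to
	// create the session itself if it does not exist yet, already sized to
	// the attaching terminal.
	newAttach = exec.CommandContext(ctx, "screen", "-S", sessionID, "-x", "-RR", shell)
	return spawn, oldAttach, newAttach
}
```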

Consequences:

1. We can only allow one attach at a time, because screen has no problem
   creating multiple daemons with the same name.
2. IDs cannot overlap, since screen will do partial matching when we do not
   include the PID, which we no longer have.
3. We do not know when the daemon exits, so cleanup only happens on the
   timeout.
4. We have to kill the session by sending the quit command through screen
   (a minimal sketch follows this list). When we do this it is possible the
   session is already dead.
5. If the daemon exits and the user reconnects before the timeout, the daemon
   will be respawned while the Go program remains blissfully unaware, assuming
   it has been up this whole time. This does not change anything in practice;
   it is just a bit different in terms of underlying architecture.
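
As mentioned in item 4, killing the session goes through screen itself rather than a process handle we own. A minimal sketch, with a hypothetical killSession helper; error handling in the real code may differ:

```go
package screendemo

import (
	"context"
	"os/exec"
)

// killSession asks screen to quit the named session via -X quit. Since we no
// longer track the daemon directly, the session may already be dead by the
// time this runs, in which case screen exits nonzero and callers can choose
// to ignore the error.
func killSession(ctx context.Context, sessionID string) error {
	return exec.CommandContext(ctx, "screen", "-S", sessionID, "-X", "quit").Run()
}
```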

In some ways this new architecture is actually simpler, with roughly the
same functionality, but it does not support concurrent attaches and
theoretically has a greater danger of desyncing from screen's own state
(although at the moment nothing concretely dangerous comes to mind).

So says the warning emitted by the websocket library every time we try
to set StatusAbnormalClosure.
@code-asher code-asher (Member, Author) commented:

This can be experimented with here: https://github.com/coder/v1/pull/13400

@code-asher code-asher marked this pull request as ready for review December 8, 2022 18:25
@spikecurtis spikecurtis (Contributor) left a comment:

Generally looks good, but I have some concerns about leaking screen sessions or the goroutines that are meant to stop them.

@code-asher code-asher force-pushed the fix-connect-size branch 2 times, most recently from 1e3e3c9 to b088440 Compare December 9, 2022 19:02
Hopefully makes it a bit easier to see what is going on.
Trying to figure out why a session is prematurely closing.

To do this I am setting the error via setState (so I can add the reason at
the close call site) rather than checking for a nil error and returning it in
Attach. Alternatively I was thinking of adding a reason arg to setState, but
only the close state needs a reason, and ultimately it was transformed into
the error anyway, so we might as well do it earlier and skip a step.
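
For reference, a rough sketch of the shape described above, using hypothetical types and a simplified setState signature (the real setState and Attach may look different):

```go
package screendemo

import "sync"

// These types are hypothetical stand-ins for the PR's session state handling.
type state int

const (
	stateReady state = iota
	stateClosed
)

type session struct {
	mu    sync.Mutex
	state state
	err   error
}

// setState records the new state together with the reason it was entered, so
// the close call site can supply a descriptive error up front instead of
// Attach checking for a nil error and constructing one later.
func (s *session) setState(st state, err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state = st
	s.err = err
}
```
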
@code-asher code-asher force-pushed the fix-connect-size branch 6 times, most recently from e76f34a to 6391e65 Compare December 9, 2022 20:29
This was canceling while the reconnect test was running.
@code-asher code-asher force-pushed the fix-connect-size branch 2 times, most recently from 54778d7 to 698c327 Compare December 10, 2022 02:32
Having too many seems to be causing some to exit unexpectedly.
@code-asher code-asher (Member, Author) commented Dec 16, 2022

I think I got all the flakes:

  • There was a 10 second timeout for each test which occasionally ended the test prematurely. Increase it to 30 seconds.
  • We wait almost exactly as long as it takes for the session timeout to hit. If the goroutine that deletes the session from the map has not run yet we can end up grabbing a closed session. Add some code to create a new session when that happens by checking the state first (see the sketch after this list).
  • For similar reasons the heartbeat would fire after closing (since the goroutine that cancels the heartbeat may not have run yet). This does not cause any flakes but seemed like something worth fixing, as it starts off another timer which I assume will delay garbage collection.
  • It seems we can also end up connecting just before the timeout kicks in and closes the session, which gives us an existing session and breaks the new session test. To fix this, decrease the timeout so we are definitely connecting after the timeout hits and the session closes. Technically this would mask the second flake if it were still an issue, which is unfortunate. Might need to refactor these tests to not use time.Sleep(), but maybe this is good enough for now.
  • Lower the simultaneous connections test to 10 at a time. I kept seeing sessions terminate for apparently no reason (not from our code; they just terminated from screen's end) and I think it was due to the number of sessions being spawned, since I have not seen the same problem since lowering it. They do not look like OOMs, so I think maybe there is a limit in screen or maybe it hits process limits in CI.
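
The state check from the second bullet, as a rough sketch with hypothetical names throughout (manager, getOrCreate, and the session type are placeholders, not the PR's actual identifiers):

```go
package screendemo

import "sync"

// Hypothetical minimal types; the real manager and session are more involved.
type sessionState int

const (
	sessionReady sessionState = iota
	sessionClosed
)

type screenSession struct {
	mu    sync.Mutex
	state sessionState
}

func (s *screenSession) currentState() sessionState {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.state
}

type manager struct {
	mu       sync.Mutex
	sessions map[string]*screenSession
}

// getOrCreate returns the mapped session only if it is still alive. If the
// timeout already closed it but the goroutine that deletes it from the map
// has not run yet, we replace it with a fresh session rather than handing
// back a closed one.
func (m *manager) getOrCreate(id string) *screenSession {
	m.mu.Lock()
	defer m.mu.Unlock()
	if s, ok := m.sessions[id]; ok && s.currentState() != sessionClosed {
		return s
	}
	s := &screenSession{state: sessionReady} // the real code would spawn screen here
	m.sessions[id] = s
	return s
}
```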

@code-asher code-asher merged commit 5ba2389 into master Dec 19, 2022
@code-asher code-asher deleted the fix-connect-size branch December 19, 2022 23:57