flake: TestReinitializeAgent #642

Open
dannykopping opened this issue May 15, 2025 · 5 comments

@dannykopping
Collaborator

Seen here: https://github.com/coder/coder/actions/runs/15036339021/job/42258705161#step:9:3375

    workspaceagents_test.go:210: 
        	Error Trace:	C:/actions-runner/coder/coder/coderd/coderdtest/coderdtest.go:1140
        	            				C:/actions-runner/coder/coder/enterprise/coderd/workspaceagents_test.go:210
        	Error:      	Condition never satisfied
        	Test:       	TestReinitializeAgent
@mafredri
Member

mafredri commented May 15, 2025

Another one: https://github.com/coder/coder/actions/runs/15040583846/job/42271173671?pr=17845#step:8:584

    t.go:106: 2025-05-15 08:48:02.537 [info]  coderd.workspace_agent_reinit_watcher: agent reinitialization  workspace_agent_id=45eb4eb5-63ef-4add-9dc7-6f7e581d81b6  request_id=25915337-6dcd-4609-8194-f37453c072b4  error="context canceled"
    workspaceagents_test.go:2678: 
        	Error Trace:	/home/runner/work/coder/coder/coderd/workspaceagents_test.go:2678
        	            				/opt/hostedtoolcache/go/1.24.2/x64/src/runtime/asm_amd64.s:1700
        	Error:      	Received unexpected error:
        	            	execute request:
        	            	    github.com/coder/coder/v2/codersdk/agentsdk.(*Client).WaitForReinit
        	            	        /home/runner/work/coder/coder/codersdk/agentsdk/agentsdk.go:737
        	            	  - Get "http://localhost:40399/api/v2/workspaceagents/me/reinit": context deadline exceeded
        	Test:       	TestReinit
    workspaceagents_test.go:2695: 
        	Error Trace:	/home/runner/work/coder/coder/coderd/workspaceagents_test.go:2695
        	Error:      	Expected value not to be nil.
        	Test:       	TestReinit

(I do realize it's a different test, but seems related).

@hugodutka

Seen again. https://github.com/coder/coder/actions/runs/15060386746/job/42334329518 The test doesn’t seem to pass on Windows with Postgres.

@dannykopping
Collaborator Author

@spikecurtis

I think the root cause is this startup script failure:

    t.go:106: 2025-05-15 04:10:09.013 [warn]  cli: C:\Users\ADMINI~1\AppData\Local\Temp\TestReinitializeAgent507857436\004\coder-script-d67114ba-5dad-4f8c-87b3-15a126377b51.log script failed  log_source_id=d67114ba-5dad-4f8c-87b3-15a126377b51  log_path="C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\TestReinitializeAgent507857436\\004\\coder-script-d67114ba-5dad-4f8c-87b3-15a126377b51.log"  script_data_dir="C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\coder-script-data\\d67114ba-5dad-4f8c-87b3-15a126377b51"  execution_time=525.737ms  exit_code=1  error="exit status 1"
    t.go:106: 2025-05-15 04:10:09.014 [info]  cli: stderr: 2025-05-15 04:10:09.013 [warn]  cli: C:\Users\ADMINI~1\AppData\Local\Temp\TestReinitializeAgent507857436\004\coder-script-d67114ba-5dad-4f8c-87b3-15a126377b51.log script failed  log_source_id=d67114ba-5dad-4f8c-87b3-15a126377b51  log_path="C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\TestReinitializeAgent507857436\\004\\coder-script-d67114ba-5dad-4f8c-87b3-15a126377b51.log"  script_data_dir="C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\coder-script-data\\d67114ba-5dad-4f8c-87b3-15a126377b51"  execution_time=525.737ms  exit_code=1  error="exit status 1"
    t.go:106: 2025-05-15 04:10:09.014 [warn]  cli: startup script(s) failed  error="run agent script \"d67114ba-5dad-4f8c-87b3-15a126377b51\": exit status 1"
    t.go:106: 2025-05-15 04:10:09.014 [info]  cli: stderr: 2025-05-15 04:10:09.014 [warn]  cli: startup script(s) failed  error="run agent script \"d67114ba-5dad-4f8c-87b3-15a126377b51\": exit status 1"
    t.go:106: 2025-05-15 04:10:09.014 [debu]  cli: set lifecycle state  current={"state":"start_error","changed_at":"2025-05-15T04:10:09.014599Z"}  last={"state":"starting","changed_at":"2025-05-15T04:10:08.45109Z"}
    t.go:106: 2025-05-15 04:10:09.014 [info]  cli: stderr: 2025-05-15 04:10:09.014 [debu]  cli: set lifecycle state  current={"state":"start_error","changed_at":"2025-05-15T04:10:09.014599Z"}  last={"state":"starting","changed_at":"2025-05-15T04:10:08.45109Z"}
    t.go:106: 2025-05-15 04:10:09.014 [debu]  cli: reporting lifecycle state  payload="lifecycle:{state:START_ERROR  changed_at:{seconds:1747282209  nanos:14599000}}"
    t.go:106: 2025-05-15 04:10:09.014 [info]  cli: stderr: 2025-05-15 04:10:09.014 [debu]  cli: reporting lifecycle state  payload="lifecycle:{state:START_ERROR  changed_at:{seconds:1747282209  nanos:14599000}}"

The startup script calls `printenv`, which isn't an available command on Windows. On my Windows system I get:

printenv: The term 'printenv' is not recognized as a name of a cmdlet, function, script file, or executable program.
Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
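
One way to avoid depending on `printenv` would be to pick the environment-dumping command per platform. The following is a minimal sketch, not code from coder/coder: `envDumpScript` is a hypothetical helper, and the Windows branch assumes the agent runs startup scripts under PowerShell (as the error message above suggests). It has not been verified against `TestReinitializeAgent`.

```go
package main

import (
	"fmt"
	"runtime"
)

// envDumpScript returns a one-line script that prints the process
// environment as KEY=VALUE lines.
// Hypothetical helper for illustration; it is not part of coder/coder.
func envDumpScript(goos string) string {
	if goos == "windows" {
		// Assumption: the startup script is executed by PowerShell.
		// The Env: drive exposes environment variables there.
		return `Get-ChildItem Env: | ForEach-Object { "$($_.Name)=$($_.Value)" }`
	}
	// printenv is available on typical Linux and macOS systems.
	return "printenv"
}

func main() {
	// Print the script that would be chosen for the current platform.
	fmt.Println(envDumpScript(runtime.GOOS))
}
```

Passing the OS name as a parameter instead of reading `runtime.GOOS` inside the helper keeps both branches checkable from any platform.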

SasSwart added a commit to coder/coder that referenced this issue May 22, 2025
…7968)

relates to coder/internal#642

I've reached my timebox trying to get a script for Windows to work, so
I'm skipping the test on Windows for now.
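
For readers unfamiliar with the pattern, a platform-conditional skip might look roughly like the sketch below. The actual change in the referenced commit is not shown in this thread, so `windowsSkipReason` and its message are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// windowsSkipReason returns a non-empty reason when a test should be
// skipped on the given OS, and an empty string otherwise.
// Hypothetical helper; the real commit may do this differently.
func windowsSkipReason(goos string) string {
	if goos == "windows" {
		return "startup script relies on printenv, which is unavailable on Windows (coder/internal#642)"
	}
	return ""
}

func main() {
	if reason := windowsSkipReason(runtime.GOOS); reason != "" {
		fmt.Println("skip:", reason)
		return
	}
	fmt.Println("run test")
}
```

Inside a real test function the same check would call `t.Skip(reason)` from the standard `testing` package instead of printing.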
hugodutka added a commit to coder/coder that referenced this issue May 22, 2025
This PR starts running test-go-pg on macOS and Windows in regular CI.
Previously this suite was only run in the nightly gauntlet for 2
reasons:

- it was flaky
- it was slow (took 17 minutes)

We've since stabilized the flakiness by switching to depot runners,
using ram disks, optimizing the number of tests run in parallel, and
automatically re-running failing tests. We've also [brought
down](#17756) the time to run the
suite to 9 minutes. Additionally, this PR allows test-go-pg to use cache
from previous runs, which speeds it up further. The cache is only used
on PRs, `main` will still run tests without it.

This PR also:

- removes the nightly gauntlet since all tests now run in regular CI
- removes the `test-cli` job for the same reason
- removes the `setup-imdisk` action which is now fully replaced by
[coder/setup-ramdisk-action](https://github.com/coder/setup-ramdisk-action)
- makes 2 minor changes which could be separate PRs, but I rolled them
  into this because they were helpful when iterating on it:
  - replace the `if: always()` condition on the `gen` job with an
    `if: ${{ !cancelled() }}` to allow the job to be cancelled. Previously
    the job would run to completion even if the entire workflow was
    cancelled. See [the GitHub
    docs](https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/evaluate-expressions-in-workflows-and-actions#always)
    for more details.
  - disable the recently added `TestReinitializeAgent` since it does not
    pass on Windows with Postgres. There's an open issue to fix it:
    coder/internal#642

This PR will:

- unblock #15109
- alleviate coder/internal#647

I tested caching by temporarily enabling cache upload on this PR: here's
[a
run](https://github.com/coder/coder/actions/runs/15119046903/job/42496939341?pr=17853#step:13:1296)
showing cache being used.