Skip to content

[🐛 Bug]: Intermittent session errors on previously stable test suite #14580

Closed as not planned
@duncan-bayne

Description

@duncan-bayne

What happened?

Firstly, thanks for Selenium 🙏 We've been using it for 15 years, and have built a large and useful automated test suite with it ❤️

Recently we've been seeing intermittent session errors, in tests that have run (in some cases for years) without issues. These happen at various times in the test lifecycle, from before(:each) to after(:suite). They take the form:

[2024-10-04T05:14:52Z]   1) play playlist in fullscreen mode
[2024-10-04T05:14:52Z]      Got 0 failures and 4 other errors:
[2024-10-04T05:14:52Z]
[2024-10-04T05:14:52Z]      1.1) Failure/Error: super(*args, **kwargs.except(:skip_waiting_for_load))
[2024-10-04T05:14:52Z]
[2024-10-04T05:14:52Z]           Selenium::WebDriver::Error::UnknownError:
[2024-10-04T05:14:52Z]             Unable to execute request for an existing session: java.util.concurrent.TimeoutException
[2024-10-04T05:14:52Z]             Build info: version: '4.20.0', revision: '866c76ca80'
[2024-10-04T05:14:52Z]             System info: os.name: 'Linux', os.arch: 'aarch64', os.version: '6.1.106-116.188.amzn2023.aarch64', java.version: '17.0.11'
[2024-10-04T05:14:52Z]             Driver info: driver.version: unknown
[2024-10-04T05:14:52Z]           # ./spec/support/helpers/test_helpers/capybara_helpers.rb:7:in `visit'
[2024-10-04T05:14:52Z]           # ./spec/features/fullscreen/playlist_spec.rb:33:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z]           # ./spec/rails_helper.rb:135:in `block (3 levels) in <top (required)>'
[2024-10-04T05:14:52Z]           # ./spec/rails_helper.rb:134:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z]           # ./spec/support/good_job.rb:7:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z]           # ./spec/support/database_cleaner.rb:19:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z]           # ./spec/spec_helper.rb:177:in `block (4 levels) in <top (required)>'
[2024-10-04T05:14:52Z]           # ./app/lib/tiny_otel.rb:86:in `in_span'
[2024-10-04T05:14:52Z]           # ./spec/spec_helper.rb:176:in `block (3 levels) in <top (required)>'
[2024-10-04T05:14:52Z]           # ./spec/spec_helper.rb:163:in `block (2 levels) in <top (required)>'
[2024-10-04T05:14:52Z]           # ------------------
[2024-10-04T05:14:52Z]           # --- Caused by: ---
[2024-10-04T05:14:52Z]           # Selenium::WebDriver::Error::WebDriverError:
[2024-10-04T05:14:52Z]           #   java.lang.RuntimeException: Unable to execute request for an existing session: java.util.concurrent.TimeoutException

Retrying the test will often cause it to pass.

We tried reproducing the failures locally. This was unsuccessful; the failures only happen in CI (running through Buildkite, inside Docker containers).

We then took the following steps to try to eliminate other causes of the intermittent failures. All were unsuccessful.

  1. Specifying a custom HTTP client and increasing the timeout.
  2. Backing off and retrying in the case of these errors.
  3. Upgrading selenium-webdriver Gem to the very latest version (4.25.0).
  4. Upgrading to Chrome 124 from 117.
  5. Connecting to a running CI agent and inspecting. Nothing seemed amiss. CPU, memory, usage all normal.
  6. Bumping shm allocation to 8GiB from 2GiB on the containers.
  7. Downloading system logs from CI agents while running to check for OOM killing.
  8. Detaching a CI instance from Buildkite, and run the tests manually there. The intermittent session errors persisted.
  9. Doubling the size of the CI agents from c7g.8xlarge to c7g.16xlarge, in case of resource constraints.
  10. Switching from Chrome to Firefox (I was particularly surprised that this didn't resolve the issue).
  11. Explicitly closing the session at the end of every suite.
  12. Removing all session resets from tests and helpers.
  13. Trying x86 instead of ARM build agents.
  14. Running just one test group manually. This also produced the same session errors, eliminating the sheer number of tests, or the number of parallel Docker containers.
  15. Downgrading selenium-webdriver to a version from 2023.
  16. Installing Chromium and Chromedriver locally on the test container, instead of using a remote browser configuration across containers. This produced the same session errors, but with a local failure message.

We also tested switching from Selenium to Cuprite; when running under Cuprite, the tests worked reliably, which led us to believe the issue lies with Selenium.

How can we reproduce the issue?

We can't reproduce this outside the context of our CI system; however the debugging steps above lead me to believe it's an issue with Selenium.

If debugging on our stack could help resolve the issue, I'm very happy to look into how we might make that possible.

Relevant log output

When running Chrome in its own container via a remote browser connection, this is what we see in the logs when one of these failures occur:

2024-10-07 05:37:09 DEBUG Selenium [:command] -> GET session/c000c2996cd5f723ea858b5e9a58281b/window 
2024-10-07 05:37:09 DEBUG Selenium [:header]    <<<  {"content-length"=>["44"], "content-type"=>["application/json; charset=utf-8"], "cache-control"=>["no-cache"]} 
2024-10-07 05:37:09 DEBUG Selenium [:command] <- {"value":"316ADB5F14A7C4783939C2D19625AD72"} 
2024-10-07 05:37:09 DEBUG Selenium [:command] -> POST session/c000c2996cd5f723ea858b5e9a58281b/window/rect 
2024-10-07 05:37:09 DEBUG Selenium [:command]    >>> http://chrome:4444/wd/hub/session/c000c2996cd5f723ea858b5e9a58281b/window/rect | {"x":null,"y":null,"width":1920,"height":1080} 
2024-10-07 05:37:09 DEBUG Selenium [:header]      > {"Accept"=>"application/json", "Content-Type"=>"application/json; charset=UTF-8", "User-Agent"=>"selenium/4.25.0 (ruby linux)", "Content-Length"=>"46"} 
2024-10-07 05:37:09 DEBUG Selenium [:header]    <<<  {"content-length"=>["50"], "content-type"=>["application/json; charset=utf-8"], "cache-control"=>["no-cache"]} 
2024-10-07 05:37:09 DEBUG Selenium [:command] <- {"value":{"height":1080,"width":1920,"x":0,"y":0}} 
2024-10-07 05:37:09 DEBUG Selenium [:command] -> POST session/c000c2996cd5f723ea858b5e9a58281b/goog/cdp/execute 
2024-10-07 05:37:09 DEBUG Selenium [:command]    >>> http://chrome:4444/wd/hub/session/c000c2996cd5f723ea858b5e9a58281b/goog/cdp/execute | {"cmd":"Emulation.setUserAgentOverride","params":{"userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"}} 
2024-10-07 05:37:09 DEBUG Selenium [:header]      > {"Accept"=>"application/json", "Content-Type"=>"application/json; charset=UTF-8", "User-Agent"=>"selenium/4.25.0 (ruby linux)", "Content-Length"=>"167"} 
2024-10-07 05:37:09 DEBUG Selenium [:header]    <<<  {"content-length"=>["12"], "content-type"=>["application/json; charset=utf-8"], "cache-control"=>["no-cache"]} 
2024-10-07 05:37:09 DEBUG Selenium [:command] <- {"value":{}} 
2024-10-07 05:37:09 DEBUG Selenium [:command] -> POST session/c000c2996cd5f723ea858b5e9a58281b/goog/cdp/execute 
2024-10-07 05:37:09 DEBUG Selenium [:command]    >>> http://chrome:4444/wd/hub/session/c000c2996cd5f723ea858b5e9a58281b/goog/cdp/execute | {"cmd":"Emulation.setTouchEmulationEnabled","params":{"enabled":false}} 
2024-10-07 05:37:09 DEBUG Selenium [:header]      > {"Accept"=>"application/json", "Content-Type"=>"application/json; charset=UTF-8", "User-Agent"=>"selenium/4.25.0 (ruby linux)", "Content-Length"=>"71"} 
2024-10-07 05:37:09 DEBUG Selenium [:header]    <<<  {"content-length"=>["12"], "content-type"=>["application/json; charset=utf-8"], "cache-control"=>["no-cache"]} 
2024-10-07 05:37:09 DEBUG Selenium [:command] <- {"value":{}} 
2024-10-07 05:37:09 DEBUG Selenium [:command] -> POST session/c000c2996cd5f723ea858b5e9a58281b/goog/cdp/execute 
2024-10-07 05:37:09 DEBUG Selenium [:command]    >>> http://chrome:4444/wd/hub/session/c000c2996cd5f723ea858b5e9a58281b/goog/cdp/execute | {"cmd":"Emulation.clearDeviceMetricsOverride","params":{}} 
2024-10-07 05:37:09 DEBUG Selenium [:header]      > {"Accept"=>"application/json", "Content-Type"=>"application/json; charset=UTF-8", "User-Agent"=>"selenium/4.25.0 (ruby linux)", "Content-Length"=>"58"} 
2024-10-07 05:37:09 DEBUG Selenium [:header]    <<<  {"content-length"=>["12"], "content-type"=>["application/json; charset=utf-8"], "cache-control"=>["no-cache"]} 
2024-10-07 05:37:09 DEBUG Selenium [:command] <- {"value":{}} 
2024-10-07 05:37:09 DEBUG Selenium [:command] -> POST session/c000c2996cd5f723ea858b5e9a58281b/url 
2024-10-07 05:37:09 DEBUG Selenium [:command]    >>> http://chrome:4444/wd/hub/session/c000c2996cd5f723ea858b5e9a58281b/url | {"url":"http://test.host:30000/playlists/874/edit"} 
2024-10-07 05:37:09 DEBUG Selenium [:header]      > {"Accept"=>"application/json", "Content-Type"=>"application/json; charset=UTF-8", "User-Agent"=>"selenium/4.25.0 (ruby linux)", "Content-Length"=>"51"} 
2024-10-07 05:37:09 DEBUG Selenium [:header]    <<<  {"content-length"=>["5228"], "cache-control"=>["none"], "content-type"=>["application/json; charset=utf-8"]} 
2024-10-07 05:37:09 DEBUG Selenium [:command] <- {
  "value": {
    "error": "unknown error",
    "message": "Unable to execute request for an existing session: java.util.concurrent.TimeoutException\nBuild info: version: '4.20.0', revision: '866c76ca80'\nSystem info: os.name: 'Linux', os.arch: 'aarch64', os.version: '6.1.106-116.188.amzn2023.aarch64', java.version: '17.0.11'\nDriver info: driver.version: unknown",
    "stacktrace": "java.lang.RuntimeException: Unable to execute request for an existing session: java.util.concurrent.TimeoutException\nBuild info: version: '4.20.0', revision: '866c76ca80'\nSystem info: os.name: 'Linux', os.arch: 'aarch64', os.version: '6.1.106-116.188.amzn2023.aarch64', java.version: '17.0.11'\nDriver info: driver.version: unknown\n\tat org.openqa.selenium.grid.router.HandleSession.execute(HandleSession.java:208)\n\tat org.openqa.selenium.remote.http.Route$PredicatedRoute.handle(Route.java:383)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:346)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.grid.router.Router.execute(Router.java:87)\n\tat org.openqa.selenium.grid.web.EnsureSpecCompliantResponseHeaders.lambda$apply$0(EnsureSpecCompliantResponseHeaders.java:34)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:346)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.remote.http.Route$NestedRoute.handle(Route.java:270)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:346)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:346)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:346)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:346)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)\n\tat org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)\n\tat org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)\n\tat org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:44)\n\tat java.base\u002fjava.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)\n\tat java.base\u002fjava.util.concurrent.FutureTask.run(Unknown Source)\n\tat java.base\u002fjava.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base\u002fjava.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base\u002fjava.lang.Thread.run(Unknown Source)\nCaused by: java.util.concurrent.TimeoutException\n\tat java.base\u002fjava.util.concurrent.CompletableFuture.timedGet(Unknown Source)\n\tat java.base\u002fjava.util.concurrent.CompletableFuture.get(Unknown Source)\n\tat org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute0(JdkHttpClient.java:379)\n\tat org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)\n\tat org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:55)\n\tat org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute(JdkHttpClient.java:352)\n\tat org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:54)\n\tat org.openqa.selenium.grid.web.ReverseProxyHandler.execute(ReverseProxyHandler.java:90)\n\tat org.openqa.selenium.grid.node.DefaultActiveSession.execute(DefaultActiveSession.java:62)\n\tat org.openqa.selenium.grid.node.local.SessionSlot.execute(SessionSlot.java:126)\n\tat org.openqa.selenium.grid.node.local.LocalNode.executeWebDriverCommand(LocalNode.java:628)\n\tat org.openqa.selenium.grid.node.ForwardWebDriverCommand.execute(ForwardWebDriverCommand.java:35)\n\tat org.openqa.selenium.remote.http.Route$PredicatedRoute.handle(Route.java:383)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler.execute(SpanWrappedHttpHandler.java:87)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:346)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:69)\n\tat org.openqa.selenium.grid.node.Node.execute(Node.java:262)\n\tat org.openqa.selenium.grid.web.CombinedHandler.execute(CombinedHandler.java:55)\n\tat org.openqa.selenium.grid.web.RoutableHttpClientFactory$1.execute(RoutableHttpClientFactory.java:71)\n\tat org.openqa.selenium.grid.web.ReverseProxyHandler.execute(ReverseProxyHandler.java:90)\n\tat org.openqa.selenium.grid.router.HandleSession.execute(HandleSession.java:181)\n\t... 30 more\n"
  }
} 
2024-10-07 05:37:09 DEBUG Selenium [:command] -> GET session/c000c2996cd5f723ea858b5e9a58281b/screenshot 
2024-10-07 05:37:09 DEBUG Selenium [:header]    <<<  {"content-length"=>["252"], "cache-control"=>["no-cache"], "content-type"=>["application/json; charset=utf-8"]} 
2024-10-07 05:37:09 DEBUG Selenium [:command] <- {
  "value": "\u002fsession\u002fc000c2996cd5f723ea858b5e9a58281b\u002fscreenshot",
  "message": "Unable to route (GET) \u002fsession\u002fc000c2996cd5f723ea858b5e9a58281b\u002fscreenshot",
  "error": "org.openqa.selenium.UnsupportedCommandException"
} 
2024-10-07 05:37:09 DEBUG Selenium [:command] -> DELETE session/c000c2996cd5f723ea858b5e9a58281b/cookie 
2024-10-07 05:37:09 DEBUG Selenium [:header]    <<<  {"content-length"=>["247"], "cache-control"=>["no-cache"], "content-type"=>["application/json; charset=utf-8"]} 
2024-10-07 05:37:09 DEBUG Selenium [:command] <- {
  "value": "\u002fsession\u002fc000c2996cd5f723ea858b5e9a58281b\u002fcookie",
  "message": "Unable to route (DELETE) \u002fsession\u002fc000c2996cd5f723ea858b5e9a58281b\u002fcookie",
  "error": "org.openqa.selenium.UnsupportedCommandException"
} 
2024-10-07 05:42:12 DEBUG Selenium [:command] -> DELETE session/c000c2996cd5f723ea858b5e9a58281b 
2024-10-07 05:42:12 DEBUG Selenium [:header]    <<<  {"content-length"=>["223"], "cache-control"=>["no-cache"], "content-type"=>["application/json; charset=utf-8"]} 
2024-10-07 05:42:12 DEBUG Selenium [:command] <- {
  "value": "\u002fsession\u002fc000c2996cd5f723ea858b5e9a58281b",
  "message": "Unable to route (DELETE) \u002fsession\u002fc000c2996cd5f723ea858b5e9a58281b",
  "error": "org.openqa.selenium.UnsupportedCommandException"
} 

When running Chromium locally on the test container, we see the following error:

 1.1) Failure/Error: page.driver.resize_window_to(page.driver.current_window_handle, 1920, 1080)   
                                                                                                       
          Selenium::WebDriver::Error::WebDriverError:                                                  
            unable to connect to /root/.cache/selenium/chromedriver/linux64/129.0.6668.89/chromedriver 
127.0.0.1:9515                                     
          # ./spec/support/selenium.rb:31:in `block (2 levels) in <top (required)>'                    
          # ./spec/rails_helper.rb:135:in `block (3 levels) in <top (required)>'                       
          # ./spec/rails_helper.rb:134:in `block (2 levels) in <top (required)>'                       
          # ./spec/support/good_job.rb:7:in `block (2 levels) in <top (required)>'                     
          # ./spec/support/database_cleaner.rb:19:in `block (2 levels) in <top (required)>'            
          # ./spec/spec_helper.rb:177:in `block (4 levels) in <top (required)>'                        
          # ./app/lib/tiny_otel.rb:86:in `in_span'
          # ./spec/spec_helper.rb:176:in `block (3 levels) in <top (required)>'
          # ./spec/spec_helper.rb:163:in `block (2 levels) in <top (required)>'  

Operating System

Ubuntu 22.04 LTS

Selenium version

Ruby 3.3.5

What are the browser(s) and version(s) where you see this issue?

Chrome 124, Chrome 117, Firefox 130.

What are the browser driver(s) and version(s) where you see this issue?

ChromeDriver 124.0.6367.78 (and others; see above for attempts to resolve the issue).

Are you using Selenium Grid?

No.

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-rbRuby BindingsI-defectSomething is not working as intended

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions