Ambari 3.0.0: fixing a DataNode rolling restart that runs on only one host

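I. Symptom: after a rolling update, only one host is processed

Start with the screenshots. Once the rolling update is triggered, the page shows that only one host was updated.

[Screenshot: image-20260216201906287]

After the rolling update:

[Screenshot: image-20260216202038809]

Symptom keywords

  • The rolling restart / rolling update is triggered successfully
  • Execution lands on only one host (for example, only the DATANODE on dev1.test.com is restarted)
  • The rest of the job chain exits immediately and never moves on to the next batch of hosts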

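1. Quick diagnosis: the agent dispatch is fine, the scheduling chain is broken

The log shows that the command was still sent to the agent (AgentCommandsPublisher.sendCommands), so this is not a case of the command never being dispatched.

The fatal part is that the scheduler exits after the BatchRequest execution path throws an exception, so the rolling steps for the remaining hosts never get a chance to run.

II. Server log: ExecutionScheduleManager throws ClassCastException

The core error, trimmed to the key section: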

2026-02-16 20:19:49,706 INFO  [ambari-client-thread-45] o.a.a.s.state.cluster.ClusterImpl:558 - Adding a new request schedule, clusterName = abc, id = 53, description = null
2026-02-16 20:19:49,772 INFO  [ambari-client-thread-119] o.a.a.s.c.AmbariManagementControllerImpl:4152 - Received action execution request, clusterName=abc, request=isCommand :true, action :null, command :RESTART, inputs :{HAS_RESOURCE_FILTERS=true}, resourceFilters: [RequestResourceFilter{serviceName='HDFS', componentName='DATANODE', hostNames=[dev1.test.com]}], exclusive: false, clusterName :abc
2026-02-16 20:19:49,813 INFO  [ambari-client-thread-119] o.a.a.server.stageplanner.RoleGraph:175 - Detecting cycle graphs
2026-02-16 20:19:49,814 INFO  [ambari-client-thread-119] o.a.a.server.stageplanner.RoleGraph:176 - Graph:
(DATANODE, RESTART, 0)

2026-02-16 20:19:49,843 ERROR [ExecutionScheduler_Worker-2] o.a.a.s.s.AbstractLinearExecutionJob:93 - Exception caught on execution of job LinearExecutionJobs.BatchRequestJob-53-1. Exiting linear chain...
org.apache.ambari.server.AmbariException: Exception occurred while performing request
        at org.apache.ambari.server.scheduler.ExecutionScheduleManager.executeBatchRequest(ExecutionScheduleManager.java:683)
        at org.apache.ambari.server.state.scheduler.BatchRequestJob.doWork(BatchRequestJob.java:82)
        at org.apache.ambari.server.scheduler.AbstractLinearExecutionJob.execute(AbstractLinearExecutionJob.java:91)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.ClassCastException: class org.glassfish.jersey.client.internal.HttpUrlConnector$1 cannot be cast to class java.lang.String (org.glassfish.jersey.client.internal.HttpUrlConnector$1 is in unnamed module of loader 'app'; java.lang.String is in module java.base of loader 'bootstrap')
        at org.apache.ambari.server.scheduler.ExecutionScheduleManager.convertToBatchRequestResponse(ExecutionScheduleManager.java:740)
        at org.apache.ambari.server.scheduler.ExecutionScheduleManager.performApiRequest(ExecutionScheduleManager.java:942)
        at org.apache.ambari.server.scheduler.ExecutionScheduleManager.executeBatchRequest(ExecutionScheduleManager.java:671)
        ... 4 common frames omitted
2026-02-16 20:19:49,855 INFO  [ExecutionScheduler_Worker-2] org.quartz.core.JobRunShell:207 - Job LinearExecutionJobs.BatchRequestJob-53-1 threw a JobExecutionException: 
org.quartz.JobExecutionException: org.apache.ambari.server.AmbariException: Exception occurred while performing request
        at org.apache.ambari.server.scheduler.AbstractLinearExecutionJob.execute(AbstractLinearExecutionJob.java:97)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: org.apache.ambari.server.AmbariException: Exception occurred while performing request
        at org.apache.ambari.server.scheduler.ExecutionScheduleManager.executeBatchRequest(ExecutionScheduleManager.java:683)
        at org.apache.ambari.server.state.scheduler.BatchRequestJob.doWork(BatchRequestJob.java:82)
        at org.apache.ambari.server.scheduler.AbstractLinearExecutionJob.execute(AbstractLinearExecutionJob.java:91)
        ... 2 common frames omitted
Caused by: java.lang.ClassCastException: class org.glassfish.jersey.client.internal.HttpUrlConnector$1 cannot be cast to class java.lang.String (org.glassfish.jersey.client.internal.HttpUrlConnector$1 is in unnamed module of loader 'app'; java.lang.String is in module java.base of loader 'bootstrap')
        at org.apache.ambari.server.scheduler.ExecutionScheduleManager.convertToBatchRequestResponse(ExecutionScheduleManager.java:740)
        at org.apache.ambari.server.scheduler.ExecutionScheduleManager.performApiRequest(ExecutionScheduleManager.java:942)
        at org.apache.ambari.server.scheduler.ExecutionScheduleManager.executeBatchRequest(ExecutionScheduleManager.java:671)
        ... 4 common frames omitted
2026-02-16 20:19:49,892 INFO  [agent-command-publisher-0] o.a.a.s.e.p.AgentCommandsPublisher:173 - AgentCommandsPublisher.sendCommands: sending ExecutionCommand for host dev1.test.com, role DATANODE, roleCommand CUSTOM_COMMAND, and command ID 142-0, task ID 1103
2026-02-16 20:19:50,092 INFO  [agent-message-monitor-0] o.a.a.server.events.MessageEmitter:218 - Schedule execution command emitting, retry: 0, messageId: 1
2026-02-16 20:19:50,094 WARN  [agent-message-retry-0] o.a.a.server.events.MessageEmitter:255 - Reschedule execution command emitting, retry: 1, messageId: 1
2026-02-16 20:19:58,414 WARN  [ambari-client-thread-45] o.glassfish.jersey.internal.Errors:168 - The following warnings have been detected: WARNING: A HTTP GET method, public javax.ws.rs.core.Response org.apache.ambari.server.api.services.TaskService.getTask(java.lang.String,javax.ws.rs.core.HttpHeaders,javax.ws.rs.core.UriInfo,java.lang.String), should not consume any entity.
WARNING: A HTTP GET method, public javax.ws.rs.core.Response org.apache.ambari.server.api.services.TaskService.getComponents(java.lang.String,javax.ws.rs.core.HttpHeaders,javax.ws.rs.core.UriInfo), should not consume any entity.

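Conclusion first
The real reason the rolling restart runs on only one host:
the linear scheduling chain (LinearExecutionJob) throws an exception while executing BatchRequestJob → it exits the linear chain immediately → the remaining rolling steps never run.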
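
The behaviour described above can be illustrated with a small standalone sketch. It is not Ambari code: the class name, the `runChain` method, and the hosts dev2.test.com and dev3.test.com are hypothetical, and the failure is simulated. It only models the order of events seen in the log, where the restart request for the first host is accepted and the exception raised afterwards ends the chain.

```java
import java.util.ArrayList;
import java.util.List;

public class LinearChainSketch {

    /**
     * Walks the hosts one batch at a time. Each batch first submits its restart
     * request and then converts the response; a conversion failure ends the chain.
     * Returns the hosts whose restart request was actually submitted.
     */
    public static List<String> runChain(List<String> hosts, boolean conversionFails) {
        List<String> submitted = new ArrayList<>();
        for (String host : hosts) {
            try {
                // The request itself is accepted, so this host does get restarted.
                submitted.add(host);
                if (conversionFails) {
                    // Stand-in for the failure inside the response conversion step.
                    throw new ClassCastException("simulated response conversion failure");
                }
            } catch (RuntimeException e) {
                System.out.println("Exception on batch for " + host + ". Exiting linear chain...");
                break; // later batches are never triggered
            }
        }
        return submitted;
    }

    public static void main(String[] args) {
        // dev2 and dev3 are made-up hosts for illustration only.
        List<String> hosts = List.of("dev1.test.com", "dev2.test.com", "dev3.test.com");
        System.out.println("with failure:    " + runChain(hosts, true));
        System.out.println("without failure: " + runChain(hosts, false));
    }
}
```

With the simulated failure, only the first host ends up in the submitted list, which matches the "only one host" symptom; without it, the loop reaches every host.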

1. Locating the exception

Dimension            Detail
Trigger thread       ExecutionScheduler_Worker-*
Job type             LinearExecutionJobs.BatchRequestJob-<scheduleId>-<seq>
Entry method         ExecutionScheduleManager.executeBatchRequest
Root-cause method    ExecutionScheduleManager.convertToBatchRequestResponse
Exception type       ClassCastException (HttpUrlConnector$1 -> String)

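Why a ClassCastException?
Under Jersey 2.x, the object returned by Response.getEntity() is not guaranteed to be a String.
If the code casts the entity directly to String, some connector/stream scenarios hand back an internal type such as HttpUrlConnector$1, and the cast fails on the spot.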
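
A minimal sketch of the failure mode and of one defensive way around it, using only the JDK. It assumes the entity behaves like an InputStream (a `ByteArrayInputStream` stands in for Jersey's internal type here); `EntityToStringSketch`, `entityToString`, and the JSON payload are hypothetical and are not the actual Ambari patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class EntityToStringSketch {

    /** Converts an entity of unknown runtime type to text without a blind cast. */
    public static String entityToString(Object entity) {
        if (entity == null) {
            return null;
        }
        if (entity instanceof String) {
            return (String) entity;
        }
        if (entity instanceof InputStream) {
            // Drain the stream and decode it; UTF-8 is an assumption of this sketch.
            try (InputStream in = (InputStream) entity) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return String.valueOf(entity);
    }

    public static void main(String[] args) {
        // Hypothetical payload; a stream-typed entity stands in for HttpUrlConnector$1.
        Object entity = new ByteArrayInputStream(
                "{\"example\":true}".getBytes(StandardCharsets.UTF_8));
        try {
            String text = (String) entity; // the blind cast: fails for a stream-typed entity
            System.out.println(text);
        } catch (ClassCastException e) {
            System.out.println("blind cast failed: " + e.getMessage());
        }
        System.out.println("defensive conversion: " + entityToString(entity));
    }
}
```

In code that uses the JAX-RS 2.x client API directly, `Response.readEntity(String.class)` is the standard way to obtain the body as text; whether that or a type check like the one above suits ExecutionScheduleManager depends on how the response object is obtained there.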

III. The fix: replace ExecutionScheduleManager.convertToBatchRequestResponse

Files modified in this change:


For the fix itself, see
22213: source-level fix


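IV. Build and replace: the shortest path to production

1. Build command (skipping RAT / Checkstyle)

mvn -DskipTests -Drat.skip=true -Dcheckstyle.skip=true package

Build note
If your local environment enforces strict checks such as checkstyle or rat, skipping them saves a lot of time; it is more efficient to come back and satisfy them once the fix has been verified.

2. Replace the ambari-server artifacts under /usr/lib/ambari-server

After the build finishes, copy the generated ambari-server jars over the ones on the target machine:

[Screenshot: image-20260217230006961]

Replacement tips (production habits)

  • Back up first: cp -a xxx.jar xxx.jar.bak.$(date +%F_%T)
  • Restart ambari-server after the replacement, then verify
  • Remove the backups only after verification passes, so that rolling back stays cheap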

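V. Verification: the rolling restart works again on retry

After the replacement, a retry of the rolling update succeeded:

[Screenshot: image-20260217225907686]

The log also shows the behaviour is back to normal:

[Screenshot: image-20260217230244500]

Verification checklist (tick the items off as you go)

  • 1) After triggering a rolling restart, does it keep advancing to the next host?
  • 2) Does ExecutionScheduler_Worker-* still log ClassCastException?
  • 3) Does LinearExecutionJobs.BatchRequestJob-* still log "Exiting linear chain..."?
  • 4) With debug enabled, confirm that the raw Ambari API response is printed (useful for future troubleshooting)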
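
Items 2) and 3) can be checked mechanically. The sketch below is one possible way to do it, not part of Ambari: `SchedulerLogCheck` and `findFailures` are hypothetical names, and the default path /var/log/ambari-server/ambari-server.log is an assumption about the installation, so pass the real log path as the first argument if yours differs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SchedulerLogCheck {

    // The two markers taken from the error log quoted earlier in this article.
    private static final Pattern FAILURE =
            Pattern.compile("ClassCastException|Exiting linear chain");

    /** Returns the log lines that contain either failure marker. */
    public static List<String> findFailures(List<String> lines) {
        return lines.stream()
                .filter(line -> FAILURE.matcher(line).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; override with the first command-line argument.
        Path log = Paths.get(args.length > 0
                ? args[0]
                : "/var/log/ambari-server/ambari-server.log");
        List<String> lines;
        try (Stream<String> stream = Files.lines(log)) {
            lines = stream.collect(Collectors.toList());
        }
        List<String> hits = findFailures(lines);
        hits.forEach(System.out::println);
        System.out.println(hits.isEmpty()
                ? "no scheduler failures found"
                : hits.size() + " suspicious line(s)");
    }
}
```

Entries written before the jar was replaced will still match, so compare the timestamps of any hits against the time ambari-server was restarted.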