How to monitor NTP service status when node_ntp metrics cannot be collected

This content is licensed under CC 4.0 BY-SA.


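The previous article described how to obtain NTP monitoring metrics for systemd services.
This article covers node-exporter deployed as a DaemonSet: how to collect NTP status data when the node_ntp metrics cannot be collected.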

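Node Exporter NTP Metrics Troubleshooting Guide

Problem description

The flag --collector.systemd.unit-include=^(kubelet|docker|ntpd)\.service$ was added to the node-exporter configuration, but no node_ntp_* metrics are exposed.

Possible causes and solutions

1. Service name mismatch ⚠️ Most common cause

The NTP service name differs between Linux distributions:

CentOS/RHEL 7: ntpd.service
CentOS/RHEL 8+: chronyd.service
Ubuntu/Debian: ntp.service or systemd-timesyncd.service
Some systems: chrony.service

Solution: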

# Check the actual NTP service name on the node
systemctl list-units --type=service | grep -i ntp
systemctl list-units --type=service | grep -i chrony
systemctl list-units --type=service | grep -i time

# Or search all systemd services, including inactive ones
systemctl list-units --type=service --all | grep -E "(ntp|chrony|time)"
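The broad grep for "time" above can also match unrelated units. As a minimal sketch, the helper below narrows the listing to the service names mentioned in this article. The function name time_sync_units is hypothetical, and it assumes the input was produced with the --plain --no-legend options so that the unit name is the first column.

```shell
# time_sync_units: reads the output of
#   systemctl list-units --type=service --all --plain --no-legend
# on stdin and prints only the well-known time synchronization units.
# (Hypothetical helper; the name list comes from the distributions above.)
time_sync_units() {
  awk '{print $1}' |
    grep -E '^(ntp|ntpd|chrony|chronyd|systemd-timesyncd)\.service$'
}
```

On a node it could be used as: systemctl list-units --type=service --all --plain --no-legend | time_sync_units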

Example configuration changes:

# If the service is named chronyd
args:
  - --path.rootfs=/host
  - --collector.systemd
  - --collector.systemd.unit-include=^(kubelet|docker|chronyd)\.service$

# If the service is named ntp
args:
  - --path.rootfs=/host
  - --collector.systemd
  - --collector.systemd.unit-include=^(kubelet|docker|ntp)\.service$

# Or include several possible service names
args:
  - --path.rootfs=/host
  - --collector.systemd
  - --collector.systemd.unit-include=^(kubelet|docker|ntpd|ntp|chronyd|chrony)\.service$

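2. Container lacks access to systemd

The node-exporter container needs access to the host's systemd to collect these metrics.

Checklist:

# The DaemonSet needs the following mounts and permissions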
spec:
  template:
    spec:
      containers:
      - name: node-exporter-arm64
        volumeMounts:
        - name: systemd
          mountPath: /host/run/systemd
          readOnly: true
        - name: rootfs
          mountPath: /host
          readOnly: true
      volumes:
      - name: systemd
        hostPath:
          path: /run/systemd
      - name: rootfs
        hostPath:
          path: /

If the Docker socket is used (optional, for docker metrics):

volumeMounts:
- name: docker-socket
  mountPath: /var/run/docker.sock
volumes:
- name: docker-socket
  hostPath:
    path: /var/run/docker.sock

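3. systemd collector not enabled correctly

Make sure the --collector.systemd flag is present and spelled correctly.

Check the configuration:

args:
  - --path.rootfs=/host
  - --collector.systemd                    # must be enabled
  - --collector.systemd.unit-include=...   # then list the units to include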
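Whether a collector is enabled and working can also be read from the /metrics output itself, because node-exporter publishes one node_scrape_collector_success series per enabled collector. The sketch below classifies a collector from that text; the function name collector_status and its three result words are illustrative assumptions.

```shell
# collector_status NAME: reads /metrics text on stdin and prints
#   ok       when node_scrape_collector_success{collector="NAME"} is 1
#   failing  when the series exists with another value
#   absent   when the series does not exist (collector not enabled)
# (Hypothetical helper for illustration.)
collector_status() {
  awk -v c="$1" '
    index($0, "node_scrape_collector_success{collector=\"" c "\"}") == 1 {
      found = 1
      print ($2 == 1 ? "ok" : "failing")
    }
    END { if (!found) print "absent" }
  '
}
```

For example: curl -s http://${POD_IP}:9100/metrics | collector_status systemd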

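4. Service not running

Even with a correct configuration, no metrics are produced if the NTP service is not running.

Check the service status:

# Run on the node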
systemctl status ntpd
systemctl status chronyd
systemctl status ntp
systemctl status systemd-timesyncd

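5. Regular expression syntax

Make sure the regular expression is syntactically correct.

Correct format:

# Use ^ and $ to require a full match
--collector.systemd.unit-include=^(kubelet|docker|ntpd)\.service$

# Note: the dot in \.service$ must be escaped

Incorrect examples:

# ❌ Wrong: the dot is not escaped
--collector.systemd.unit-include=^(kubelet|docker|ntpd).service$

# ❌ Wrong: start/end anchors are missing
--collector.systemd.unit-include=kubelet|docker|ntpd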
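The effect of the escaped dot can be tried locally before touching the DaemonSet. This is only a sketch: it uses grep -E (POSIX extended regular expressions) as an approximation of the Go regular expression syntax that node-exporter uses, and the helper name unit_matches and the test unit names are made up for the demonstration.

```shell
# unit_matches PATTERN NAME: succeeds when NAME matches PATTERN (as an ERE).
# (Hypothetical helper; grep -E only approximates node-exporter's matching.)
unit_matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

good='^(kubelet|docker|ntpd)\.service$'   # dot escaped
loose='^(kubelet|docker|ntpd).service$'   # dot matches any character

# Print a small comparison table for a few candidate unit names.
for unit in ntpd.service ntpdXservice ntpd.service.d; do
  for pattern in "$good" "$loose"; do
    if unit_matches "$pattern" "$unit"; then verdict="match"; else verdict="no match"; fi
    printf '%-16s %-36s %s\n' "$unit" "$pattern" "$verdict"
  done
done
```

Only the unescaped pattern accepts the made-up name ntpdXservice, which is the over-matching that the escape prevents.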

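Troubleshooting steps

Step 1: Check the service name

# Run on a Kubernetes node. The debug container mounts the node's root filesystem
# at /host, so chroot into it before calling systemctl.
kubectl debug node/<node-name> -it --image=busybox -- chroot /host sh
# Or SSH to the node directly
systemctl list-units --type=service | grep -E "(ntp|chrony|time)"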

Step 2: Check the node-exporter logs

kubectl logs -n kube-system daemonset/node-exporter-arm64 | grep -i systemd
kubectl logs -n kube-system daemonset/node-exporter-arm64 | grep -i error

Step 3: Check the metrics endpoint

# Get the Pod IP
POD_IP=$(kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].status.podIP}')

# Check for systemd-related metrics
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://${POD_IP}:9100/metrics | grep -i systemd

# Check for NTP-related metrics
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://${POD_IP}:9100/metrics | grep -i ntp

Step 4: Verify that the configuration took effect

# Check the DaemonSet configuration
kubectl get ds node-exporter-arm64 -n kube-system -o yaml | grep -A 10 args

# Check the arguments the Pod is actually running with
kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].spec.containers[0].args}'

Step 5: Test systemd access

# Open a shell in the container
kubectl exec -n kube-system -it <node-exporter-pod-name> -- sh

# Inside the container, check the mounts
ls -la /host/run/systemd
ls -la /host/etc/systemd/system/ | grep -i ntp
ls -la /host/run/systemd
ls -la /host/etc/systemd/system/ | grep -i ntp

Complete configuration example

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter-arm64
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: node-exporter-arm64
  template:
    metadata:
      labels:
        app: prometheus
        name: node-exporter-arm64
    spec:
      containers:
      - name: node-exporter-arm64
        image: *****/node-exporter-linux-arm64:latest
        imagePullPolicy: IfNotPresent
        args:
          - --path.rootfs=/host
          - --collector.systemd
          - --collector.systemd.unit-include=^(kubelet|docker|ntpd|chronyd|ntp)\.service$
        volumeMounts:
        - name: systemd
          mountPath: /host/run/systemd
          readOnly: true
        - name: rootfs
          mountPath: /host
          readOnly: true
        - name: docker-socket
          mountPath: /var/run/docker.sock
          readOnly: true
      volumes:
      - name: systemd
        hostPath:
          path: /run/systemd
      - name: rootfs
        hostPath:
          path: /
      - name: docker-socket
        hostPath:
          path: /var/run/docker.sock
      hostNetwork: true
      hostPID: true

Common NTP service names

Distribution	Service name	Check command
CentOS/RHEL 7	`ntpd.service`	`systemctl status ntpd`
CentOS/RHEL 8+	`chronyd.service`	`systemctl status chronyd`
Ubuntu 16.04	`ntp.service`	`systemctl status ntp`
Ubuntu 18.04+	`systemd-timesyncd.service`	`systemctl status systemd-timesyncd`
Debian	`ntp.service` or `chrony.service`	`systemctl status ntp`

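Verifying the metrics

After updating the configuration, wait for the Pods to restart, then verify:

# Check the metrics
curl -s http://<node-exporter-pod-ip>:9100/metrics | grep -E 'node_ntp|node_systemd_unit_state.*ntp'

# With only the systemd collector enabled, expect unit state series such as:
# node_systemd_unit_state{name="ntpd.service",state="active",...} 1
# The node_ntp_* series come from the separate ntp collector, which is
# disabled by default and is enabled with --collector.ntp:
# node_ntp_leap
# node_ntp_offset_seconds
# node_ntp_reference_timestamp_seconds
# node_ntp_root_delay_seconds
# node_ntp_root_dispersion_seconds
# node_ntp_rtt_seconds
# node_ntp_sanity
# node_ntp_stratum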
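When a grep on the raw endpoint returns many labelled lines, it can help to reduce the output to distinct metric names first. A small sketch, where the function name metric_families is an assumption and the input is the plain text served at /metrics:

```shell
# metric_families PREFIX: reads /metrics text on stdin and prints the
# distinct metric names that start with PREFIX, without labels or values.
# (Hypothetical helper for illustration.)
metric_families() {
  grep -v '^#' |          # drop HELP/TYPE comment lines
    sed 's/[{ ].*$//' |   # strip labels and sample values
    grep -- "^$1" |
    sort -u
}
```

For example: curl -s http://<node-exporter-pod-ip>:9100/metrics | metric_families node_systemd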

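Notes

DaemonSet rollout after a change: with the default RollingUpdate update strategy, changing the Pod template replaces the Pods automatically. Only with the OnDelete update strategy do the old Pods stay in place until they are deleted manually and recreated by the DaemonSet.
Different nodes may use different service names: if the cluster nodes run different Linux distributions, a broader regular expression may be needed.
The systemd collector setup relies on the rootfs mount: --path.rootfs=/host is required because paths inside the container differ from those on the host.
Permissions: some systems may need additional securityContext settings.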

Quick troubleshooting steps

Check the actual service name (run on the node):

systemctl list-units --type=service | grep -E "(ntp|chrony|time)"

Check the node-exporter logs:

kubectl logs -n kube-system daemonset/node-exporter-arm64 | grep -i systemd

Verify that the metrics are exposed:

POD_IP=$(kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].status.podIP}')
curl -s http://${POD_IP}:9100/metrics | grep -i ntp

Suggested fix
If the cluster nodes use different NTP services, change the configuration to include several possible service names:

args:
  - --path.rootfs=/host
  - --collector.systemd
  - --collector.systemd.unit-include=^(kubelet|docker|ntpd|ntp|chronyd|chrony|systemd-timesyncd)\.service$
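Rather than typing the alternation by hand, the flag value can be assembled from a list of unit names collected on the nodes. This is a sketch under the assumption that every input name ends in .service; the function name build_unit_include is hypothetical.

```shell
# build_unit_include: reads unit names (one per line, each ending in
# .service) on stdin and prints a regular expression in the same
# ^(a|b|c)\.service$ form used in this article.
# (Hypothetical helper for illustration.)
build_unit_include() {
  names=$(sed 's/\.service$//' | sort -u | paste -sd '|' -)
  printf '^(%s)\\.service$\n' "$names"
}
```

For example, printf '%s\n' kubelet.service docker.service chronyd.service | build_unit_include prints the value to place after --collector.systemd.unit-include=.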

The actual service name turned out to be ntpd.service, yet the metrics are still not exposed

Node Exporter NTP Metrics In-Depth Troubleshooting (service name confirmed)

Current state

✅ Service name confirmed: ntpd.service
✅ Service is running: active (running)
❌ Metrics not exposed: no node_ntp_* metrics

Troubleshooting steps

1. Check the full DaemonSet configuration

# Save the full DaemonSet configuration
kubectl get ds node-exporter-arm64 -n kube-system -o yaml > /tmp/ds-config.yaml

# Check the key settings
kubectl get ds node-exporter-arm64 -n kube-system -o yaml | grep -A 20 "volumeMounts:"
kubectl get ds node-exporter-arm64 -n kube-system -o yaml | grep -A 10 "volumes:"
kubectl get ds node-exporter-arm64 -n kube-system -o yaml | grep -A 5 "args:"

Settings that must be checked:

volumeMounts must include:

volumeMounts:
- name: systemd
  mountPath: /host/run/systemd
  readOnly: true
- name: rootfs
  mountPath: /host
  readOnly: true

volumes must include:

volumes:
- name: systemd
  hostPath:
    path: /run/systemd
- name: rootfs
  hostPath:
    path: /

args must include:

args:
- --path.rootfs=/host
- --collector.systemd
- --collector.systemd.unit-include=^(kubelet|docker|ntpd)\.service$
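The args check can be automated by comparing the running container's arguments against the flags that must be present. In this sketch, the function name missing_flags and the required_flags variable are hypothetical, and the unit-include value is left out because it differs per cluster.

```shell
# Flags that must appear verbatim, taken from the list above.
required_flags='--path.rootfs=/host
--collector.systemd'

# missing_flags: reads the container arguments on stdin (one per line)
# and prints every required flag that is not present.
# (Hypothetical helper for illustration.)
missing_flags() {
  args=$(cat)
  printf '%s\n' "$required_flags" | while IFS= read -r flag; do
    printf '%s\n' "$args" | grep -qxF -- "$flag" || printf '%s\n' "$flag"
  done
}
```

The arguments can be fed in one per line with: kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{range .items[0].spec.containers[0].args[*]}{@}{"\n"}{end}' | missing_flags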

2. Check the node-exporter logs

# View the most recent log lines
kubectl logs -n kube-system -l name=node-exporter-arm64 --tail=100

# Look for systemd-related errors
# (with a label selector kubectl shows only the last 10 lines by default, hence --tail=-1)
kubectl logs -n kube-system -l name=node-exporter-arm64 --tail=-1 | grep -i systemd
kubectl logs -n kube-system -l name=node-exporter-arm64 --tail=-1 | grep -i error
kubectl logs -n kube-system -l name=node-exporter-arm64 --tail=-1 | grep -i "collector"

# Check for permission errors
kubectl logs -n kube-system -l name=node-exporter-arm64 --tail=-1 | grep -i "permission\|denied\|access"

3. Check systemd access inside the container

# Get the Pod name
POD_NAME=$(kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].metadata.name}')

# Inspect the mounts inside the container
kubectl exec -n kube-system "$POD_NAME" -- ls -la /host/run/systemd
kubectl exec -n kube-system "$POD_NAME" -- ls -la /host/etc/systemd/system/ | grep ntpd

# Check whether the systemd socket is present
kubectl exec -n kube-system "$POD_NAME" -- test -S /host/run/systemd/private && echo "Socket exists" || echo "Socket NOT found"

4. Check the metrics endpoint

# Get the Pod IP
POD_IP=$(kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].status.podIP}')

# Check all systemd-related metrics
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://${POD_IP}:9100/metrics | grep -i systemd

# Check for any NTP-related metrics
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://${POD_IP}:9100/metrics | grep -i ntp

# Check whether the systemd collector is producing metrics
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://${POD_IP}:9100/metrics | grep node_systemd

# List all available node_ metrics and filter for systemd
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -s http://${POD_IP}:9100/metrics | grep "^node_" | grep -i systemd

5. Check the arguments the container is actually running with

# Pretty-print the args of the running Pod
kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].spec.containers[0].args}' | jq .

# Or print them as-is
kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].spec.containers[0].args}'

6. Test the systemd collector

# Test whether systemd information is readable inside the container
kubectl exec -n kube-system "$POD_NAME" -- sh -c "ls -la /host/run/systemd/system/ | head -20"

# Check the systemd private socket
kubectl exec -n kube-system "$POD_NAME" -- test -r /host/run/systemd/private && echo "Readable" || echo "NOT readable"

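Common problems and solutions

Problem 1: Missing systemd mount

Symptom: /host/run/systemd does not exist or is empty inside the container

Solution: add the volumeMounts and volumes (see the complete configuration example)

Problem 2: Insufficient permissions

Symptom: "permission denied" or "access denied" appears in the logs

Solution: a securityContext may need to be added:

securityContext:
  runAsUser: 0
  runAsGroup: 0

Problem 3: systemd socket path

Symptom: the systemd socket path is incorrect

Solution: on some systems the host's /run/systemd may need to be mounted at /run/systemd rather than /host/run/systemd, still combined with --path.rootfs=/host

Problem 4: Clock unsynchronized warning

The captured output includes: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

This does not prevent metrics from being exposed, but it can affect their values. If the metrics are missing entirely, this is not the main cause.

Problem 5: Paths inside the container

Check: when node-exporter runs with --path.rootfs=/host, host paths are reached under the /host prefix

Verify:

kubectl exec -n kube-system $POD_NAME -- ls /host/run/systemd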

Complete configuration example (fixed version)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter-arm64
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: node-exporter-arm64
  template:
    metadata:
      labels:
        app: prometheus
        name: node-exporter-arm64
    spec:
      containers:
      - name: node-exporter-arm64
        image: reg.xb-deeplearning.cn/dlaas/node-exporter-linux-arm64:latest
        imagePullPolicy: IfNotPresent
        args:
          - --path.rootfs=/host
          - --collector.systemd
          - --collector.systemd.unit-include=^(kubelet|docker|ntpd)\.service$
        ports:
        - name: metrics
          containerPort: 9100
          protocol: TCP
        volumeMounts:
        - name: systemd
          mountPath: /host/run/systemd
          readOnly: true
        - name: rootfs
          mountPath: /host
          readOnly: true
        resources:
          requests:
            cpu: 100m
            memory: 180Mi
          limits:
            cpu: 200m
            memory: 180Mi
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          runAsNonRoot: false
      volumes:
      - name: systemd
        hostPath:
          path: /run/systemd
      - name: rootfs
        hostPath:
          path: /
      hostNetwork: true
      hostPID: true
      tolerations:
      - effect: NoSchedule
        operator: Exists

Verification steps

After updating the configuration:

Delete the old Pods to force them to be recreated:

kubectl delete pod -n kube-system -l name=node-exporter-arm64

Wait for the Pods to start:

kubectl get pod -n kube-system -l name=node-exporter-arm64 -w

Verify the metrics:

POD_IP=$(kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].status.podIP}')
curl -s http://${POD_IP}:9100/metrics | grep node_ntp

If the ntp collector is enabled (--collector.ntp), the output should include:

node_ntp_leap
node_ntp_offset_seconds
node_ntp_reference_timestamp_seconds
node_ntp_root_delay_seconds
node_ntp_root_dispersion_seconds
node_ntp_rtt_seconds
node_ntp_sanity
node_ntp_stratum

Debug script

A quick check script:

#!/bin/bash
# Collect the name and IP of the first node-exporter Pod
POD_NAME=$(kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].metadata.name}')
POD_IP=$(kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].status.podIP}')

echo "=== Pod arguments ==="
kubectl get pod -n kube-system "$POD_NAME" -o jsonpath='{.spec.containers[0].args}' | jq .

echo -e "\n=== Mount point ==="
kubectl exec -n kube-system "$POD_NAME" -- ls -la /host/run/systemd 2>&1 | head -5

echo -e "\n=== systemd socket ==="
kubectl exec -n kube-system "$POD_NAME" -- test -S /host/run/systemd/private && echo "✅ Socket exists" || echo "❌ Socket NOT found"

echo -e "\n=== systemd lines in the log ==="
kubectl logs -n kube-system "$POD_NAME" | grep -i systemd | tail -5

echo -e "\n=== Metrics ==="
curl -s "http://${POD_IP}:9100/metrics" | grep -E "(node_ntp|node_systemd)" | head -10

kubectl logs -n kube-system -l name=node-exporter-arm64 --tail=20
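node-exporter does have a separate ntp collector, but it is disabled by default. The node_ntp_* metrics come from that collector (enabled with --collector.ntp), not from the systemd collector. The systemd collector only reports the state of ntpd.service, as node_systemd_unit_state series. Seeing collector=systemd in the log means the systemd collector is enabled, so the NTP service state can be monitored through those series even when node_ntp_* is absent.

# 1. Check whether any systemd metrics exist
POD_IP=$(kubectl get pod -n kube-system -l name=node-exporter-arm64 -o jsonpath='{.items[0].status.podIP}')
curl -s http://${POD_IP}:9100/metrics | grep "^node_systemd" | head -10

# 2. Check whether systemd metrics exist for ntpd.service
curl -s http://${POD_IP}:9100/metrics | grep "ntpd.service"

# 3. Check whether node_ntp_* metrics exist (these require --collector.ntp)
curl -s http://${POD_IP}:9100/metrics | grep "^node_ntp"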
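For monitoring the service state from the systemd metrics, the value of interest is the node_systemd_unit_state series whose state label is "active". The following sketch extracts it from the /metrics text; the function name unit_active_value is hypothetical, and it assumes the series carries name and state labels as in the node-exporter output.

```shell
# unit_active_value UNIT: reads /metrics text on stdin and prints the value
# of node_systemd_unit_state{name="UNIT",state="active",...}
# (1 = active, 0 = not active), or "missing" when no such series exists.
# (Hypothetical helper for illustration.)
unit_active_value() {
  awk -v u="$1" '
    index($0, "node_systemd_unit_state{") == 1 &&
    index($0, "name=\"" u "\"") > 0 &&
    index($0, "state=\"active\"") > 0 { print $NF; found = 1 }
    END { if (!found) print "missing" }
  '
}
```

For example: curl -s http://${POD_IP}:9100/metrics | unit_active_value ntpd.service. A result of "missing" suggests that the unit is not matched by the unit-include pattern or that the collector cannot reach systemd.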