Kubernetes Cluster Monitoring
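This document covers using the ops platform to monitor a Kubernetes cluster (collecting node, container, and Pod metrics). To deploy the ops platform itself onto Kubernetes, see "Deploying on Kubernetes" (the kubernetes page in this directory).
The ops platform can collect node resources, container resources (cAdvisor), and cluster object state (kube-state-metrics) from a Kubernetes cluster, and presents monitoring views for nodes, Pods, Deployments, and containers together in the frontend "Resources → K8s" panel.
Choose one of the two methods below according to how the ops platform itself is deployed. Both provide the complete data set: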
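| | Method 1 · Out-of-cluster (static) | Method 2 · In-cluster (incluster) |
|---|---|---|
| When to use | The ops platform is deployed with Docker Compose, separate from the monitored cluster (the most common private deployment) | The entire ops platform is deployed inside the monitored cluster |
| Collection method | ops-prometheus scrapes the NodePorts and kubelet remotely with a token | Automatic discovery through a ServiceAccount, scraped directly inside the cluster |
| Number of steps | About 5 | About 2 |
| Token required | Yes | No |
| Network requirement | ops can reach ports 30080/30081/10250 on the nodes | In-cluster connectivity is enough |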
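In the steps below, k8s/xxx.yaml refers to the collector manifests shipped with the ops platform offline package (under plugin/k8s-monitor/k8s/). Their full contents are listed at the end of this document under "Appendix: manifest list".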
Method 1: Out-of-cluster (static)
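The ops platform is deployed with Docker Compose outside the monitored cluster and scrapes it remotely. The standard configuration already includes container-level cAdvisor metrics.
The ops container must be able to reach ports 30080, 30081, and 10250 on every node of the monitored cluster.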
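This reachability can be checked with a plain TCP connect before looking at Prometheus itself. The sketch below is one way to do it, assuming an `nc` binary that supports `-z` and `-w` (an assumption about your environment); the function names are illustrative and not part of the ops platform, and only IPv4 addresses or hostnames are handled. Ports 30080 and 30081 only answer once the collector components from the first step below are deployed; 10250 is served by kubelet itself.

```shell
# Expand node addresses into the host:port pairs static mode must reach:
# 30080 (kube-state-metrics), 30081 (node_exporter), 10250 (kubelet).
required_endpoints() {
  for host in "$@"; do
    for port in 30080 30081 10250; do
      printf '%s:%s\n' "$host" "$port"
    done
  done
}

# Try a TCP connect to every endpoint and report one result per line.
# Assumes an `nc` that supports -z (connect only) and -w (timeout in seconds).
check_endpoints() {
  required_endpoints "$@" | while IFS=: read -r host port; do
    if nc -z -w 3 "$host" "$port" </dev/null >/dev/null 2>&1; then
      printf 'OK   %s:%s\n' "$host" "$port"
    else
      printf 'FAIL %s:%s\n' "$host" "$port"
    fi
  done
}

# Usage (run where ops-prometheus runs, ideally inside the ops container):
#   check_endpoints 192.168.1.10 192.168.1.11
```

A successful connect from the Docker host does not strictly prove that the container can connect too, so running the check from inside the ops container gives the more reliable answer.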
- Deploy the collector components in the monitored Kubernetes cluster:

  kubectl apply \
    -f k8s/00-namespace.yaml \
    -f k8s/10-kube-state-metrics.yaml \
    -f k8s/20-node-exporter.yaml \
    -f k8s/15-metrics-reader.yaml

- Get the addresses and the token:

  - kube-state-metrics: <node IP>:30080
  - node_exporter: <node IP>:30081
  - kubelet: <node IP>:10250 (built into Kubernetes, nothing to deploy)

  kubectl -n mdis-monitoring get secret mdis-metrics-reader-token \
    -o jsonpath='{.data.token}' | base64 -d

- Set the environment variables on ops-prometheus (see the "Environment Variables" page in this directory for the full description):

  ENV_K8S_MONITOR_MODE: "static"
  ENV_PROMETHEUS_K8S_KSM: "k8s/192.168.1.10:30080"
  ENV_PROMETHEUS_K8S_NODE: "n1/192.168.1.10:30081,n2/192.168.1.11:30081"
  ENV_PROMETHEUS_K8S_KUBELET: "n1/192.168.1.10:10250,n2/192.168.1.11:10250"
  ENV_K8S_BEARER_TOKEN: "<token from the previous step>"

  With multiple nodes, NODE and KUBELET take comma-separated alias/IP:port entries, one entry per node.

- Recreate the containers so that the configuration takes effect:

  docker compose -f ops.yaml down && docker compose -f ops.yaml up -d

- Verify: on the Prometheus /targets page, the four jobs kube-state-metrics, k8s-node, k8s-kubelet, and k8s-cadvisor are all UP, and the frontend "Resources → K8s" panel shows data.
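With more than a couple of nodes, the comma-separated ENV_PROMETHEUS_K8S_NODE and ENV_PROMETHEUS_K8S_KUBELET values are easy to mistype. As a small sketch, the helper below (a hypothetical function, not shipped with the ops platform) joins alias/IP pairs with a port into the alias/IP:port format described in the configuration step.

```shell
# build_targets PORT ALIAS/IP [ALIAS/IP...]
# Prints the entries joined as alias/ip:port,alias/ip:port
build_targets() {
  port=$1
  shift
  out=""
  for node in "$@"; do
    # ${out:+$out,} adds the separating comma only after the first entry.
    out="${out:+$out,}$node:$port"
  done
  printf '%s\n' "$out"
}

build_targets 30081 n1/192.168.1.10 n2/192.168.1.11
# prints: n1/192.168.1.10:30081,n2/192.168.1.11:30081
```

Calling it once with 30081 and once with 10250 over the same node list keeps the NODE and KUBELET values in sync.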
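If you only need cluster object state and node resources and not per-container metrics, you can omit 15-metrics-reader.yaml together with ENV_PROMETHEUS_K8S_KUBELET and ENV_K8S_BEARER_TOKEN. However, most panels on the K8s dashboard are utilization-based and depend on container metrics, so keeping them is recommended.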
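When the kubelet and token settings are kept, the k8s-kubelet and k8s-cadvisor jobs depend on each kubelet accepting the bearer token. A manual probe can help separate network, authentication, and RBAC problems. This is a sketch under assumptions: kubelet serves /metrics/cadvisor over HTTPS on 10250 (standard kubelet behavior), its certificate is not verified (`-k`), and the function names are illustrative rather than part of the ops platform.

```shell
# Request cAdvisor metrics from one kubelet with the bearer token and print
# only the HTTP status code (curl prints 000 when no HTTP response arrived).
# -k skips TLS verification because kubelet certificates are often self-signed.
probe_kubelet() {
  node_ip=$1
  token=$2
  curl -sk -o /dev/null -m 5 -w '%{http_code}' \
    -H "Authorization: Bearer $token" \
    "https://$node_ip:10250/metrics/cadvisor"
}

# Map the status code to its most likely cause.
explain_status() {
  case $1 in
    200) echo "ok: token accepted, metrics readable" ;;
    401) echo "unauthorized: token rejected, or kubelet token authentication is disabled" ;;
    403) echo "forbidden: token valid, but RBAC does not allow reading node metrics" ;;
    000) echo "no response: port 10250 unreachable or TLS handshake failed" ;;
    *)   echo "unexpected HTTP status $1" ;;
  esac
}

# Usage:
#   TOKEN=$(kubectl -n mdis-monitoring get secret mdis-metrics-reader-token \
#     -o jsonpath='{.data.token}' | base64 -d)
#   explain_status "$(probe_kubelet 192.168.1.10 "$TOKEN")"
```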
Method 2: In-cluster (incluster)
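When the entire ops platform is deployed inside the monitored cluster, ops-prometheus runs as a Pod and discovers its targets automatically, with no addresses and no token to configure.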
- Deploy kube-state-metrics and the in-cluster Prometheus:

  kubectl apply \
    -f k8s/00-namespace.yaml \
    -f k8s/10-kube-state-metrics.yaml \
    -f k8s/30-prometheus-incluster.yaml

- Done. Using its ServiceAccount and the apiserver proxy, ops-prometheus automatically discovers kubelet, cAdvisor, kube-state-metrics, and pods. There is no need to expose the kubelet port or to provide a kubeconfig.

- Verify:

  kubectl -n mdis-monitoring get pods
  curl <node IP>:30090/prometheus/server/api/v1/targets

  The k8s-kubelet, k8s-cadvisor, and k8s-kube-state-metrics jobs should all be UP.
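The targets response is JSON, so this check can be scripted instead of read by eye. The sketch below assumes `jq` is installed (an assumption; it is not mentioned elsewhere in this document) and relies on the standard Prometheus /api/v1/targets response shape, in which each entry of data.activeTargets carries labels.job and health. The function name is illustrative.

```shell
# jobs_not_up JOB [JOB...] < targets.json
# Reads a Prometheus /api/v1/targets response on stdin and prints every
# listed job that either has no active target or has a target that is not "up".
jobs_not_up() {
  jq -r --arg jobs "$*" '
    [.data.activeTargets[] | {job: .labels.job, up: (.health == "up")}] as $t
    | ($jobs | split(" "))[]
    | . as $j
    | select([$t[] | select(.job == $j) | .up] | (length == 0) or (all | not))
  '
}

# Usage (empty output means every listed job is fully UP):
#   curl -s <node IP>:30090/prometheus/server/api/v1/targets \
#     | jobs_not_up k8s-kubelet k8s-cadvisor k8s-kube-state-metrics
```

Treating a job with zero targets as a failure is a deliberate choice here: a job that discovered nothing would otherwise pass silently.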
Collector component list
| File | Purpose | Method 1 | Method 2 |
|---|---|---|---|
| 00-namespace.yaml | Namespace mdis-monitoring | ✅ | ✅ |
| 10-kube-state-metrics.yaml | kube-state-metrics + RBAC + Service (NodePort 30080) | ✅ | ✅ |
| 15-metrics-reader.yaml | Long-lived token for scraping cAdvisor | ✅ | - |
| 20-node-exporter.yaml | node_exporter DaemonSet + Service (NodePort 30081) | ✅ | Optional |
| 30-prometheus-incluster.yaml | In-cluster ops-prometheus + RBAC + Service (NodePort 30090) | - | ✅ |
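The table maps directly onto the two `kubectl apply` invocations used in this document. As a sketch, a wrapper like the following (a hypothetical helper, not part of the offline package) prints the command for a given mode so it can be reviewed before being run. The optional 20-node-exporter.yaml is left out for incluster.

```shell
# apply_command MODE [DIR]
# Prints the kubectl apply command for MODE (static or incluster).
# DIR is the manifest directory and defaults to k8s.
apply_command() {
  mode=$1
  dir=${2:-k8s}
  case $mode in
    static)    files="00-namespace 10-kube-state-metrics 20-node-exporter 15-metrics-reader" ;;
    incluster) files="00-namespace 10-kube-state-metrics 30-prometheus-incluster" ;;
    *)         echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  cmd="kubectl apply"
  for f in $files; do
    cmd="$cmd -f $dir/$f.yaml"
  done
  printf '%s\n' "$cmd"
}

# Usage:
#   apply_command static
#   apply_command incluster plugin/k8s-monitor/k8s
```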
Appendix: manifest list
00-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mdis-monitoring
  labels:
    app.kubernetes.io/part-of: mdis-ops
10-kube-state-metrics.yaml
# kube-state-metrics: collects k8s object state (Deployment/Pod/PVC/Node...)
# Mode ① static: the customer applies this file and passes the Service's NodePort (30080) address to ops-prometheus via ENV_PROMETHEUS_K8S_KSM
# Mode ② incluster: ops-prometheus discovers this Service automatically through endpoints SD
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: mdis-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mdis-kube-state-metrics
rules:
  - apiGroups: [""]
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - serviceaccounts
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "daemonsets", "deployments", "replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["cronjobs", "jobs"]
    verbs: ["list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["list", "watch"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses", "networkpolicies"]
    verbs: ["list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "volumeattachments"]
    verbs: ["list", "watch"]
  - apiGroups: ["certificates.k8s.io"]
    resources: ["certificatesigningrequests"]
    verbs: ["list", "watch"]
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["mutatingwebhookconfigurations", "validatingwebhookconfigurations"]
    verbs: ["list", "watch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mdis-kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mdis-kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: mdis-monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-ksm:1.3.0
          ports:
            - name: http-metrics
              containerPort: 8080
            - name: telemetry
              containerPort: 8081
          readinessProbe:
            httpGet:
              path: /
              port: 8081
            initialDelaySeconds: 5
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: kube-state-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: kube-state-metrics
  ports:
    - name: http-metrics
      port: 8080
      targetPort: 8080
      nodePort: 30080
15-metrics-reader.yaml
# Needed in static mode only: gives the out-of-cluster ops-prometheus a long-lived token with nodes/proxy permission,
# used for token-authenticated scraping of kubelet/cadvisor (container-level metrics).
# After applying, fetch the token:
#   kubectl -n mdis-monitoring get secret mdis-metrics-reader-token \
#     -o jsonpath='{.data.token}' | base64 -d
# Then put it into ENV_K8S_BEARER_TOKEN on ops-prometheus, and set ENV_PROMETHEUS_K8S_KUBELET=alias/nodeIP:10250
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mdis-metrics-reader
  namespace: mdis-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mdis-metrics-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "nodes/metrics"]
    verbs: ["get", "list"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mdis-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mdis-metrics-reader
subjects:
  - kind: ServiceAccount
    name: mdis-metrics-reader
    namespace: mdis-monitoring
---
# Long-lived token (since k8s 1.24 a token Secret is no longer generated automatically for a ServiceAccount, so one is declared explicitly here)
apiVersion: v1
kind: Secret
metadata:
  name: mdis-metrics-reader-token
  namespace: mdis-monitoring
  annotations:
    kubernetes.io/service-account.name: mdis-metrics-reader
type: kubernetes.io/service-account-token
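After applying this manifest, `kubectl auth can-i` with impersonation is one way to confirm that the binding grants what the scrape needs. The sketch below assumes your own kubectl user is allowed to impersonate ServiceAccounts (cluster-admin is); the function names are illustrative.

```shell
# ServiceAccounts authenticate as system:serviceaccount:<namespace>:<name>.
sa_user() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Ask the API server whether the metrics-reader ServiceAccount may use the
# subresources and non-resource URL granted by the ClusterRole above.
# Each kubectl call prints "yes" or "no".
check_metrics_reader_rbac() {
  user=$(sa_user mdis-monitoring mdis-metrics-reader)
  kubectl auth can-i get nodes --subresource=proxy   --as="$user"
  kubectl auth can-i get nodes --subresource=metrics --as="$user"
  kubectl auth can-i get /metrics/cadvisor           --as="$user"
}

# Usage:
#   check_metrics_reader_rbac
```

Three "yes" answers indicate that the RBAC side is in place; a "no" points at the ClusterRole or ClusterRoleBinding rather than at the token.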
20-node-exporter.yaml
# node_exporter DaemonSet: collects host resource metrics from every node, reusing the ops-nodeagent image (listens on :59100)
# Mainly serves mode ① static: the Service's NodePort (30081) address is passed to ops-prometheus via ENV_PROMETHEUS_K8S_NODE
# In mode ② incluster, node/container metrics are already covered by kubelet + cadvisor, so this file is optional
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: node-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      tolerations:
        - operator: Exists   # also schedule onto control-plane/tainted nodes
      containers:
        - name: node-exporter
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.2.0
          ports:
            - name: metrics
              containerPort: 59100
              hostPort: 59100
          securityContext:
            privileged: true
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
          resources:
            requests:
              cpu: 10m
              memory: 32Mi
            limits:
              cpu: 200m
              memory: 128Mi
      volumes:
        - name: rootfs
          hostPath:
            path: /
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: node-exporter
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: node-exporter
  ports:
    - name: metrics
      port: 59100
      targetPort: 59100
      nodePort: 30081
30-prometheus-incluster.yaml
# Mode ② incluster: ops-prometheus runs as a Pod inside the monitored k8s cluster and
# uses its ServiceAccount + kubernetes_sd_configs to discover and scrape kubelet/cadvisor/kube-state-metrics/pods.
# This file is needed in incluster mode only; mode ① static does not use it.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ops-prometheus
  namespace: mdis-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mdis-ops-prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mdis-ops-prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mdis-ops-prometheus
subjects:
  - kind: ServiceAccount
    name: ops-prometheus
    namespace: mdis-monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-prometheus
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: ops-prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ops-prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ops-prometheus
    spec:
      serviceAccountName: ops-prometheus
      containers:
        - name: ops-prometheus
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.3.0
          env:
            - name: ENV_K8S_MONITOR_MODE
              value: "incluster"
            - name: ENV_K8S_KSM_NAMESPACE
              value: "mdis-monitoring"
            - name: ENV_PROMETHEUS_RETENTION
              value: "15d"
          ports:
            - name: web
              containerPort: 9090
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ops-prometheus
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: ops-prometheus
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: ops-prometheus
  ports:
    - name: web
      port: 9090
      targetPort: 9090
      nodePort: 30090