
Kubernetes Cluster Monitoring

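Supported since Ops Platform 1.3.0

How this differs from "Deploying the Ops Platform on Kubernetes"

This document covers using the Ops Platform to monitor a Kubernetes cluster (collecting node, container, and Pod metrics). To deploy the Ops Platform itself onto Kubernetes, see "Kubernetes-based deployment" (the kubernetes page in this directory).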

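The Ops Platform can collect a Kubernetes cluster's node resources, container resources (cAdvisor), and cluster object state (kube-state-metrics). The frontend "Resources → K8s" panel presents the node, Pod, Deployment, and container monitoring views in one place.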

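Choose one of the two methods according to how the Ops Platform itself is deployed. Both provide complete data.

| | Method 1 · Outside the cluster (static) | Method 2 · Inside the cluster (incluster) |
|---|---|---|
| When to use | The Ops Platform is deployed with Docker Compose, separate from the monitored cluster (most common for private deployments) | The entire Ops Platform is deployed inside the monitored cluster |
| Collection method | ops-prometheus scrapes the NodePorts and kubelet remotely with a token | ServiceAccount auto-discovery, scraped directly inside the cluster |
| Number of steps | About 5 | About 2 |
| Token required | Yes | No |
| Network requirements | ops can reach ports 30080/30081/10250 on the nodes | In-cluster connectivity is sufficient |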

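The k8s/xxx.yaml files in the steps below are the collection-component manifests shipped with the Ops Platform offline package (under plugin/k8s-monitor/k8s/). Their full contents are in "Appendix: manifest listing" at the end of this document.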

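Method 1: Outside the cluster (static)

The Ops Platform is deployed with Docker Compose outside the monitored cluster and scrapes it remotely. The standard configuration already includes container-level cAdvisor metrics.

Prerequisites

The ops container must be able to reach ports 30080, 30081, and 10250 on every node of the monitored cluster.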
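
Reachability can be confirmed before going further. The sketch below is one way to do it, assuming `nc` (netcat) with `-z` and `-w` support is available wherever it runs, ideally inside the ops container. The node IPs in the usage line are placeholders, and `NC` is a hypothetical override that exists only to allow a dry run. Ports 30080 and 30081 answer only once the collection components from step 1 have been applied, so a FAIL on those two before deployment is expected.

```shell
#!/bin/sh
# Probe the three ports named in the prerequisite on every node.
# Assumption: netcat is installed; set NC to another command for a dry run.
NC="${NC:-nc}"
PORTS="30080 30081 10250"

# check_nodes IP [IP...]
# Prints "OK <ip>:<port>" or "FAIL <ip>:<port>" for each combination and
# returns non-zero if at least one probe failed.
check_nodes() {
  rc=0
  for ip in "$@"; do
    for port in $PORTS; do
      if $NC -z -w 3 "$ip" "$port" >/dev/null 2>&1; then
        echo "OK   $ip:$port"
      else
        echo "FAIL $ip:$port"
        rc=1
      fi
    done
  done
  return $rc
}

# Usage (placeholder addresses):
#   check_nodes 192.168.1.10 192.168.1.11
```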

  1. Deploy the collection components in the monitored Kubernetes cluster:

    kubectl apply \
    -f k8s/00-namespace.yaml \
    -f k8s/10-kube-state-metrics.yaml \
    -f k8s/20-node-exporter.yaml \
    -f k8s/15-metrics-reader.yaml
  2. Get the addresses and the token:

    • kube-state-metrics: <node-IP>:30080
    • node_exporter: <node-IP>:30081
    • kubelet: <node-IP>:10250 (built into Kubernetes, nothing to deploy)

    kubectl -n mdis-monitoring get secret mdis-metrics-reader-token \
    -o jsonpath='{.data.token}' | base64 -d
  3. Set the ops-prometheus environment variables (see the "Environment variables" page in this directory for the full reference):

    ENV_K8S_MONITOR_MODE: "static"
    ENV_PROMETHEUS_K8S_KSM: "k8s/192.168.1.10:30080"
    ENV_PROMETHEUS_K8S_NODE: "n1/192.168.1.10:30081,n2/192.168.1.11:30081"
    ENV_PROMETHEUS_K8S_KUBELET: "n1/192.168.1.10:10250,n2/192.168.1.11:10250"
    ENV_K8S_BEARER_TOKEN: "<token obtained in the previous step>"

    With multiple nodes, NODE and KUBELET take comma-separated alias/IP:port entries, one entry per node.

  4. Recreate the containers so the configuration takes effect:

    docker compose -f ops.yaml down && docker compose -f ops.yaml up -d
  5. Verify: on the Prometheus /targets page, the four jobs kube-state-metrics, k8s-node, k8s-kubelet, and k8s-cadvisor are all UP, and the frontend "Resources → K8s" panel shows data.
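
If one of the four jobs stays DOWN, requesting the endpoint by hand from the ops host usually shows whether the cause is the network, the token, or the exporter. A minimal sketch, assuming `curl` is installed. The `/metrics` and `/metrics/cadvisor` paths are the conventional ones for these components and are assumed, not confirmed, to match what ops-prometheus scrapes. `CURL` is a hypothetical override for dry runs. The `-k` flag skips TLS verification because kubelet serving certificates are commonly self-signed.

```shell
#!/bin/sh
# Manual probes for the endpoints configured in step 3.
# Assumption: curl is installed; set CURL to another command for a dry run.
CURL="${CURL:-curl}"

# probe_plain HOST:PORT
# Plain-HTTP scrape, for the kube-state-metrics (30080) and node_exporter (30081) NodePorts.
probe_plain() {
  $CURL -s --max-time 5 "http://$1/metrics"
}

# probe_kubelet HOST:PORT TOKEN
# HTTPS scrape of the kubelet's cAdvisor endpoint, authenticated with the bearer token.
probe_kubelet() {
  $CURL -sk --max-time 5 -H "Authorization: Bearer $2" "https://$1/metrics/cadvisor"
}

# Usage (example addresses from step 3; TOKEN holds the value printed in step 2):
#   probe_plain 192.168.1.10:30080 | head
#   probe_plain 192.168.1.10:30081 | head
#   probe_kubelet 192.168.1.10:10250 "$TOKEN" | head
```

An `Unauthorized` or `Forbidden` body from the kubelet probe typically points at the token or its RBAC binding, while a timeout points at the network path.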

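When only a cluster overview is needed

If you only need cluster object state plus node resources and not per-container metrics, you can omit 15-metrics-reader.yaml along with ENV_PROMETHEUS_K8S_KUBELET / ENV_K8S_BEARER_TOKEN. However, most panels on the K8s dashboard show utilization and depend on container metrics, so keeping them is recommended.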
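
With more than a handful of nodes, the NODE and KUBELET values from step 3 are easier to generate than to type. The sketch below is an illustration only: `build_targets` is a hypothetical helper, and using the Kubernetes node name as the alias is an assumption (the examples in step 3 use short aliases such as n1).

```shell
#!/bin/sh
# build_targets PORT
# stdin:  one "alias ip" pair per line
# stdout: "alias/ip:PORT,alias/ip:PORT,..." in the format used by
#         ENV_PROMETHEUS_K8S_NODE and ENV_PROMETHEUS_K8S_KUBELET
build_targets() {
  awk -v port="$1" '
    NF >= 2 { printf "%s%s/%s:%s", sep, $1, $2, port; sep = "," }
    END     { print "" }
  '
}

# Usage sketch (run where kubectl can reach the monitored cluster):
#   nodes=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}')
#   printf '%s\n' "$nodes" | build_targets 30081   # value for ENV_PROMETHEUS_K8S_NODE
#   printf '%s\n' "$nodes" | build_targets 10250   # value for ENV_PROMETHEUS_K8S_KUBELET
```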

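Method 2: Inside the cluster (incluster)

When the entire Ops Platform is deployed inside the monitored cluster, ops-prometheus runs as a Pod and discovers its targets automatically, with no addresses and no token to configure.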

  1. Deploy kube-state-metrics and the in-cluster Prometheus:

    kubectl apply \
    -f k8s/00-namespace.yaml \
    -f k8s/10-kube-state-metrics.yaml \
    -f k8s/30-prometheus-incluster.yaml
  2. Done. ServiceAccount + apiserver proxy automatically discover kubelet, cAdvisor, kube-state-metrics, and pods. There is no need to expose the kubelet port, and no kubeconfig is required.

  3. Verify:

    kubectl -n mdis-monitoring get pods
    curl <node-IP>:30090/prometheus/server/api/v1/targets

    k8s-kubelet / k8s-cadvisor / k8s-kube-state-metrics should all be UP.
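
The targets response is a single JSON document, which is hard to read by eye. One rough way to summarize it is sketched below, assuming the standard Prometheus `/api/v1/targets` payload in which each active target carries a `"health"` field. The text match is deliberately crude, and a JSON-aware tool such as `jq` would be the more robust choice where available.

```shell
#!/bin/sh
# health_summary: reads a /api/v1/targets JSON body on stdin and prints
# one "<count> <state>" line per health state found (for example up, down).
health_summary() {
  grep -o '"health":"[a-z]*"' \
    | sed 's/^"health":"//; s/"$//' \
    | sort | uniq -c | awk '{ print $1, $2 }'
}

# Usage (address and path as in the verification step above):
#   curl -s <node-IP>:30090/prometheus/server/api/v1/targets | health_summary
```

Any line other than `up` in the output means at least one target needs a closer look on the /targets page.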

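Collection component list

| File | Purpose | Method 1 | Method 2 |
|---|---|---|---|
| 00-namespace.yaml | Namespace mdis-monitoring | Yes | Yes |
| 10-kube-state-metrics.yaml | kube-state-metrics + RBAC + Service (NodePort 30080) | Yes | Yes |
| 15-metrics-reader.yaml | Long-lived token for scraping cAdvisor | Yes | No |
| 20-node-exporter.yaml | node_exporter DaemonSet + Service (NodePort 30081) | Yes | Optional |
| 30-prometheus-incluster.yaml | In-cluster ops-prometheus + RBAC + Service (NodePort 30090) | No | Yes |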

Appendix: manifest listing

00-namespace.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: mdis-monitoring
  labels:
    app.kubernetes.io/part-of: mdis-ops

10-kube-state-metrics.yaml

# kube-state-metrics: collects k8s object state (Deployment/Pod/PVC/Node…)
# Mode ① static: the customer applies this file and passes the Service's NodePort (30080) address to ops-prometheus via ENV_PROMETHEUS_K8S_KSM
# Mode ② incluster: ops-prometheus discovers this Service automatically via endpoints SD
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: mdis-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mdis-kube-state-metrics
rules:
  - apiGroups: [""]
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - serviceaccounts
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "daemonsets", "deployments", "replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["cronjobs", "jobs"]
    verbs: ["list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["list", "watch"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses", "networkpolicies"]
    verbs: ["list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "volumeattachments"]
    verbs: ["list", "watch"]
  - apiGroups: ["certificates.k8s.io"]
    resources: ["certificatesigningrequests"]
    verbs: ["list", "watch"]
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["mutatingwebhookconfigurations", "validatingwebhookconfigurations"]
    verbs: ["list", "watch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mdis-kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mdis-kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: mdis-monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-ksm:1.3.0
          ports:
            - name: http-metrics
              containerPort: 8080
            - name: telemetry
              containerPort: 8081
          readinessProbe:
            httpGet:
              path: /
              port: 8081
            initialDelaySeconds: 5
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: kube-state-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: kube-state-metrics
  ports:
    - name: http-metrics
      port: 8080
      targetPort: 8080
      nodePort: 30080

15-metrics-reader.yaml

# Needed only in static mode: gives the ops-prometheus outside the cluster a long-lived token with nodes/proxy permission,
# used for token-authenticated scraping of kubelet/cadvisor (container-level metrics).
# After applying, fetch the token:
#   kubectl -n mdis-monitoring get secret mdis-metrics-reader-token \
#     -o jsonpath='{.data.token}' | base64 -d
# Then put it into ops-prometheus's ENV_K8S_BEARER_TOKEN and set ENV_PROMETHEUS_K8S_KUBELET=alias/<node-IP>:10250
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mdis-metrics-reader
  namespace: mdis-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mdis-metrics-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "nodes/metrics"]
    verbs: ["get", "list"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mdis-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mdis-metrics-reader
subjects:
  - kind: ServiceAccount
    name: mdis-metrics-reader
    namespace: mdis-monitoring
---
# Long-lived token (since k8s 1.24 a ServiceAccount no longer gets an auto-generated token Secret, so one is declared explicitly here)
apiVersion: v1
kind: Secret
metadata:
  name: mdis-metrics-reader-token
  namespace: mdis-monitoring
  annotations:
    kubernetes.io/service-account.name: mdis-metrics-reader
type: kubernetes.io/service-account-token

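After 15-metrics-reader.yaml has been applied, the permissions behind the token can be checked with kubectl alone. A sketch, assuming the operator running it is allowed to impersonate ServiceAccounts (the `--as` flag). `sa_can` is a hypothetical helper and `KUBECTL` a hypothetical override for dry runs.

```shell
#!/bin/sh
# Check what the mdis-metrics-reader ServiceAccount is allowed to do.
# Assumption: kubectl points at the monitored cluster and may impersonate ServiceAccounts.
KUBECTL="${KUBECTL:-kubectl}"
NS="mdis-monitoring"
SA="mdis-metrics-reader"

# sa_can VERB TARGET [extra kubectl flags...]
# Asks the API server whether the ServiceAccount may perform VERB on TARGET.
sa_can() {
  verb="$1"; target="$2"; shift 2
  $KUBECTL auth can-i "$verb" "$target" "$@" --as="system:serviceaccount:$NS:$SA"
}

# Usage: each call prints "yes" or "no".
#   sa_can get nodes --subresource=proxy
#   sa_can get nodes --subresource=metrics
#   sa_can get /metrics/cadvisor
```

A "no" on any of the three suggests the ClusterRole or ClusterRoleBinding from this file was not applied as listed.
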
20-node-exporter.yaml

# node_exporter DaemonSet: collects host resource metrics on every node, reusing the ops-nodeagent image (listens on :59100)
# Mainly serves mode ① static: pass the Service's NodePort (30081) address to ops-prometheus via ENV_PROMETHEUS_K8S_NODE
# In mode ② incluster, node/container metrics are already covered by kubelet + cadvisor, so this file is optional
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: node-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      tolerations:
        - operator: Exists # cover control-plane/tainted nodes
      containers:
        - name: node-exporter
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.2.0
          ports:
            - name: metrics
              containerPort: 59100
              hostPort: 59100
          securityContext:
            privileged: true
          volumeMounts:
            - name: rootfs
              mountPath: /host
              readOnly: true
          resources:
            requests:
              cpu: 10m
              memory: 32Mi
            limits:
              cpu: 200m
              memory: 128Mi
      volumes:
        - name: rootfs
          hostPath:
            path: /
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: node-exporter
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: node-exporter
  ports:
    - name: metrics
      port: 59100
      targetPort: 59100
      nodePort: 30081

30-prometheus-incluster.yaml

# Mode ② incluster: ops-prometheus runs as a Pod inside the monitored k8s cluster,
# using a ServiceAccount + kubernetes_sd_configs to discover and scrape kubelet/cadvisor/kube-state-metrics/pods automatically.
# Only incluster mode needs this file; mode ① static does not use it.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ops-prometheus
  namespace: mdis-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mdis-ops-prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mdis-ops-prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mdis-ops-prometheus
subjects:
  - kind: ServiceAccount
    name: ops-prometheus
    namespace: mdis-monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-prometheus
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: ops-prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ops-prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ops-prometheus
    spec:
      serviceAccountName: ops-prometheus
      containers:
        - name: ops-prometheus
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.3.0
          env:
            - name: ENV_K8S_MONITOR_MODE
              value: "incluster"
            - name: ENV_K8S_KSM_NAMESPACE
              value: "mdis-monitoring"
            - name: ENV_PROMETHEUS_RETENTION
              value: "15d"
          ports:
            - name: web
              containerPort: 9090
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ops-prometheus
  namespace: mdis-monitoring
  labels:
    app.kubernetes.io/name: ops-prometheus
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: ops-prometheus
  ports:
    - name: web
      port: 9090
      targetPort: 9090
      nodePort: 30090