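Deployment on Kubernetes (Standard Cluster Mode and Above)

Overview

This document describes how to deploy the ops platform in a multi-node architecture on a Kubernetes cluster. It covers Node Exporter deployment, image preparation, node label configuration, service definitions, and access setup. It applies to the Standard Cluster edition and above (Standard Cluster included), where components run with multiple replicas.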

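Deploying Node Exporter

When the ops platform is deployed inside a Kubernetes cluster, the ops-nodeagent container ships with a built-in Node Exporter service. It runs as a DaemonSet on every node of the cluster and automatically collects monitoring data from all cluster nodes.

To monitor servers outside the Kubernetes cluster, deploy Node Exporter separately on each of those servers.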
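The ops.yaml shown later in this document lists Node Exporter targets in ENV_PROMETHEUS_HOST as comma-separated name/host:port entries. As a minimal sketch, the helper below assembles such a value for a list of servers. The function name build_node_targets and the svc_NN naming are illustrative, and the port is a parameter: 59100 is the port ops-nodeagent uses in this document, while a stock Node Exporter listens on 9100 by default, so check what your external servers actually expose.

```shell
#!/bin/sh
# Build an ENV_PROMETHEUS_HOST value ("name/host:port,...") from a list of IPs.
# Usage: build_node_targets PORT IP [IP...]
# (hypothetical helper, not part of the ops platform)
build_node_targets() {
  port="$1"; shift
  i=1
  out=""
  for ip in "$@"; do
    entry="$(printf 'svc_%02d/%s:%s' "$i" "$ip" "$port")"
    if [ -z "$out" ]; then out="$entry"; else out="$out,$entry"; fi
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Example with the three addresses used in this document
build_node_targets 59100 192.168.1.5 192.168.1.6 192.168.1.7
# prints: svc_01/192.168.1.5:59100,svc_02/192.168.1.6:59100,svc_03/192.168.1.7:59100
```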

Deploying the Ops Platform

Step 1: Pull the images (offline package download)

crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-gateway:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.0.0

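Notes

  • The ops-gateway, ops-prometheus, and ops-agent images only need to be pulled on the node where the ops platform is deployed
  • The ops-nodeagent image must be pulled on every node in the Kubernetes cluster, because it runs as a DaemonSet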
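The split described above can be sketched as a small dry-run helper that prints the pull commands a node needs for its role. The role name "platform" and both function names are assumptions made for this example. The image names and tags are the ones listed in Step 1.

```shell
#!/bin/sh
# Print (dry run) the crictl pull commands a node needs, based on its role.
# Role "platform" = node that runs the ops platform; any other value = ordinary node.
REGISTRY="registry.cn-hangzhou.aliyuncs.com/mdpublic"

images_for_role() {
  # Every node needs ops-nodeagent because it runs as a DaemonSet.
  echo "$REGISTRY/ops-nodeagent:1.0.0"
  if [ "$1" = "platform" ]; then
    echo "$REGISTRY/ops-gateway:1.1.0"
    echo "$REGISTRY/ops-prometheus:1.1.0"
    echo "$REGISTRY/ops-agent:1.1.0"
  fi
}

print_pull_commands() {
  images_for_role "$1" | while read -r image; do
    echo "crictl pull $image"
  done
}

print_pull_commands platform
```

To actually pull the images, the printed commands can be piped into a shell on the target node, for example `print_pull_commands platform | sh`.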

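Step 2: Create node labels and taints

Label the deployment node

By default, a nodeSelector pins the ops platform to a fixed node, and the ops-prometheus service persists its monitoring data to the local disk of that node through a hostPath volume.

Add the hap-ops=true label to the node that will run the ops platform services:

kubectl label node <node-name> hap-ops=true

Tip: replace <node-name> with the name of the node that will run the ops platform services. Run kubectl get node -o wide to list the node names in the cluster.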
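As a minimal sketch, the labeling step and a follow-up check can be wrapped in two functions. The function names and the KUBECTL override variable are assumptions of this example. Setting KUBECTL=echo only prints the arguments, which is handy for previewing a command without touching a cluster.

```shell
#!/bin/sh
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview commands.
KUBECTL="${KUBECTL:-kubectl}"

# Apply the label. --overwrite lets the call succeed even if the node
# already carries a hap-ops label with a different value.
label_ops_node() {
  "$KUBECTL" label node "$1" hap-ops=true --overwrite
}

# List the nodes that currently carry the label.
list_ops_nodes() {
  "$KUBECTL" get nodes -l hap-ops=true -o wide
}
```

A typical call would be `label_ops_node node-1 && list_ops_nodes`, where node-1 is a placeholder node name.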

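Configure a dedicated monitoring node (optional)

If you add a dedicated worker node for monitoring, taint the new node so that only ops platform components can be scheduled onto it:

kubectl taint nodes <node-name> hap-ops=true:NoSchedule

Note: if you add the taint above, you must also add a matching tolerations entry for every deployed component in ops.yaml:

# Add under spec.template.spec of each Deployment
# Uncomment the following and adjust as needed
# tolerations:
# - key: "hap-ops"
#   operator: "Equal"
#   value: "true"
#   effect: "NoSchedule"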
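Before uncommenting the tolerations, it can help to confirm which taints a node actually carries, and to know how to remove the taint again. The sketch below wraps the relevant kubectl calls. The function names and the KUBECTL override are assumptions of this example.

```shell
#!/bin/sh
KUBECTL="${KUBECTL:-kubectl}"

# Add the taint from this section to a node.
taint_ops_node() {
  "$KUBECTL" taint nodes "$1" hap-ops=true:NoSchedule
}

# Remove the same taint again (note the trailing "-").
untaint_ops_node() {
  "$KUBECTL" taint nodes "$1" hap-ops=true:NoSchedule-
}

# Print the taints currently set on a node.
show_taints() {
  "$KUBECTL" get node "$1" -o jsonpath='{.spec.taints}'
}
```

The tolerations fragment uses operator "Equal", so its key, value, and effect must match the taint exactly for the Pods to be schedulable on the tainted node.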

Step 3: Create the ops platform YAML file

Prepare the configuration directory

mkdir -p /data/mingdao/script/kubernetes/ops/
cd /data/mingdao/script/kubernetes/ops/

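Create the main configuration file ops.yaml

Create the ops.yaml file. It defines the complete set of ops platform components:

  • ops-agent-1: monitors mongodb-1, elasticsearch-1, kafka-1, and the other shared components
  • ops-agent-2: monitors mongodb-2, elasticsearch-2, kafka-2
  • ops-agent-3: monitors mongodb-3, elasticsearch-3, kafka-3
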
cat > ops.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config
  namespace: hap-ops
data:
  TZ: "Asia/Shanghai"
  ENV_OPS_TOKEN: "SS9PobGG7SDTpcyfSZ1VVmn3gCmy2P52tYk" # Must be changed on first deployment; this is the authentication key for accessing the ops platform
  ENV_PROMETHEUS_HOST: "svc_01/192.168.1.5:59100,svc_02/192.168.1.6:59100,svc_03/192.168.1.7:59100" # Replace with the actual Node Exporter service addresses
  ENV_PROMETHEUS_SERVER: "http://ops-prometheus:9090"
  ENV_PROMETHEUS_GRAFANA: "http://ops-prometheus:3000"
  ENV_PROMETHEUS_ALERT: "http://ops-prometheus:9093"
  ENV_PROMETHEUS_KARMA: "http://ops-prometheus:8080"
  ENV_PROMETHEUS_KAFKA: "kafka_1/ops-agent-1:9308,kafka_2/ops-agent-2:9308,kafka_3/ops-agent-3:9308"
  ENV_PROMETHEUS_ELASTICSEARCH: "elasticsearch_1/ops-agent-1:9114,elasticsearch_2/ops-agent-2:9114,elasticsearch_3/ops-agent-3:9114"
  ENV_PROMETHEUS_REDIS: "redis_1/ops-agent-1:9121"
  ENV_PROMETHEUS_MONGODB: "mongodb_1/ops-agent-1:9216,mongodb_2/ops-agent-2:9216,mongodb_3/ops-agent-3:9216"
  ENV_PROMETHEUS_MYSQL: "mysql_1/ops-agent-1:9104"
  # Connection settings for the storage components; adjust them to your environment
  ENV_MYSQL_HOST: "192.168.1.7"
  ENV_MYSQL_PORT: "3306"
  ENV_MYSQL_USERNAME: "root"
  ENV_MYSQL_PASSWORD: "changeme"
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.9:27017,192.168.1.10:27017,192.168.1.11:27017" # Used by the ops-gateway service to collect mongodb-agent metrics
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_REDIS_HOST: "192.168.1.8"
  ENV_REDIS_PORT: "6379"
  ENV_REDIS_PASSWORD: "changeme"
  ENV_FLINK_URL: "http://flink-jobmanager.flink:8081"
  ENV_PROMETHEUS_RETENTION: "30d" # Prometheus data retention period; defaults to 15d when not set

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config-agent-1 # Dedicated config for the first agent
  namespace: hap-ops
data:
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.9:27017" # Connection address of the first mongodb node
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_KAFKA_ENDPOINTS: "192.168.1.12:9092" # Connection address of the first kafka node
  ENV_ELASTICSEARCH_ENDPOINTS: "http://192.168.1.12:9200" # Connection address of the first elasticsearch node
  ENV_ELASTICSEARCH_PASSWORD: "elastic:changeme" # Username and password of the first elasticsearch node

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config-agent-2 # Dedicated config for the second agent
  namespace: hap-ops
data:
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.10:27017" # Connection address of the second mongodb node
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_KAFKA_ENDPOINTS: "192.168.1.13:9092" # Connection address of the second kafka node
  ENV_ELASTICSEARCH_ENDPOINTS: "http://192.168.1.13:9200" # Connection address of the second elasticsearch node
  ENV_ELASTICSEARCH_PASSWORD: "elastic:changeme" # Username and password of the second elasticsearch node

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config-agent-3 # Dedicated config for the third agent
  namespace: hap-ops
data:
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.11:27017" # Connection address of the third mongodb node
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_KAFKA_ENDPOINTS: "192.168.1.14:9092" # Connection address of the third kafka node
  ENV_ELASTICSEARCH_ENDPOINTS: "http://192.168.1.14:9200" # Connection address of the third elasticsearch node
  ENV_ELASTICSEARCH_PASSWORD: "elastic:changeme" # Username and password of the third elasticsearch node

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-gateway
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-gateway
  template:
    metadata:
      labels:
        app: ops-gateway
    spec:
      # Uncomment the following if node taints are configured
      # tolerations:
      # - key: "hap-ops"
      #   operator: "Equal"
      #   value: "true"
      #   effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
      - name: ops-gateway
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-gateway:1.1.0
        envFrom:
        - configMapRef:
            name: ops-config
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "0.1"
            memory: "200Mi"

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-prometheus
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-prometheus
  template:
    metadata:
      labels:
        app: ops-prometheus
    spec:
      # Uncomment the following if node taints are configured
      # tolerations:
      # - key: "hap-ops"
      #   operator: "Equal"
      #   value: "true"
      #   effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
      - name: ops-prometheus
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.1.0
        volumeMounts:
        - mountPath: /data/
          name: prometheus-data
        envFrom:
        - configMapRef:
            name: ops-config
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "0.1"
            memory: "200Mi"
      volumes:
      - name: prometheus-data
        hostPath:
          path: /data/ops-prometheus-data # Persistent storage path on the node
          type: DirectoryOrCreate # Create the directory if it does not exist

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-agent-1
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-agent-1
  template:
    metadata:
      labels:
        app: ops-agent-1
    spec:
      # Uncomment the following if node taints are configured
      # tolerations:
      # - key: "hap-ops"
      #   operator: "Equal"
      #   value: "true"
      #   effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
      - name: ops-agent-1
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
        envFrom:
        - configMapRef:
            name: ops-config
        - configMapRef:
            name: ops-config-agent-1 # Agent-specific config (overrides keys from the shared config)
        resources:
          limits:
            cpu: "1"
            memory: "2Gi"
          requests:
            cpu: "0.05"
            memory: 128Mi

---

apiVersion: v1
kind: Service
metadata:
  name: ops-agent-1
  namespace: hap-ops
spec:
  selector:
    app: ops-agent-1
  ports:
  - name: prometheus
    port: 9104
    targetPort: 9104
  - name: mongodb
    port: 9216
    targetPort: 9216
  - name: redis
    port: 9121
    targetPort: 9121
  - name: kafka
    port: 9308
    targetPort: 9308
  - name: elasticsearch
    port: 9114
    targetPort: 9114
  type: ClusterIP

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-agent-2
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-agent-2
  template:
    metadata:
      labels:
        app: ops-agent-2
    spec:
      # Uncomment the following if node taints are configured
      # tolerations:
      # - key: "hap-ops"
      #   operator: "Equal"
      #   value: "true"
      #   effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
      - name: ops-agent-2
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
        envFrom:
        - configMapRef:
            name: ops-config-agent-2 # Agent-specific config
        resources:
          limits:
            cpu: "1"
            memory: "2Gi"
          requests:
            cpu: "0.05"
            memory: 128Mi
---

apiVersion: v1
kind: Service
metadata:
  name: ops-agent-2
  namespace: hap-ops
spec:
  selector:
    app: ops-agent-2
  ports:
  - name: mongodb
    port: 9216
    targetPort: 9216
  - name: kafka
    port: 9308
    targetPort: 9308
  - name: elasticsearch
    port: 9114
    targetPort: 9114
  type: ClusterIP


---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-agent-3
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-agent-3
  template:
    metadata:
      labels:
        app: ops-agent-3
    spec:
      # Uncomment the following if node taints are configured
      # tolerations:
      # - key: "hap-ops"
      #   operator: "Equal"
      #   value: "true"
      #   effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
      - name: ops-agent-3
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
        envFrom:
        - configMapRef:
            name: ops-config-agent-3 # Agent-specific config
        resources:
          limits:
            cpu: "1"
            memory: "2Gi"
          requests:
            cpu: "0.05"
            memory: 128Mi
---

apiVersion: v1
kind: Service
metadata:
  name: ops-agent-3
  namespace: hap-ops
spec:
  selector:
    app: ops-agent-3
  ports:
  - name: mongodb
    port: 9216
    targetPort: 9216
  - name: kafka
    port: 9308
    targetPort: 9308
  - name: elasticsearch
    port: 9114
    targetPort: 9114
  type: ClusterIP
---

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ops-nodeagent
  namespace: hap-ops
spec:
  selector:
    matchLabels:
      app: ops-nodeagent
  template:
    metadata:
      labels:
        app: ops-nodeagent
    spec:
      containers:
      - name: ops-nodeagent
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.0.0
        envFrom:
        - configMapRef:
            name: ops-config
        ports:
        - containerPort: 59100
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "0.1"
            memory: "200Mi"
        volumeMounts:
        - name: host-root
          mountPath: /host
          readOnly: true
          mountPropagation: HostToContainer
      volumes:
      - name: host-root
        hostPath:
          path: /
      hostNetwork: true # Use the host network
      hostPID: true # Use the host PID namespace

---

apiVersion: v1
kind: Service
metadata:
  name: ops-prometheus
  namespace: hap-ops
spec:
  selector:
    app: ops-prometheus
  ports:
  - name: server
    port: 9090
    targetPort: 9090
  - name: grafana
    port: 3000
    targetPort: 3000
  type: ClusterIP

---

apiVersion: v1
kind: Service
metadata:
  name: ops-gateway
  namespace: hap-ops
spec:
  selector:
    app: ops-gateway
  ports:
  - name: gateway
    port: 48881
    targetPort: 48881
    nodePort: 30081
  type: NodePort
EOF
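The ENV_OPS_TOKEN comment in ops.yaml asks for the sample token to be changed on first deployment. One possible way to do that is sketched below: generate a random alphanumeric string and substitute it into the file. The function names are hypothetical, and the default length of 35 simply mirrors the sample value. The document does not state any required token format, so treat both as assumptions. The in-place edit relies on GNU sed (`sed -i`).

```shell
#!/bin/sh
# Generate a random alphanumeric token of the given length (default 35).
generate_ops_token() {
  len="${1:-35}"
  head -c 256 /dev/urandom | base64 | tr -dc 'A-Za-z0-9' | cut -c1-"$len"
}

# Replace the quoted ENV_OPS_TOKEN value in a YAML file in place (GNU sed).
# Usage: set_ops_token FILE TOKEN   (TOKEN is assumed to be alphanumeric)
set_ops_token() {
  sed -i "s/\(ENV_OPS_TOKEN: \)\"[^\"]*\"/\1\"$2\"/" "$1"
}
```

For example, `set_ops_token ops.yaml "$(generate_ops_token)"` rewrites the token before the file is applied. The trailing comment on that line is left untouched.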

Step 4: Create the namespace and start the services

Create the namespace

kubectl create ns hap-ops

Note: the ops platform is deployed in the hap-ops namespace by default.
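kubectl create ns reports an error if the namespace already exists, which matters when the step is re-run. A minimal sketch of an idempotent variant follows. The function name ensure_namespace and the KUBECTL override are assumptions of this example.

```shell
#!/bin/sh
KUBECTL="${KUBECTL:-kubectl}"

# Create the namespace only if it does not exist yet.
ensure_namespace() {
  if "$KUBECTL" get namespace "$1" >/dev/null 2>&1; then
    echo "namespace $1 already exists"
  else
    "$KUBECTL" create namespace "$1"
  fi
}
```

Calling `ensure_namespace hap-ops` can then be repeated safely before each deployment.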

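Start the ops platform services

kubectl apply -f ops.yaml

Tips

  • To stop the services, run: kubectl delete -f ops.yaml
  • Watch the Pod status closely during deployment to make sure every component starts properly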
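One way to watch the rollout is to wait on each workload defined in ops.yaml with kubectl rollout status. The sketch below assumes the workload names from this document. The function name, the KUBECTL override, and the 180-second timeout are illustrative choices.

```shell
#!/bin/sh
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="hap-ops"

# Wait for every workload defined in ops.yaml to finish rolling out.
# Stops at the first workload that fails or times out.
wait_for_ops_rollout() {
  for workload in deployment/ops-gateway deployment/ops-prometheus \
                  deployment/ops-agent-1 deployment/ops-agent-2 \
                  deployment/ops-agent-3 daemonset/ops-nodeagent; do
    "$KUBECTL" rollout status "$workload" -n "$NAMESPACE" --timeout=180s || return 1
  done
}
```

Running `wait_for_ops_rollout` right after `kubectl apply -f ops.yaml` blocks until all six workloads report success or one of them exceeds the timeout.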

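Step 5: Check the status of the ops platform services

kubectl -n hap-ops get pod -o wide

Success criterion: the READY column of every Pod should show 1/1, which indicates that the component is running normally.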
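The READY check can also be automated. As a small sketch, the filter below reads kubectl get pod output without headers and prints the names of Pods whose READY column is not fully ready. The function name not_ready_pods is made up for this example.

```shell
#!/bin/sh
# Read `kubectl get pod --no-headers` output on stdin and print the names of
# Pods whose READY column (ready/total) shows fewer ready containers than total.
not_ready_pods() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}
```

Usage: `kubectl -n hap-ops get pod --no-headers | not_ready_pods`. Empty output means every Pod matches the criterion above.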

Step 6: Configure the Nginx reverse proxy

To make the ops platform easier to reach, set up an Nginx reverse proxy:

cat > hap-ops.conf << 'EOF'
upstream hap-ops {
    server 172.29.202.34:30081; # Replace with the IP of the K8s node where the ops platform is deployed
}

server {
    listen 48881;
    server_name _;
    access_log /data/logs/weblogs/hap-ops.log main;
    error_log /data/logs/weblogs/hap-ops.error.log;

    underscores_in_headers on;
    client_max_body_size 2048m;
    gzip on;
    gzip_proxied any;
    gzip_disable "msie6";
    gzip_vary on;
    gzip_min_length 512;
    gzip_comp_level 6;
    gzip_buffers 16 8k;
    gzip_types text/plain text/css application/json application/x-javascript application/javascript application/octet-stream text/xml application/xml application/xml+rss text/javascript image/jpeg image/gif image/png;

    location / {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host $http_host;
        proxy_pass http://hap-ops;
    }
}
EOF

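Note: using port 48881 for the entry point is recommended, so that it matches the fixed port of the ops platform backend. Once the file is ready, place it in the Nginx configuration directory and restart the Nginx service.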
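A hedged sketch of the installation step follows. It assumes that Nginx includes files from /etc/nginx/conf.d (the directory differs between installations) and it uses a reload rather than a full restart, which is usually enough to apply a new server block. The function name and the NGINX, CONF_DIR, and LOG_DIR variables are assumptions of this example. Note that the access_log line refers to a log format named main, which must be defined in the http block of the main Nginx configuration.

```shell
#!/bin/sh
NGINX="${NGINX:-nginx}"
CONF_DIR="${CONF_DIR:-/etc/nginx/conf.d}"   # assumed include directory
LOG_DIR="${LOG_DIR:-/data/logs/weblogs}"    # log directory referenced by hap-ops.conf

# Usage: install_ops_proxy PATH_TO_HAP_OPS_CONF
install_ops_proxy() {
  mkdir -p "$LOG_DIR" || return 1
  cp "$1" "$CONF_DIR/hap-ops.conf" || return 1
  # Validate the whole configuration before applying it.
  "$NGINX" -t || return 1
  "$NGINX" -s reload
}
```

Running `install_ops_proxy hap-ops.conf` copies the file, checks the configuration with `nginx -t`, and reloads Nginx only if the check passes.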

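Step 7: Access the ops platform

Using the Nginx proxy above as an example, open the Nginx entry point:

http://hap-ops.demo.com:48881
  • The login token is the value of the ENV_OPS_TOKEN environment variable in ops.yaml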
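If the token is not at hand, it can be read back from the file. The helper below is a minimal sketch that extracts the quoted ENV_OPS_TOKEN value from a YAML file. The function name is hypothetical.

```shell
#!/bin/sh
# Print the quoted ENV_OPS_TOKEN value found in a YAML file (first match only).
ops_token_from_file() {
  sed -n 's/^[[:space:]]*ENV_OPS_TOKEN:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}
```

Usage: `ops_token_from_file ops.yaml`. The value that is live in the cluster can also be read from the ConfigMap with `kubectl -n hap-ops get configmap ops-config -o jsonpath='{.data.ENV_OPS_TOKEN}'`.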