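Kubernetes-Based Deployment (Standard Cluster Edition and Above)
The current K8s deployment document covers only the core metrics-monitoring components (ops-gateway / ops-prometheus / ops-agent / ops-nodeagent).
K8s deployment manifests for the following features introduced in v1.2.0 are still being completed:
- Log query (based on Loki + Alloy loki.source.docker): Alloy's Docker socket discovery mechanism is unavailable under the K8s container abstraction, so it must be replaced with the discovery.kubernetes mode
- Distributed tracing (based on Tempo)
- Unified observability UI (based on Grafana 12)
Interim options:
- Continue to deploy the full 1.2.x monitoring stack in Docker Compose mode
- Or deploy ops-alloy/ops-loki/ops-tempo/ops-grafana separately with Docker on a server outside the K8s cluster, and configure network connectivity through environment variables
Full K8s manifests are planned for a later release. Deploying according to this document currently yields the v1.1.x feature set (metrics monitoring + MongoDB slow queries + alerting).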
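Overview
This document describes how to deploy the multi-node operations platform in a Kubernetes cluster, covering Node Exporter deployment, image preparation, node label configuration, service definitions, and access setup. It applies to the Standard Cluster Edition and above, where components run with multiple replicas.
Deploy Node Exporter
When the operations platform is deployed inside a Kubernetes cluster, the ops-nodeagent container has a built-in Node Exporter service. It runs as a DaemonSet on every node of the cluster and automatically collects monitoring data from all nodes in the cluster.
To monitor servers outside the Kubernetes cluster, deploy Node Exporter separately on each external server.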
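The sample ops.yaml later in this document lists Node Exporter targets in ENV_PROMETHEUS_HOST as comma-separated name/host:port entries. As a minimal sketch, the helper below assembles a value in that format from name=ip pairs. The function name build_prometheus_hosts, the NODE_EXPORTER_PORT variable, and the assumption that external Node Exporters listen on 59100 (the port in the sample value) are illustrative and not part of the platform.

```shell
#!/bin/bash
# Hypothetical helper: join name=ip pairs into "name/ip:port,name/ip:port".
# NODE_EXPORTER_PORT is an illustrative variable; 59100 is the port used in
# the sample ENV_PROMETHEUS_HOST value, so adjust it if yours differs.
build_prometheus_hosts() {
    local port="${NODE_EXPORTER_PORT:-59100}"
    local out="" pair name ip
    for pair in "$@"; do
        name="${pair%%=*}"   # text before the first '='
        ip="${pair#*=}"      # text after the first '='
        out="${out:+$out,}${name}/${ip}:${port}"
    done
    printf '%s\n' "$out"
}
```

For example, build_prometheus_hosts svc_01=192.168.1.5 svc_02=192.168.1.6 prints svc_01/192.168.1.5:59100,svc_02/192.168.1.6:59100, which can be pasted into ops.yaml.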
Deploy the Operations Platform
Step 1: Download the images (offline package download)
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-gateway:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.0.0
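Notes:
- The ops-gateway, ops-prometheus, and ops-agent images only need to be downloaded on the node where the operations platform is deployed
- The ops-nodeagent image must be downloaded on every node in the Kubernetes cluster, because it runs as a DaemonSet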
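The split described in the notes can be scripted so that each node pulls only what it needs. The sketch below is one way to do it, assuming crictl is available on every node; the role names platform and worker and both function names are made up for illustration, while the image names and tags are copied from the pull commands above.

```shell
#!/bin/bash
# Registry prefix and tags are copied from the pull commands in this step.
REGISTRY="registry.cn-hangzhou.aliyuncs.com/mdpublic"

# Hypothetical helper: print the images a node needs.
#   platform -> node that runs the operations platform (also runs ops-nodeagent)
#   worker   -> any other cluster node (DaemonSet image only)
images_for_role() {
    case "$1" in
        platform)
            printf '%s\n' \
                "$REGISTRY/ops-gateway:1.1.0" \
                "$REGISTRY/ops-prometheus:1.1.0" \
                "$REGISTRY/ops-agent:1.1.0" \
                "$REGISTRY/ops-nodeagent:1.0.0"
            ;;
        worker)
            printf '%s\n' "$REGISTRY/ops-nodeagent:1.0.0"
            ;;
        *)
            echo "usage: images_for_role platform|worker" >&2
            return 1
            ;;
    esac
}

# Pull every image for the given role on the current node.
pull_images() {
    local image
    images_for_role "$1" | while read -r image; do
        crictl pull "$image"
    done
}
```

Run pull_images platform on the node that will host the platform and pull_images worker on the remaining nodes.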
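Step 2: Create node labels and taints
Configure the deployment node label
By default, nodeSelector pins the operations platform to a fixed node, and the monitoring data in the ops-prometheus service is persisted to local disk through hostPath.
Create the hap-ops=true label on the node that will run the operations platform services: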
kubectl label node <node-name> hap-ops=true
Tip: Replace <node-name> with the name of the node that will run the operations platform services. Run kubectl get node -o wide to list the node names in the cluster.
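Because every Deployment in ops.yaml selects nodes by this label, it can be worth confirming the label before moving on. A minimal sketch follows; the function name check_ops_label is hypothetical, and it relies only on the standard kubectl get nodes label selector.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if at least one node carries hap-ops=true.
check_ops_label() {
    local nodes
    # -l filters by label; -o name prints one "node/<name>" per line.
    nodes=$(kubectl get nodes -l hap-ops=true -o name) || return 1
    if [ -z "$nodes" ]; then
        echo "no node is labeled hap-ops=true" >&2
        return 1
    fi
    printf '%s\n' "$nodes"
}
```

If the function reports no labeled node, the platform Pods would stay in Pending after kubectl apply, since nodeSelector would match nothing.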
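Configure a dedicated monitoring node (optional)
To add a dedicated worker node for monitoring, create a taint on the new node so that only operations platform components can be scheduled onto it:
kubectl taint nodes <node-name> hap-ops=true:NoSchedule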
Note: If you add the taint above, you must also add the matching tolerations to every deployed component in ops.yaml:
# Add under spec.template.spec of each Deployment
# Uncomment the following lines and adjust as needed
# tolerations:
#   - key: "hap-ops"
#     operator: "Equal"
#     value: "true"
#     effect: "NoSchedule"
Step 3: Create the operations platform service YAML file
Prepare the configuration directory
mkdir -p /data/mingdao/script/kubernetes/ops/
cd /data/mingdao/script/kubernetes/ops/
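Create the main configuration file ops.yaml
Create the ops.yaml file. It defines the complete component architecture of the operations platform:
- ops-agent-1: monitors mongodb-1, elasticsearch-1, kafka-1, and the other shared components
- ops-agent-2: monitors mongodb-2, elasticsearch-2, and kafka-2
- ops-agent-3: monitors mongodb-3, elasticsearch-3, and kafka-3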
cat > ops.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config
  namespace: hap-ops
data:
  TZ: "Asia/Shanghai"
  ENV_OPS_TOKEN: "SS9PobGG7SDTpcyfSZ1VVmn3gCmy2P52tYk" # Access authentication key for the operations platform; be sure to change it on first deployment
  ENV_PROMETHEUS_HOST: "svc_01/192.168.1.5:59100,svc_02/192.168.1.6:59100,svc_03/192.168.1.7:59100" # Replace with the actual Node Exporter service addresses
  ENV_PROMETHEUS_SERVER: "http://ops-prometheus:9090"
  ENV_PROMETHEUS_GRAFANA: "http://ops-prometheus:3000"
  ENV_PROMETHEUS_ALERT: "http://ops-prometheus:9093"
  ENV_PROMETHEUS_KARMA: "http://ops-prometheus:8080"
  ENV_PROMETHEUS_KAFKA: "kafka_1/ops-agent-1:9308,kafka_2/ops-agent-2:9308,kafka_3/ops-agent-3:9308"
  ENV_PROMETHEUS_ELASTICSEARCH: "elasticsearch_1/ops-agent-1:9114,elasticsearch_2/ops-agent-2:9114,elasticsearch_3/ops-agent-3:9114"
  ENV_PROMETHEUS_REDIS: "redis_1/ops-agent-1:9121"
  ENV_PROMETHEUS_MONGODB: "mongodb_1/ops-agent-1:9216,mongodb_2/ops-agent-2:9216,mongodb_3/ops-agent-3:9216"
  ENV_PROMETHEUS_MYSQL: "mysql_1/ops-agent-1:9104"
  # Connection details for the storage components below; adjust them to match your environment
  ENV_MYSQL_HOST: "192.168.1.7"
  ENV_MYSQL_PORT: "3306"
  ENV_MYSQL_USERNAME: "root"
  ENV_MYSQL_PASSWORD: "changeme"
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.9:27017,192.168.1.10:27017,192.168.1.11:27017" # Used by the ops-gateway service to collect mongodb-agent metrics
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_REDIS_HOST: "192.168.1.8"
  ENV_REDIS_PORT: "6379"
  ENV_REDIS_PASSWORD: "changeme"
  ENV_FLINK_URL: "http://flink-jobmanager.flink:8081"
  ENV_PROMETHEUS_RETENTION: "30d" # Prometheus data retention period; defaults to 15d when not set
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config-agent-1 # Dedicated configuration for the first agent
  namespace: hap-ops
data:
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.9:27017" # Connection address of the first mongodb node
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_KAFKA_ENDPOINTS: "192.168.1.12:9092" # Connection address of the first kafka node
  ENV_ELASTICSEARCH_ENDPOINTS: "http://192.168.1.12:9200" # Connection address of the first elasticsearch node
  ENV_ELASTICSEARCH_PASSWORD: "elastic:changeme" # Username and password for the first elasticsearch node
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config-agent-2 # Dedicated configuration for the second agent
  namespace: hap-ops
data:
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.10:27017" # Connection address of the second mongodb node
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_KAFKA_ENDPOINTS: "192.168.1.13:9092" # Connection address of the second kafka node
  ENV_ELASTICSEARCH_ENDPOINTS: "http://192.168.1.13:9200" # Connection address of the second elasticsearch node
  ENV_ELASTICSEARCH_PASSWORD: "elastic:changeme" # Username and password for the second elasticsearch node
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config-agent-3 # Dedicated configuration for the third agent
  namespace: hap-ops
data:
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.11:27017" # Connection address of the third mongodb node
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_KAFKA_ENDPOINTS: "192.168.1.14:9092" # Connection address of the third kafka node
  ENV_ELASTICSEARCH_ENDPOINTS: "http://192.168.1.14:9200" # Connection address of the third elasticsearch node
  ENV_ELASTICSEARCH_PASSWORD: "elastic:changeme" # Username and password for the third elasticsearch node
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-gateway
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-gateway
  template:
    metadata:
      labels:
        app: ops-gateway
    spec:
      # Uncomment the following if the node taint is configured
      # tolerations:
      #   - key: "hap-ops"
      #     operator: "Equal"
      #     value: "true"
      #     effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
        - name: ops-gateway
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-gateway:1.1.0
          envFrom:
            - configMapRef:
                name: ops-config
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
            requests:
              cpu: "0.1"
              memory: "200Mi"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-prometheus
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-prometheus
  template:
    metadata:
      labels:
        app: ops-prometheus
    spec:
      # Uncomment the following if the node taint is configured
      # tolerations:
      #   - key: "hap-ops"
      #     operator: "Equal"
      #     value: "true"
      #     effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
        - name: ops-prometheus
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.1.0
          volumeMounts:
            - mountPath: /data/
              name: prometheus-data
          envFrom:
            - configMapRef:
                name: ops-config
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
            requests:
              cpu: "0.1"
              memory: "200Mi"
      volumes:
        - name: prometheus-data
          hostPath:
            path: /data/ops-prometheus-data # Persistent storage path
            type: DirectoryOrCreate # Create the directory if it does not exist
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-agent-1
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-agent-1
  template:
    metadata:
      labels:
        app: ops-agent-1
    spec:
      # Uncomment the following if the node taint is configured
      # tolerations:
      #   - key: "hap-ops"
      #     operator: "Equal"
      #     value: "true"
      #     effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
        - name: ops-agent-1
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
          envFrom:
            - configMapRef:
                name: ops-config
            - configMapRef:
                name: ops-config-agent-1 # Dedicated configuration (listed last, so it overrides the shared configuration)
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              cpu: "0.05"
              memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: ops-agent-1
  namespace: hap-ops
spec:
  selector:
    app: ops-agent-1
  ports:
    - name: prometheus # Port 9104 is the one referenced by ENV_PROMETHEUS_MYSQL
      port: 9104
      targetPort: 9104
    - name: mongodb
      port: 9216
      targetPort: 9216
    - name: redis
      port: 9121
      targetPort: 9121
    - name: kafka
      port: 9308
      targetPort: 9308
    - name: elasticsearch
      port: 9114
      targetPort: 9114
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-agent-2
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-agent-2
  template:
    metadata:
      labels:
        app: ops-agent-2
    spec:
      # Uncomment the following if the node taint is configured
      # tolerations:
      #   - key: "hap-ops"
      #     operator: "Equal"
      #     value: "true"
      #     effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
        - name: ops-agent-2
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
          envFrom:
            - configMapRef:
                name: ops-config-agent-2 # Dedicated configuration only; the shared ops-config is not loaded here
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              cpu: "0.05"
              memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: ops-agent-2
  namespace: hap-ops
spec:
  selector:
    app: ops-agent-2
  ports:
    - name: mongodb
      port: 9216
      targetPort: 9216
    - name: kafka
      port: 9308
      targetPort: 9308
    - name: elasticsearch
      port: 9114
      targetPort: 9114
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-agent-3
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-agent-3
  template:
    metadata:
      labels:
        app: ops-agent-3
    spec:
      # Uncomment the following if the node taint is configured
      # tolerations:
      #   - key: "hap-ops"
      #     operator: "Equal"
      #     value: "true"
      #     effect: "NoSchedule"
      nodeSelector:
        hap-ops: "true"
      containers:
        - name: ops-agent-3
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
          envFrom:
            - configMapRef:
                name: ops-config-agent-3 # Dedicated configuration only; the shared ops-config is not loaded here
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              cpu: "0.05"
              memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: ops-agent-3
  namespace: hap-ops
spec:
  selector:
    app: ops-agent-3
  ports:
    - name: mongodb
      port: 9216
      targetPort: 9216
    - name: kafka
      port: 9308
      targetPort: 9308
    - name: elasticsearch
      port: 9114
      targetPort: 9114
  type: ClusterIP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ops-nodeagent
  namespace: hap-ops
spec:
  selector:
    matchLabels:
      app: ops-nodeagent
  template:
    metadata:
      labels:
        app: ops-nodeagent
    spec:
      containers:
        - name: ops-nodeagent
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.0.0
          envFrom:
            - configMapRef:
                name: ops-config
          ports:
            - containerPort: 59100
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
            requests:
              cpu: "0.1"
              memory: "200Mi"
          volumeMounts:
            - name: host-root
              mountPath: /host
              readOnly: true
              mountPropagation: HostToContainer
      volumes:
        - name: host-root
          hostPath:
            path: /
      hostNetwork: true # Use the host network
      hostPID: true # Use the host PID namespace
---
apiVersion: v1
kind: Service
metadata:
  name: ops-prometheus
  namespace: hap-ops
spec:
  selector:
    app: ops-prometheus
  ports:
    - name: server
      port: 9090
      targetPort: 9090
    - name: grafana
      port: 3000
      targetPort: 3000
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: ops-gateway
  namespace: hap-ops
spec:
  selector:
    app: ops-gateway
  ports:
    - name: gateway
      port: 48881
      targetPort: 48881
      nodePort: 30081
  type: NodePort
EOF
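The comment on ENV_OPS_TOKEN asks for the sample key to be changed on first deployment. A small guard like the one below can catch a forgotten edit before the manifest is applied. This is only a sketch: check_ops_token is a hypothetical name, and the check is a plain text match against the sample value shown in this document.

```shell
#!/bin/bash
# Sample token printed in this document; it should not reach a real deployment.
SAMPLE_TOKEN="SS9PobGG7SDTpcyfSZ1VVmn3gCmy2P52tYk"

# Hypothetical helper: fail if the manifest still contains the sample token.
check_ops_token() {
    local file="${1:-ops.yaml}"
    if grep -q "ENV_OPS_TOKEN:.*${SAMPLE_TOKEN}" "$file"; then
        echo "ENV_OPS_TOKEN in $file is still the sample value" >&2
        return 1
    fi
}
```

Calling check_ops_token with no argument inspects ops.yaml in the current directory and returns non-zero while the sample token is still present.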
Step 4: Create the namespace and start the services
Create the namespace
kubectl create ns hap-ops
Note: The operations platform is deployed in the hap-ops namespace by default.
Start the operations platform services
kubectl apply -f ops.yaml
Tips:
- To stop the services, run: kubectl delete -f ops.yaml
- Watch the Pod status closely during deployment to make sure every component starts normally
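Instead of watching Pod status by hand, kubectl rollout status can block until each workload is ready. The following is a minimal sketch with a hypothetical function name wait_for_ops; the workload names are the ones defined in ops.yaml, and the 120s default timeout is an arbitrary choice.

```shell
#!/bin/bash
# Deployment names are taken from ops.yaml in this document.
OPS_DEPLOYMENTS="ops-gateway ops-prometheus ops-agent-1 ops-agent-2 ops-agent-3"

# Hypothetical helper: block until every workload is rolled out, or fail.
wait_for_ops() {
    local ns="${1:-hap-ops}" timeout="${2:-120s}" name
    for name in $OPS_DEPLOYMENTS; do
        kubectl -n "$ns" rollout status "deployment/$name" --timeout="$timeout" || return 1
    done
    # ops-nodeagent is a DaemonSet, so it is checked separately.
    kubectl -n "$ns" rollout status daemonset/ops-nodeagent --timeout="$timeout"
}
```

Running wait_for_ops right after kubectl apply -f ops.yaml returns non-zero as soon as one workload fails to become ready within the timeout.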
Create start and stop scripts
cd /data/mingdao/script/kubernetes/ops/
# Create start_ops.sh
cat > start_ops.sh << 'EOF'
#!/bin/bash
baseDir=$(dirname "$0")
kubectl apply -f "$baseDir/ops.yaml"
EOF
# Create stop_ops.sh
cat > stop_ops.sh << 'EOF'
#!/bin/bash
baseDir=$(dirname "$0")
kubectl delete -f "$baseDir/ops.yaml"
EOF
# Create restart_ops.sh
cat > restart_ops.sh << 'EOF'
#!/bin/bash
# Use epoch seconds so the elapsed time can be computed by subtraction
START_DATETIME=$(date +%s)
baseDir=$(dirname "$0")
ns=hap-ops
count=120
grep_n=ops-prometheus
echo "Stopping ops-prometheus..."
/bin/bash "$baseDir/stop_ops.sh"
# Poll once per second until the ops-prometheus Pod is gone, then start again
while [ "$count" -gt 0 ]; do
    if ! kubectl -n "$ns" get pods | grep -q "$grep_n"; then
        echo "$grep_n stopped successfully."
        /bin/bash "$baseDir/start_ops.sh"
        break
    fi
    echo -n "."
    count=$((count-1))
    sleep 1
done
# count reaches 0 only when the loop timed out without hitting break
if [ "$count" -le 0 ]; then
    echo "Failed to stop ops-prometheus within 2 minutes."
    exit 1
else
    echo "$grep_n stopped and started successfully within 2 minutes."
fi
END_DATETIME=$(date +%s)
ELAPSED_TIME=$((END_DATETIME - START_DATETIME))
echo "Total time elapsed: $ELAPSED_TIME seconds"
EOF
# Grant execute permission
chmod +x /data/mingdao/script/kubernetes/ops/{start_ops.sh,stop_ops.sh,restart_ops.sh}
Step 5: Check the operations platform service status
kubectl -n hap-ops get pod -o wide
Verification: the READY column of every Pod should show 1/1, which indicates that the component is running normally.
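The READY check can also be automated by comparing the two numbers in that column. The sketch below assumes the default kubectl get pod column layout (NAME first, READY second); both function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `kubectl get pod --no-headers` output on stdin and
# print the names of Pods whose READY column (ready/total) is not fully ready.
not_ready_pods() {
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Example wiring: returns non-zero when any Pod in hap-ops is not ready.
check_ops_pods() {
    local bad
    bad=$(kubectl -n hap-ops get pod --no-headers | not_ready_pods)
    if [ -n "$bad" ]; then
        echo "Pods not ready:" >&2
        printf '%s\n' "$bad" >&2
        return 1
    fi
}
```

Keeping the parsing in not_ready_pods separate from the kubectl call makes it possible to try the filter on saved output before pointing it at a live cluster.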
Step 6: Configure the Nginx reverse proxy
To make the operations platform easier to access, configure an Nginx reverse proxy:
cat > hap-ops.conf << 'EOF'
upstream hap-ops {
    server 172.29.202.34:30081; # Replace with the IP of the K8s node where the operations platform is deployed
}
server {
    listen 48881;
    server_name _;
    access_log /data/logs/weblogs/hap-ops.log main;
    error_log /data/logs/weblogs/hap-ops.error.log;
    underscores_in_headers on;
    client_max_body_size 2048m;
    gzip on;
    gzip_proxied any;
    gzip_disable "msie6";
    gzip_vary on;
    gzip_min_length 512;
    gzip_comp_level 6;
    gzip_buffers 16 8k;
    gzip_types text/plain text/css application/json application/x-javascript application/javascript application/octet-stream text/xml application/xml application/xml+rss text/javascript image/jpeg image/gif image/png;
    location / {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host $http_host;
        proxy_pass http://hap-ops;
    }
}
EOF
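Note: Port 48881 is recommended for the access entry so that it matches the fixed backend port of the operations platform. After completing the configuration, place this file in the Nginx configuration directory and restart the Nginx service.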
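One way to carry out that last step is sketched below. It validates the configuration first and then reloads Nginx rather than fully restarting it, which is a substitution on my part. The function name install_ops_proxy and the default directory /etc/nginx/conf.d are assumptions; use whichever directory your nginx.conf actually includes.

```shell
#!/bin/bash
# Hypothetical helper: install the proxy config, validate it, then reload Nginx.
# The default conf_dir is an assumption and may differ on your host.
install_ops_proxy() {
    local conf_dir="${1:-/etc/nginx/conf.d}"
    cp hap-ops.conf "$conf_dir/hap-ops.conf" || return 1
    # nginx -t checks the whole configuration without applying it.
    if ! nginx -t; then
        echo "nginx configuration test failed; not reloading" >&2
        return 1
    fi
    nginx -s reload
}
```

The test step is useful here because the sample config references a log format named main and log files under /data/logs/weblogs, both of which have to exist on the proxy host.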
Step 7: Access the operations platform
Using the Nginx proxy above as an example, open the Nginx entry:
http://hap-ops.demo.com:48881
- The login token is the value of the ENV_OPS_TOKEN environment variable in ops.yaml
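Before opening a browser, a quick request from the command line can confirm that the proxy answers at all. This is a minimal sketch using standard curl options; ops_http_status is a hypothetical name, the default URL is the sample address from this step, and which status code the platform returns for an unauthenticated request is not stated in this document.

```shell
#!/bin/bash
# Hypothetical helper: print the HTTP status code returned by the entry URL.
ops_http_status() {
    local url="${1:-http://hap-ops.demo.com:48881}"
    # -s hides progress, -o discards the body, -m caps the request at 10 seconds,
    # and -w prints only the status code.
    curl -s -o /dev/null -m 10 -w '%{http_code}' "$url"
}
```

A result of 000 means curl could not connect, which usually points at the Nginx listener, the NodePort, or a firewall rather than at the platform itself.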