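Kubernetes-Based Lite Cluster Mode

Deploy Node Exporter

When the ops platform is deployed inside a Kubernetes cluster, the ops-nodeagent container ships with a built-in Node Exporter service and runs as a DaemonSet on every node of the cluster.

To monitor servers outside the Kubernetes cluster, first deploy Node Exporter on those servers; their metrics cannot be collected until it is running.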
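Before adding an external server as a monitoring target, it can help to confirm that its Node Exporter answers. The following is a minimal sketch, assuming curl is installed, that the exporter listens on port 59100 (the port used in the ENV_PROMETHEUS_HOST example in ops.yaml), and that it serves Node Exporter's standard /metrics path. The function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the ops platform.

# Build the metrics URL for a Node Exporter endpoint.
# $1 = host or IP, $2 = port (assumed default: 59100)
node_exporter_url() {
  local host="$1" port="${2:-59100}"
  printf 'http://%s:%s/metrics\n' "$host" "$port"
}

# Return 0 if the endpoint answers within 5 seconds, non-zero otherwise.
check_node_exporter() {
  curl -fsS --max-time 5 -o /dev/null "$(node_exporter_url "$@")"
}
```

For example, check_node_exporter 192.168.1.5 && echo reachable prints "reachable" only when the endpoint responds successfully.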

Deploy the Ops Platform

Download the Images (Offline Package Download)

crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-gateway:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.0.0
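  • By default, the ops-gateway, ops-prometheus, and ops-agent images only need to be pulled on the node where the ops platform is deployed, whereas the ops-nodeagent image must be pulled on every node in the Kubernetes cluster.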
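The per-node image rule above can be sketched as a small helper. This is only an illustration: the function names and the "platform" role keyword are hypothetical, while the image names and tags are the ones listed in the pull commands.

```shell
#!/bin/bash
# Hypothetical helper: print the images a node needs, one per line.
# $1 = "platform" for the node that runs the ops platform, anything else for other nodes.
REGISTRY="registry.cn-hangzhou.aliyuncs.com/mdpublic"

ops_images() {
  # Every node runs the ops-nodeagent DaemonSet, so every node needs this image.
  echo "$REGISTRY/ops-nodeagent:1.0.0"
  if [ "$1" = "platform" ]; then
    echo "$REGISTRY/ops-gateway:1.1.0"
    echo "$REGISTRY/ops-prometheus:1.1.0"
    echo "$REGISTRY/ops-agent:1.1.0"
  fi
}

# Pull every image for the given role, stopping at the first failure.
pull_ops_images() {
  local image
  for image in $(ops_images "$1"); do
    crictl pull "$image" || return 1
  done
}
```

Run pull_ops_images platform on the node that will host the ops platform, and pull_ops_images worker on every other node.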

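Create the Node Label

By default, a nodeSelector pins the ops platform to one fixed node, and the ops-prometheus service persists its monitoring data to that node's local disk through a hostPath volume.

Create the hap-ops=true label on the node that will run the ops platform services: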

kubectl label node nodeName hap-ops=true
  • Replace nodeName with the name of the node that will run the ops platform services; node names can be listed with kubectl get node -o wide.
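The labeling step can be wrapped so that it also shows which nodes carry the label afterwards. This is a minimal sketch; the function name is hypothetical, and it relies only on the standard kubectl label --overwrite flag and the -l label selector.

```shell
#!/bin/bash
# Hypothetical helper: label a node for the ops platform and show the result.
# $1 = node name as listed by `kubectl get node -o wide`
label_ops_node() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: label_ops_node <nodeName>" >&2
    return 2
  fi
  # --overwrite makes the command safe to re-run if the label already exists.
  kubectl label node "$node" hap-ops=true --overwrite || return 1
  # List the nodes that now carry the label.
  kubectl get node -l hap-ops=true
}
```

Because the platform is meant to run on a single fixed node, the final listing should normally show exactly one node.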

Create the Ops Platform Service YAML File

Create and enter the directory that will hold the YAML file

mkdir -p /data/mingdao/script/kubernetes/ops/
cd /data/mingdao/script/kubernetes/ops/

Create the ops.yaml file

cat > ops.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config
  namespace: hap-ops
data:
  TZ: "Asia/Shanghai"
  ENV_OPS_TOKEN: "SS9PobGG7SDTpcyfSZ1VVmn3gCmy2P52tYk" # Be sure to change this on first deployment; this value is the access authentication key for the ops platform
  ENV_PROMETHEUS_HOST: "service_01/192.168.1.5:59100,service_02/192.168.1.6:59100,service_03/192.168.1.7:59100" # Replace with the actual Node Exporter service addresses at deployment time
  ENV_PROMETHEUS_SERVER: "http://ops-prometheus:9090"
  ENV_PROMETHEUS_GRAFANA: "http://ops-prometheus:3000"
  ENV_PROMETHEUS_ALERT: "http://ops-prometheus:9093"
  ENV_PROMETHEUS_KARMA: "http://ops-prometheus:8080"
  ENV_PROMETHEUS_KAFKA: "kafka_1/ops-agent:9308"
  ENV_PROMETHEUS_ELASTICSEARCH: "elasticsearch_1/ops-agent:9114"
  ENV_PROMETHEUS_REDIS: "redis_1/ops-agent:9121"
  ENV_PROMETHEUS_MONGODB: "mongodb_1/ops-agent:9216"
  ENV_PROMETHEUS_MYSQL: "mysql_1/ops-agent:9104"
  # Connection details for the storage components; adjust the values below to match your environment
  ENV_MYSQL_HOST: "192.168.1.11"
  ENV_MYSQL_PORT: "3306"
  ENV_MYSQL_USERNAME: "root"
  ENV_MYSQL_PASSWORD: "changeme"
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.12:27017"
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_REDIS_HOST: "192.168.1.13"
  ENV_REDIS_PORT: "6379"
  ENV_REDIS_PASSWORD: "changeme"
  ENV_KAFKA_ENDPOINTS: "192.168.1.14:9092"
  ENV_ELASTICSEARCH_ENDPOINTS: "http://192.168.1.15:9200"
  ENV_ELASTICSEARCH_PASSWORD: "elastic:changeme"
  ENV_FLINK_URL: "http://flink-jobmanager.flink:8081"

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-gateway
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-gateway
  template:
    metadata:
      labels:
        app: ops-gateway
    spec:
      nodeSelector:
        hap-ops: "true"
      containers:
      - name: ops-gateway
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-gateway:1.1.0
        envFrom:
        - configMapRef:
            name: ops-config
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "0.1"
            memory: "200Mi"

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-prometheus
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-prometheus
  template:
    metadata:
      labels:
        app: ops-prometheus
    spec:
      nodeSelector:
        hap-ops: "true"
      containers:
      - name: ops-prometheus
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.1.0
        volumeMounts:
        - mountPath: /data/
          name: prometheus-data
        envFrom:
        - configMapRef:
            name: ops-config
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "0.1"
            memory: "200Mi"
      volumes:
      - name: prometheus-data
        hostPath:
          path: /data/ops-prometheus-data # Path for persistent storage
          type: DirectoryOrCreate # Create the directory if it does not exist

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-agent
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-agent
  template:
    metadata:
      labels:
        app: ops-agent
    spec:
      nodeSelector:
        hap-ops: "true"
      containers:
      - name: ops-agent
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
        envFrom:
        - configMapRef:
            name: ops-config
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "0.1"
            memory: "200Mi"

---

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ops-nodeagent
  namespace: hap-ops
spec:
  selector:
    matchLabels:
      app: ops-nodeagent
  template:
    metadata:
      labels:
        app: ops-nodeagent
    spec:
      containers:
      - name: ops-nodeagent
        image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.0.0
        envFrom:
        - configMapRef:
            name: ops-config
        ports:
        - containerPort: 59100
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "0.1"
            memory: "200Mi"
        volumeMounts:
        - name: host-root
          mountPath: /host
          readOnly: true
          mountPropagation: HostToContainer
      volumes:
      - name: host-root
        hostPath:
          path: /
      hostNetwork: true # Use the host network
      hostPID: true # Use the host PID namespace

---

apiVersion: v1
kind: Service
metadata:
  name: ops-prometheus
  namespace: hap-ops
spec:
  selector:
    app: ops-prometheus
  ports:
  - name: server
    port: 9090
    targetPort: 9090
  - name: grafana
    port: 3000
    targetPort: 3000
  type: ClusterIP

---

apiVersion: v1
kind: Service
metadata:
  name: ops-agent
  namespace: hap-ops
spec:
  selector:
    app: ops-agent
  ports:
  - name: prometheus
    port: 9104
    targetPort: 9104
  - name: mongodb
    port: 9216
    targetPort: 9216
  - name: redis
    port: 9121
    targetPort: 9121
  - name: kafka
    port: 9308
    targetPort: 9308
  - name: elasticsearch
    port: 9114
    targetPort: 9114
  type: ClusterIP

---

apiVersion: v1
kind: Service
metadata:
  name: ops-gateway
  namespace: hap-ops
spec:
  selector:
    app: ops-gateway
  ports:
  - name: gateway
    port: 48881
    targetPort: 48881
    nodePort: 30081
  type: NodePort
EOF
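Two ConfigMap values in ops.yaml must be adapted before the first deployment: ENV_OPS_TOKEN and ENV_PROMETHEUS_HOST. The sketch below shows one way to produce them. Both function names are hypothetical, the service_NN naming simply mirrors the sample value, and it is an assumption that the platform accepts any alphanumeric string as a token.

```shell
#!/bin/bash
# Hypothetical helpers for preparing ConfigMap values; not part of the ops platform.

# Print a random 48-character hexadecimal token read from /dev/urandom.
gen_ops_token() {
  od -An -N24 -tx1 /dev/urandom | tr -dc '0-9a-f'
  echo
}

# Build an ENV_PROMETHEUS_HOST value from host:port arguments.
# Each target gets a generated name: service_01, service_02, ...
build_prometheus_host() {
  local i=0 out="" target
  for target in "$@"; do
    i=$((i + 1))
    out="${out:+$out,}$(printf 'service_%02d/%s' "$i" "$target")"
  done
  echo "$out"
}
```

For example, build_prometheus_host 192.168.1.5:59100 192.168.1.6:59100 prints service_01/192.168.1.5:59100,service_02/192.168.1.6:59100, which can be pasted into ops.yaml.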

Create the Namespace

kubectl create ns hap-ops
  • The ops platform is deployed in the hap-ops namespace by default.

Start the Ops Platform Services

kubectl apply -f ops.yaml
  • To stop the services: kubectl delete -f ops.yaml

Create Start and Stop Scripts
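After kubectl apply returns, the workloads may still be starting. A minimal sketch for waiting until they have rolled out is shown below. The function name and the 120s default are hypothetical; the workload names come from ops.yaml, and kubectl rollout status with --timeout is a standard kubectl command.

```shell
#!/bin/bash
# Hypothetical helper: wait until every workload defined in ops.yaml has rolled out.
# $1 = namespace (default: hap-ops), $2 = timeout per workload (default: 120s)
wait_ops_ready() {
  local ns="${1:-hap-ops}" timeout="${2:-120s}" name
  for name in ops-gateway ops-prometheus ops-agent; do
    kubectl -n "$ns" rollout status "deployment/$name" --timeout="$timeout" || return 1
  done
  kubectl -n "$ns" rollout status daemonset/ops-nodeagent --timeout="$timeout"
}
```

Calling wait_ops_ready right after kubectl apply -f ops.yaml returns non-zero if any workload fails to become ready within the timeout.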

cd /data/mingdao/script/kubernetes/ops/

# Create start_ops.sh
cat > start_ops.sh << 'EOF'
#!/bin/bash
baseDir=$(dirname "$0")

kubectl apply -f "$baseDir/ops.yaml"
EOF

# Create stop_ops.sh
cat > stop_ops.sh << 'EOF'
#!/bin/bash
baseDir=$(dirname "$0")

kubectl delete -f "$baseDir/ops.yaml"
EOF

# Create restart_ops.sh
cat > restart_ops.sh << 'EOF'
#!/bin/bash

# Record epoch seconds so the elapsed time can be computed by plain subtraction
START_TIME=$(date +%s)
baseDir=$(dirname "$0")
ns=hap-ops
count=120
grep_n=ops-prometheus

echo "Stopping $grep_n..."
/bin/bash "$baseDir/stop_ops.sh"

# Poll once per second, for up to 120 seconds, until no ops-prometheus pod remains
while [ "$count" -gt 0 ]; do
    if ! kubectl -n "$ns" get pods | grep -q "$grep_n"; then
        echo "$grep_n stopped successfully."
        /bin/bash "$baseDir/start_ops.sh"
        break
    fi

    echo -n "."
    count=$((count-1))
    sleep 1
done

# The loop only reaches 0 when the pod never disappeared
if [ "$count" -le 0 ]; then
    echo "Failed to stop $grep_n within 2 minutes."
    exit 1
else
    echo "$grep_n stopped and started successfully within 2 minutes."
fi

END_TIME=$(date +%s)
ELAPSED_TIME=$((END_TIME - START_TIME))
echo "Total time elapsed: $ELAPSED_TIME seconds"
EOF

# Make the scripts executable
chmod +x /data/mingdao/script/kubernetes/ops/{start_ops.sh,stop_ops.sh,restart_ops.sh}

Check the Ops Platform Service Status

kubectl -n hap-ops get pod -o wide
  • When everything is healthy, every pod shows 1/1 in the READY column.

Configure the Nginx Reverse Proxy
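The READY check can also be automated. The sketch below reads kubectl get pod output on standard input and prints every pod whose READY column is not fully ready. The function name is hypothetical, and it assumes the default column layout in which READY is the second column.

```shell
#!/bin/bash
# Hypothetical helper: read `kubectl get pod` output on stdin and print the
# name of every pod whose READY column is not fully ready (for example 0/1).
# Returns non-zero if at least one such pod is found.
not_ready_pods() {
  awk 'NR > 1 {
         split($2, r, "/")
         if (r[1] != r[2]) { print $1; bad = 1 }
       }
       END { exit bad }'
}
```

For example, kubectl -n hap-ops get pod -o wide | not_ready_pods prints nothing and returns 0 when all pods are ready.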

cat > hap-ops.conf << 'EOF'
upstream hap-ops {
    server 172.29.202.34:30081; # Replace with the IP of the Kubernetes node running the ops platform
}

server {
    listen 48881;
    server_name _;
    access_log /data/logs/weblogs/hap-ops.log main;
    error_log /data/logs/weblogs/hap-ops.error.log;

    underscores_in_headers on;
    client_max_body_size 2048m;
    gzip on;
    gzip_proxied any;
    gzip_disable "msie6";
    gzip_vary on;
    gzip_min_length 512;
    gzip_comp_level 6;
    gzip_buffers 16 8k;
    gzip_types text/plain text/css application/json application/x-javascript application/javascript application/octet-stream text/xml application/xml application/xml+rss text/javascript image/jpeg image/gif image/png;

    location / {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host $http_host;
        proxy_pass http://hap-ops;
    }
}
EOF
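  • We recommend exposing the entry point on port 48881 so that it matches the fixed port of the ops platform backend.

Access the Ops Platform

Using the Nginx proxy above as an example, open the Nginx entry point: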
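The upstream address in hap-ops.conf has to be replaced with the real node IP. A minimal sketch of doing that with sed is shown below. The function name is hypothetical, and the default port 30081 is the nodePort defined for the ops-gateway Service in ops.yaml.

```shell
#!/bin/bash
# Hypothetical helper: point the hap-ops upstream at a given node IP.
# $1 = path to hap-ops.conf, $2 = node IP, $3 = NodePort (default: 30081)
set_ops_upstream() {
  local conf="$1" ip="$2" port="${3:-30081}"
  # Rewrite only the "server <ip>:<port>;" line inside the upstream block;
  # the previous file is kept as <conf>.bak.
  sed -i.bak -E "/^upstream hap-ops \{/,/^\}/ s/^([[:space:]]*server[[:space:]]+)[^;]+;/\1${ip}:${port};/" "$conf"
}
```

After editing, nginx -t checks the configuration syntax and nginx -s reload applies it.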

http://hap-ops.demo.com:48881
  • The login token is the value of the ENV_OPS_TOKEN environment variable in ops.yaml.
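If the token is not at hand, it can be read back from the file. The sketch below extracts it with sed; the function name is hypothetical, and it assumes the value is written in double quotes on a single line, as in the ops.yaml shown earlier.

```shell
#!/bin/bash
# Hypothetical helper: print the ENV_OPS_TOKEN value found in an ops.yaml file.
# $1 = path to ops.yaml (default: ./ops.yaml)
read_ops_token() {
  sed -n -E 's/^[[:space:]]*ENV_OPS_TOKEN:[[:space:]]*"([^"]*)".*$/\1/p' "${1:-ops.yaml}" | head -n 1
}
```

On a running cluster, kubectl -n hap-ops get configmap ops-config -o jsonpath='{.data.ENV_OPS_TOKEN}' returns the value that is actually deployed.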