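Lite Cluster Mode on Kubernetes
Deploying Node Exporter
When the operations platform is deployed inside a Kubernetes cluster, the ops-nodeagent container ships with a built-in Node Exporter service and runs on every cluster node as a DaemonSet.
To monitor servers outside the Kubernetes cluster, Node Exporter must first be deployed on those servers; otherwise their metrics cannot be collected.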
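As a quick sanity check, the sketch below probes the standard /metrics endpoint of a Node Exporter instance. It assumes the external server exposes Node Exporter on port 59100, the port used in the sample ENV_PROMETHEUS_HOST value later in this guide; the function name and the sample address are illustrative only.

```shell
#!/bin/bash
# Probe a Node Exporter endpoint and report whether it serves node metrics.
# Usage: check_node_exporter 192.168.1.5:59100   (example address)
check_node_exporter() {
  local target="$1"
  # -f: fail on HTTP errors, -s: silent, --max-time: do not hang on unreachable hosts
  if curl -fs --max-time 5 "http://${target}/metrics" | grep -q '^node_'; then
    echo "OK ${target}"
  else
    echo "FAIL ${target}"
    return 1
  fi
}
```

Running `check_node_exporter 192.168.1.5:59100` from the node that hosts the operations platform also confirms that no firewall sits between Prometheus and the external server.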
Deploying the Operations Platform
Pull the images (offline package download)
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-gateway:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.0.0
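- By default, the ops-gateway, ops-prometheus, and ops-agent images only need to be pulled on the node that hosts the operations platform, whereas the ops-nodeagent image must be pulled on every node in the Kubernetes cluster.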
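The split described above can be captured in a small helper so that each node pulls only what it needs. This is a sketch: the role names "platform" and "worker" and the function names are made up for illustration, while the image names and tags are copied from the pull commands above.

```shell
#!/bin/bash
# Image names and tags copied from the pull commands in this guide.
REGISTRY="registry.cn-hangzhou.aliyuncs.com/mdpublic"
PLATFORM_IMAGES="ops-gateway:1.1.0 ops-prometheus:1.1.0 ops-agent:1.1.0"
NODEAGENT_IMAGE="ops-nodeagent:1.0.0"

# Print the images a node needs. Pass "platform" for the node that hosts
# the operations platform; any other value is treated as an ordinary node.
images_for_node() {
  local role="$1" img
  if [ "$role" = "platform" ]; then
    for img in $PLATFORM_IMAGES; do
      echo "${REGISTRY}/${img}"
    done
  fi
  # The DaemonSet runs on every node, so every node needs this image.
  echo "${REGISTRY}/${NODEAGENT_IMAGE}"
}

# Pull every image required for the given role.
pull_images() {
  images_for_node "$1" | while read -r image; do
    crictl pull "$image"
  done
}
```

Call `pull_images platform` on the node that will host the platform and `pull_images worker` on the remaining nodes.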
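Create the node label
By default, a nodeSelector pins the operations platform to a single fixed node, and the monitoring data of the ops-prometheus service is persisted to that node's local disk through a hostPath volume.
Add the hap-ops=true label to the node that will run the operations platform services: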
kubectl label node nodeName hap-ops=true
- Replace nodeName with the name of the node that will run the operations platform services; node names can be listed with kubectl get node -o wide.
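Because the Prometheus data lives in a hostPath directory on one node, it is worth confirming that exactly one node carries the label. If several nodes were labeled, the ops-prometheus pod could be scheduled onto a different node after a restart and would not see the data written earlier. The following check is a sketch; the function name is illustrative.

```shell
#!/bin/bash
# Check that exactly one node carries the hap-ops=true label.
check_single_ops_node() {
  local nodes count
  # -l filters by label; -o name prints one "node/<name>" entry per line
  nodes=$(kubectl get node -l hap-ops=true -o name)
  # Count non-empty lines; grep -c exits non-zero on zero matches, hence "|| true"
  count=$(printf '%s\n' "$nodes" | grep -c . || true)
  if [ "$count" -eq 1 ]; then
    echo "OK: $nodes carries hap-ops=true"
  else
    echo "WARN: expected 1 labeled node, found $count"
    return 1
  fi
}
```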
Create the YAML file for the operations platform services
Create and enter the directory that holds the YAML file
mkdir -p /data/mingdao/script/kubernetes/ops/
cd /data/mingdao/script/kubernetes/ops/
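The ENV_OPS_TOKEN value in ops.yaml has to be changed on first deployment. One way to produce a replacement is sketched below. It assumes that any random alphanumeric string is acceptable and defaults to 35 characters only because the sample value has that length; this guide does not state any format requirement. The function name is illustrative.

```shell
#!/bin/bash
# Generate a random alphanumeric token (default length 35, matching the
# sample value in ops.yaml; the length is an assumption, not a requirement).
generate_ops_token() {
  local length="${1:-35}"
  # Read a bounded chunk of random bytes, keep only letters and digits,
  # then trim to the requested length.
  head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-"$length"
}
```

The output of `generate_ops_token` can then be pasted into ENV_OPS_TOKEN when the file is created.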
Create the ops.yaml file
cat > ops.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: ops-config
  namespace: hap-ops
data:
  TZ: "Asia/Shanghai"
  ENV_OPS_TOKEN: "SS9PobGG7SDTpcyfSZ1VVmn3gCmy2P52tYk"  # Must be changed on first deployment; this value is the access authentication key for the operations platform
  ENV_PROMETHEUS_HOST: "service_01/192.168.1.5:59100,service_02/192.168.1.6:59100,service_03/192.168.1.7:59100"  # Replace with the actual Node Exporter service addresses at deployment time
  ENV_PROMETHEUS_SERVER: "http://ops-prometheus:9090"
  ENV_PROMETHEUS_GRAFANA: "http://ops-prometheus:3000"
  ENV_PROMETHEUS_ALERT: "http://ops-prometheus:9093"
  ENV_PROMETHEUS_KARMA: "http://ops-prometheus:8080"
  ENV_PROMETHEUS_KAFKA: "kafka_1/ops-agent:9308"
  ENV_PROMETHEUS_ELASTICSEARCH: "elasticsearch_1/ops-agent:9114"
  ENV_PROMETHEUS_REDIS: "redis_1/ops-agent:9121"
  ENV_PROMETHEUS_MONGODB: "mongodb_1/ops-agent:9216"
  ENV_PROMETHEUS_MYSQL: "mysql_1/ops-agent:9104"
  # Connection details for the storage components; adjust the values below to match the actual environment
  ENV_MYSQL_HOST: "192.168.1.11"
  ENV_MYSQL_PORT: "3306"
  ENV_MYSQL_USERNAME: "root"
  ENV_MYSQL_PASSWORD: "changeme"
  ENV_MONGODB_URI: "mongodb://root:changeme@192.168.1.12:27017"
  ENV_MONGODB_OPTIONS: "?authSource=admin"
  ENV_REDIS_HOST: "192.168.1.13"
  ENV_REDIS_PORT: "6379"
  ENV_REDIS_PASSWORD: "changeme"
  ENV_KAFKA_ENDPOINTS: "192.168.1.14:9092"
  ENV_ELASTICSEARCH_ENDPOINTS: "http://192.168.1.15:9200"
  ENV_ELASTICSEARCH_PASSWORD: "elastic:changeme"
  ENV_FLINK_URL: "http://flink-jobmanager.flink:8081"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-gateway
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-gateway
  template:
    metadata:
      labels:
        app: ops-gateway
    spec:
      nodeSelector:
        hap-ops: "true"
      containers:
        - name: ops-gateway
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-gateway:1.1.0
          envFrom:
            - configMapRef:
                name: ops-config
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
            requests:
              cpu: "0.1"
              memory: "200Mi"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-prometheus
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-prometheus
  template:
    metadata:
      labels:
        app: ops-prometheus
    spec:
      nodeSelector:
        hap-ops: "true"
      containers:
        - name: ops-prometheus
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-prometheus:1.1.0
          volumeMounts:
            - mountPath: /data/
              name: prometheus-data
          envFrom:
            - configMapRef:
                name: ops-config
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
            requests:
              cpu: "0.1"
              memory: "200Mi"
      volumes:
        - name: prometheus-data
          hostPath:
            path: /data/ops-prometheus-data  # Persistent storage path on the node
            type: DirectoryOrCreate  # Create the directory if it does not exist
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ops-agent
  namespace: hap-ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ops-agent
  template:
    metadata:
      labels:
        app: ops-agent
    spec:
      nodeSelector:
        hap-ops: "true"
      containers:
        - name: ops-agent
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-agent:1.1.0
          envFrom:
            - configMapRef:
                name: ops-config
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
            requests:
              cpu: "0.1"
              memory: "200Mi"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ops-nodeagent
  namespace: hap-ops
spec:
  selector:
    matchLabels:
      app: ops-nodeagent
  template:
    metadata:
      labels:
        app: ops-nodeagent
    spec:
      containers:
        - name: ops-nodeagent
          image: registry.cn-hangzhou.aliyuncs.com/mdpublic/ops-nodeagent:1.0.0
          envFrom:
            - configMapRef:
                name: ops-config
          ports:
            - containerPort: 59100
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
            requests:
              cpu: "0.1"
              memory: "200Mi"
          volumeMounts:
            - name: host-root
              mountPath: /host
              readOnly: true
              mountPropagation: HostToContainer
      volumes:
        - name: host-root
          hostPath:
            path: /
      hostNetwork: true  # Use the host network
      hostPID: true  # Use the host PID namespace
---
apiVersion: v1
kind: Service
metadata:
  name: ops-prometheus
  namespace: hap-ops
spec:
  selector:
    app: ops-prometheus
  ports:
    - name: server
      port: 9090
      targetPort: 9090
    - name: grafana
      port: 3000
      targetPort: 3000
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: ops-agent
  namespace: hap-ops
spec:
  selector:
    app: ops-agent
  ports:
    - name: prometheus
      port: 9104
      targetPort: 9104
    - name: mongodb
      port: 9216
      targetPort: 9216
    - name: redis
      port: 9121
      targetPort: 9121
    - name: kafka
      port: 9308
      targetPort: 9308
    - name: elasticsearch
      port: 9114
      targetPort: 9114
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: ops-gateway
  namespace: hap-ops
spec:
  selector:
    app: ops-gateway
  ports:
    - name: gateway
      port: 48881
      targetPort: 48881
      nodePort: 30081
  type: NodePort
EOF
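The ENV_PROMETHEUS_HOST value in the ConfigMap is a comma-separated list of name/ip:port entries. With more than a handful of nodes it is easy to mistype, so a helper like the one below can assemble it from a plain list. This is a sketch: the function name and the "name ip" input format are assumptions chosen for the example, and the default port 59100 is taken from the sample value.

```shell
#!/bin/bash
# Build an ENV_PROMETHEUS_HOST value from "name ip" pairs read on stdin.
# An optional first argument overrides the port (default 59100).
build_prometheus_host() {
  local port="${1:-59100}"
  awk -v port="$port" '
    NF >= 2 { printf "%s%s/%s:%s", sep, $1, $2, port; sep = "," }
    END     { print "" }
  '
}
```

For example, `printf 'service_01 192.168.1.5\nservice_02 192.168.1.6\n' | build_prometheus_host` prints a string in the same shape as the sample value, ready to paste into ops.yaml.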
Create the namespace
kubectl create ns hap-ops
- The platform is deployed in the hap-ops namespace by default.
Start the operations platform services
kubectl apply -f ops.yaml
- Stop command:
kubectl delete -f ops.yaml
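kubectl apply returns as soon as the objects are accepted, before the pods are actually ready. To block until the workloads defined in ops.yaml have rolled out, something like the following can be run after the apply. It is a sketch: the function name and the default timeout are illustrative, and the workload names are the ones declared in ops.yaml.

```shell
#!/bin/bash
# Wait for each workload defined in ops.yaml to finish rolling out.
# An optional first argument sets the per-workload timeout (default 120s).
wait_for_ops() {
  local ns="hap-ops" timeout="${1:-120s}" workload
  for workload in deployment/ops-gateway deployment/ops-prometheus \
                  deployment/ops-agent daemonset/ops-nodeagent; do
    # rollout status exits non-zero if the timeout is reached first
    kubectl -n "$ns" rollout status "$workload" --timeout="$timeout" || return 1
  done
}
```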
Create the start/stop scripts
cd /data/mingdao/script/kubernetes/ops/
# Create start_ops.sh
cat > start_ops.sh << 'EOF'
#!/bin/bash
baseDir=$(dirname "$0")
kubectl apply -f "$baseDir/ops.yaml"
EOF
# Create stop_ops.sh
cat > stop_ops.sh << 'EOF'
#!/bin/bash
baseDir=$(dirname "$0")
kubectl delete -f "$baseDir/ops.yaml"
EOF
# Create restart_ops.sh
cat > restart_ops.sh << 'EOF'
#!/bin/bash
# Epoch seconds, so that the elapsed time can be computed by subtraction
START_TIME=$(date +%s)
baseDir=$(dirname "$0")
ns=hap-ops
count=120
grep_n=ops-prometheus
echo "Stopping $grep_n..."
/bin/bash "$baseDir/stop_ops.sh"
# Poll once per second, for up to 120 seconds, until the pod is gone
while [ "$count" -gt 0 ]; do
  if ! kubectl -n "$ns" get pods | grep -q "$grep_n"; then
    echo "$grep_n stopped successfully."
    /bin/bash "$baseDir/start_ops.sh"
    break
  fi
  echo -n "."
  count=$((count-1))
  sleep 1
done
# count reaches 0 only when the loop ran out without seeing the pod disappear
if [ "$count" -le 0 ]; then
  echo "Failed to stop $grep_n within 2 minutes."
  exit 1
else
  echo "$grep_n stopped and started successfully within 2 minutes."
fi
END_TIME=$(date +%s)
ELAPSED_TIME=$((END_TIME - START_TIME))
echo "Total time elapsed: $ELAPSED_TIME seconds"
EOF
# Grant execute permission
chmod +x /data/mingdao/script/kubernetes/ops/{start_ops.sh,stop_ops.sh,restart_ops.sh}
Check the status of the operations platform services
kubectl -n hap-ops get pod -o wide
- When everything is healthy, every pod shows 1/1 in the READY column.
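On clusters with many nodes the DaemonSet produces one pod per node, so scanning the READY column by eye gets tedious. The filter below is a sketch (the function name is illustrative) that reads the table printed by kubectl get pod and lists only the pods that are not fully ready or not in the Running state.

```shell
#!/bin/bash
# Read `kubectl get pod` table output on stdin and print the names of pods
# whose READY column is not n/n or whose STATUS is not Running.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}
```

Usage: `kubectl -n hap-ops get pod -o wide | not_ready_pods`. Empty output means every pod is ready.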
Configure the Nginx reverse proxy
cat > hap-ops.conf << 'EOF'
upstream hap-ops {
    server 172.29.202.34:30081;  # Replace with the IP of the Kubernetes node where the operations platform is deployed
}
server {
    listen 48881;
    server_name _;
    access_log /data/logs/weblogs/hap-ops.log main;
    error_log /data/logs/weblogs/hap-ops.error.log;
    underscores_in_headers on;
    client_max_body_size 2048m;
    gzip on;
    gzip_proxied any;
    gzip_disable "msie6";
    gzip_vary on;
    gzip_min_length 512;
    gzip_comp_level 6;
    gzip_buffers 16 8k;
    gzip_types text/plain text/css application/json application/x-javascript application/javascript application/octet-stream text/xml application/xml application/xml+rss text/javascript image/jpeg image/gif image/png;
    location / {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host $http_host;
        proxy_pass http://hap-ops;
    }
}
EOF
- Port 48881 is recommended for the access entry point so that it matches the fixed port of the operations platform backend.
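Once hap-ops.conf is in place, Nginx has to pick it up. The sketch below assumes the file has been saved in a directory that the main nginx.conf already includes (this guide does not say where), and reloads Nginx only if the configuration test passes. The function name is illustrative.

```shell
#!/bin/bash
# Validate the Nginx configuration and reload only when the test passes.
reload_nginx_if_valid() {
  if nginx -t; then
    nginx -s reload
  else
    echo "nginx configuration test failed; not reloading" >&2
    return 1
  fi
}
```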
Accessing the operations platform
Taking the Nginx proxy above as an example, open the Nginx entry point:
http://hap-ops.demo.com:48881
- The login token is the value of the ENV_OPS_TOKEN environment variable in ops.yaml.
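If ops.yaml is not at hand, the token can also be read back from the running cluster, since it is stored in the ops-config ConfigMap. A minimal sketch, with an illustrative function name:

```shell
#!/bin/bash
# Print the ENV_OPS_TOKEN value stored in the ops-config ConfigMap.
get_ops_token() {
  kubectl -n hap-ops get configmap ops-config \
    -o jsonpath='{.data.ENV_OPS_TOKEN}'
}
```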