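Preface
Note that "monitoring Dell server hardware" in the title means monitoring the health status of hardware components (disks, memory, power supplies, and so on). It does not mean monitoring hardware performance or usage figures such as disk space or memory consumption. It is comparable to having Zabbix poll the iDRAC over SNMP to read hardware status.
Most companies today use Prometheus to monitor containers and services and Zabbix to monitor hardware and ports, although other monitoring architectures exist. This document does not compare their strengths and weaknesses; it is simply a write-up of one setup. Basic concepts are not explained in detail, so it suits readers who already have some Prometheus background and is not suitable for readers new to it.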
Prerequisites


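<2> Networks are usually restricted for security reasons. Find a server that can ping the iDRAC IP address of every monitored server and install the SNMP monitoring components on it.
<3> The Prometheus server must be able to reach snmp_exporter.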
Component installation
Install dependencies
    yum -y install gcc gcc-c++ make net-snmp net-snmp-utils net-snmp-libs net-snmp-devel golang git
Install snmp_exporter
<1> Download snmp_exporter
Release page: https://github.com/prometheus/snmp_exporter/releases
    cd /data
    wget https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.linux-amd64.tar.gz
    tar xf snmp_exporter-0.20.0.linux-amd64.tar.gz
    mv snmp_exporter-0.20.0.linux-amd64 snmp_exporter
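Before unpacking, it can be worth checking that the tarball downloaded intact. The following is a minimal sketch, not part of the original procedure: verify_sha256 is a hypothetical helper name, and it assumes you copy the expected digest by hand from the checksum file published alongside the release assets.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Returns 0 when the SHA-256 digest of FILE equals EXPECTED_HEX, non-zero otherwise.
# (Hypothetical helper for illustration; requires bash and coreutils sha256sum.)
verify_sha256() {
    local file=$1 expected=$2 actual
    # sha256sum prints "<digest>  <file>"; keep only the digest field.
    actual=$(sha256sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}
```

A typical call would be: verify_sha256 snmp_exporter-0.20.0.linux-amd64.tar.gz followed by the digest from the release page, then && echo OK. The function only compares digests; it does not download anything.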
<2> Configure how the service is started
Configure the startup method that matches your OS version. Do not start the service yet, because snmp.yml has not been generated.
CentOS 7
    cat /usr/lib/systemd/system/snmp-exporter.service
    [Unit]
    Description=SNMP exporter
    Documentation=https://github.com/prometheus/snmp_exporter

    [Service]
    ExecStart=/data/snmp_exporter/snmp_exporter \
        --config.file=/data/snmp_exporter/snmp.yml \
        --web.listen-address=:9116 \
        --snmp.wrap-large-counters
    ExecReload=/bin/kill -HUP $MAINPID
    KillMode=process
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
Management:
    systemctl daemon-reload
    systemctl enable snmp-exporter
    systemctl restart snmp-exporter
    systemctl status snmp-exporter
    systemctl stop snmp-exporter
CentOS 6
    cat /etc/init.d/snmp_exporter
    #!/bin/bash
    # chkconfig: 2345 80 80
    # description: Start and Stop snmp_exporter

    # Source function library.
    . /etc/init.d/functions

    prog_name="snmp_exporter"
    # Full path to the binary (the tarball was unpacked into /data/snmp_exporter).
    prog_path="/data/${prog_name}/${prog_name}"
    pidfile="/var/run/${prog_name}.pid"
    prog_logs="/data/${prog_name}/${prog_name}.log"
    options="--config.file=/data/snmp_exporter/snmp.yml --web.listen-address=:9116 --snmp.wrap-large-counters"
    DESC="snmp_exporter"

    # Abort if the binary is missing or not executable.
    [ -x "${prog_path}" ] || exit 1

    RETVAL=0

    start(){
        action $"Starting $DESC..." su -s /bin/sh -c "nohup $prog_path $options >> $prog_logs 2>&1 &" 2> /dev/null
        RETVAL=$?
        # Record the PID so that stop() can find the process.
        PID=$(pidof ${prog_path})
        [ ! -z "${PID}" ] && echo ${PID} > ${pidfile}
        echo
        [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog_name
        return $RETVAL
    }

    stop(){
        echo -n $"Shutting down $prog_name: "
        killproc -p ${pidfile}
        RETVAL=$?
        echo
        [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog_name
        return $RETVAL
    }

    restart() {
        stop
        start
    }

    case "$1" in
        start)
            start
            ;;
        stop)
            stop
            ;;
        restart)
            restart
            ;;
        status)
            status $prog_path
            RETVAL=$?
            ;;
        *)
            echo $"Usage: $0 {start|stop|restart|status}"
            RETVAL=1
    esac
    exit $RETVAL
    ------------------------------------------------------------
    cat /etc/sysconfig/snmp_exporter
    ARGS=""
    ------------------------------------------------------------
Management:
    chmod +x /etc/init.d/snmp_exporter
    chkconfig snmp_exporter on
    /etc/init.d/snmp_exporter restart
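Download the MIBs and generate snmp.yml
MIB and OID
<1> Download the MIB package that matches your server model, and check which systems it is compatible with
https://www.dell.com/support/search/zh-cn#q=mibs&sort=relevancy&f:langFacet=[zh]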

    wget https://dl.dell.com/FOLDER0M/1/Dell-OM-MIBS-940_A00.zip
    unzip Dell-OM-MIBS-940_A00.zip
<2> View the OIDs
    snmptranslate -Tz -m /root/support/station/mibs/iDRAC-SMIv2.mib
    cp /usr/share/snmp/mibs/SNMPv2-SMI.txt /root/support/station/mibs/
<3> Generate snmp.yml
Official documentation: https://github.com/prometheus/snmp_exporter/tree/main/generator#file-format
    # Set environment variables
    export GO111MODULE=on
    export GOPROXY=https://mirrors.aliyun.com/goproxy/
    export MIBDIRS=/root/support/station/mibs/
    # Fetch the generator
    go get github.com/prometheus/snmp_exporter/generator
    cd ${GOPATH-$HOME/go}/pkg/mod/github.com/prometheus/snmp_exporter@v0.20.0/generator
    go build
    # Edit generator.yml (set community to the SNMP community string of your iDRAC)
    vim generator.yml
    modules:
      idrac:
        walk:
          - 1.3.6.1
        version: 2
        timeout: 30s
        auth:
          community: public
    # Generate the metric definitions
    ./generator generate
    cp -r snmp.yml /data/snmp_exporter/
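Walking the whole 1.3.6.1 tree works, but it pulls in far more than the hardware status objects. As a hedged alternative, the sketch below prints a generator.yml that walks only 1.3.6.1.4.1.674.10892.5, which is where I understand the iDRAC objects in Dell's MIB to live. Treat that OID as an assumption and confirm it against your own MIB with snmptranslate. print_generator_yml is a hypothetical helper name, and the layout follows the v0.20.0 file format used above.

```shell
# print_generator_yml COMMUNITY
# Prints a generator.yml with a single module named "idrac" to stdout.
# (Hypothetical helper; the narrowed walk OID is an assumption, adjust as needed.)
print_generator_yml() {
    local community=$1
    cat <<EOF
modules:
  idrac:
    walk:
      - 1.3.6.1.4.1.674.10892.5   # assumed iDRAC subtree under Dell's enterprise OID 674
    version: 2
    timeout: 30s
    auth:
      community: ${community}
EOF
}
```

Running print_generator_yml public > generator.yml and then ./generator generate would produce a smaller snmp.yml; if a metric you need is missing afterwards, widen the walk again.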
<4> Start snmp_exporter
    systemctl restart snmp-exporter        # CentOS 7
    /etc/init.d/snmp_exporter restart      # CentOS 6
<5> Test that metrics can be scraped
http://<snmp_exporter IP>:9116

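Notes:
Target: enter the IP of the server's remote management card (iDRAC). The IP of a NIC configured inside the server's operating system does not work.
Module: enter the SNMP module name, which is the key directly above walk in snmp.yml.
If some of your servers use a different SNMP community string, copy snmp.yml to a new file and change the community: xxx entry at the very end of that file.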
cat snmp.yml
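The same check can be made from the command line, since snmp_exporter serves probe results at /snmp with target and module passed as query parameters. This is a small sketch under assumptions: snmp_probe_url and probe_idrac are hypothetical helper names, and the addresses in the usage note are placeholders.

```shell
# snmp_probe_url EXPORTER TARGET MODULE
# Prints the probe URL for one target, e.g. EXPORTER=host:9116, MODULE=idrac.
snmp_probe_url() {
    local exporter=$1 target=$2 module=$3
    printf 'http://%s/snmp?target=%s&module=%s\n' "$exporter" "$target" "$module"
}

# probe_idrac EXPORTER TARGET MODULE
# Fetches the metrics for one iDRAC through snmp_exporter and prints them.
# -f makes curl fail on HTTP errors; --max-time bounds a slow SNMP walk.
probe_idrac() {
    curl -fsS --max-time 60 "$(snmp_probe_url "$1" "$2" "$3")"
}
```

For example, probe_idrac 10.0.0.5:9116 192.0.2.10 idrac | grep -c Status would count the status-like samples returned for that iDRAC; an empty or failing response usually points at the community string or at network reachability.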


Prometheus configuration
Configuring Prometheus
<1> Configure where alerting rules are loaded from
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "rule/*.yml"
      # - "second_rules.yml"
Create the directory for alerting rules and write the rule file inside it:
    mkdir rule
    vim rule/idrac.yml
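<2> Configure the job and choose which metrics to collect or exclude
Method 1: static_configs
    - job_name: 'IDRAC'
      scrape_interval: 180s   # interval between scrapes
      scrape_timeout: 180s    # scrape timeout
      static_configs:
        - targets:
            - 123.123.123.123   # iDRAC IP to monitor; the default SNMP port is 161
            # - 123.123.123.123:161   # a port can be appended if it is not the default
          # labels:   # add labels as needed, e.g. the internal IP for this iDRAC or its data center
          #   IP: 'xxx'
          #   project: 'xxx'
      metrics_path: /snmp
      params:
        module: [idrac]   # must match the module name in snmp.yml
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: xxxxx:9116   # your snmp_exporter server
Characteristics of this method: every server you want to monitor must be listed under targets. With several hundred servers, prometheus.yml becomes very long.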
Method 2: file_sd_configs
    - job_name: "IDRAC"
      params:
        module:
          - idrac
      scrape_interval: 180s
      scrape_timeout: 180s
      metrics_path: /snmp
      file_sd_configs:
        - files:
            - targets/*.json   # JSON files to read; any directory name works, but the directory must exist
          refresh_interval: 5m   # how often the files are reloaded
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: xxxx:9116   # your snmp_exporter server
Characteristics of this method: you create JSON files and list the monitored targets in them. The JSON format is as follows:
    cat targets/idrac.json
    [
      {
        "targets": ["123.123.123.123:161"],
        "labels": {
          "IP": "xxxx",
          "Project": "xxx"
        }
      },
      {
        "targets": ["123.123.123.124:161"],
        "labels": {
          "IP": "xxx",
          "Project": "xxx"
        }
      }
    ]
or
    [
      {
        "targets": ["123.123.123.123:161", "123.123.123.124:161"],
        "labels": {
          "IP": "xxxx",
          "Project": "xxx"
        }
      }
    ]
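Method 3: consul_sd_configs
This method registers the monitored targets in Consul, and Prometheus discovers them automatically through Consul. Consul itself is not covered in detail here. If you have never used Consul or configured Prometheus alerting, this method is not recommended for now, because it is hard to follow.
    - job_name: 'IDRAC'
      params:
        module:
          - idrac
      scrape_interval: 180s
      scrape_timeout: 180s
      metrics_path: /snmp
      consul_sd_configs:
        - server: 'monitor-consul.com:8500'   # domain name of your Consul service; an IP also works
          tag_separator: ','
          services: []
      relabel_configs:
        - source_labels: [__meta_consul_tags]
          regex: .*idrac.*   # keep only services whose Consul tags match this regex in this job
          action: keep
        # Consul Meta key "eth-ip"; Prometheus replaces characters that are invalid in
        # label names with underscores, hence eth_ip. Shown under Prometheus -> Targets -> IDRAC -> Endpoint.
        - source_labels: ['__meta_consul_service_metadata_eth_ip']
          target_label: __param_target
        - source_labels: ['__meta_consul_service_address']
          target_label: instance
        - target_label: __address__
          replacement: xxx:9116
Characteristics of this method: services must be registered in Consul, and there are two ways to register them, static and file-based. A JSON example follows; write your own as needed. Labels are arbitrary, but they should match the keywords of your DingTalk alert group and your Alertmanager configuration. The # annotations below are explanatory only; JSON does not allow comments, so remove them before registering.
    cat consul-idrac.json
    {
      "ID": "IDRAC-xxx",
      "Name": "IDRAC-xxx",
      "Tags": ["idrac"],
      "Address": "xxx",        # iDRAC IP
      "Meta": {                # Consul metadata, later rewritten into Prometheus labels
        "eth-ip": "xxx",       # server business IP
        "project": "beijing"   # location
      },
      "EnableTagOverride": false,
      "Check": {
        "HTTP": "http://xxxx:9116/metrics",   # health check against your snmp_exporter IP and port
        "Interval": "10s"
      },
      "Weights": {
        "Passing": 10,
        "Warning": 1
      }
    }
Note: the health check targets snmp_exporter itself, so Consul reports the service as healthy even if the Address and other fields above are wrong. This does not affect monitoring by Prometheus: once the service is registered, Prometheus only reads the service values and labels from Consul and then monitors according to its own configuration. For SNMP, the second JSON form below is the better fit.
or
    cat consul-idrac2.json
    {
      "ID": "IDRAC-xxx",
      "Name": "IDRAC-xxx",
      "Tags": ["idrac"],
      "Address": "xxx:161",
      "Meta": {                # Consul metadata, later rewritten into Prometheus labels
        "eth-ip": "xxx",       # server business IP
        "project": "beijing"   # location
      }
    }
Register:
    curl --request PUT --data @consul-idrac.json "http://monitor-consul.com:8500/v1/agent/service/register?replace-existing-checks=1"
Deregister:
    curl -X PUT http://monitor-consul.com:8500/v1/agent/service/deregister/IDRAC-xxx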
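When many iDRACs have to be registered, writing one JSON file per host by hand gets tedious. The sketch below prints a comment-free registration payload of the second form from its arguments. print_consul_service is a hypothetical helper name, and it assumes the values contain no quotes or backslashes, because nothing is JSON-escaped.

```shell
# print_consul_service ID IDRAC_ADDR ETH_IP PROJECT
# Prints a Consul service registration payload (valid JSON, no comments) to stdout.
# (Hypothetical helper; field names mirror the consul-idrac2.json example.)
print_consul_service() {
    local id=$1 addr=$2 eth_ip=$3 project=$4
    cat <<EOF
{
  "ID": "${id}",
  "Name": "${id}",
  "Tags": ["idrac"],
  "Address": "${addr}",
  "Meta": {
    "eth-ip": "${eth_ip}",
    "project": "${project}"
  }
}
EOF
}
```

The output can be piped straight into the register call, for example: print_consul_service IDRAC-demo 192.0.2.10:161 10.0.0.10 beijing | curl --request PUT --data @- http://monitor-consul.com:8500/v1/agent/service/register, where the addresses are placeholders.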
Result:

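Alerting rule configuration
Base the rules on the metrics listed in your snmp.yml. Not every metric listed there is actually usable, so search for each one in Prometheus first.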


    cat rule/idrac.yml
    groups:
    - name: IDRAC-physical-server-hardware-status
      rules:
      - alert: IDRACStatus
        expr: up{job=~"IDRAC.*"} == 0
        for: 1m
        labels:
          status: error
        annotations:
          description: "{{$labels.instance}} IDRAC is unreachable"
      - alert: ChassisStatus
        expr: chassisStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall chassis component status is abnormal, please check promptly!"
          description: "{{$labels.instance}} chassis components abnormal"
      - alert: CMOSBatteryStatus
        expr: systemBatteryStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall chassis CMOS battery status is abnormal, please check promptly!"
          description: "{{$labels.instance}} chassis CMOS battery status abnormal"
      - alert: MemoryDeviceStatus
        expr: memoryDeviceStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Memory module status is abnormal, please check promptly!"
          description: "{{$labels.instance}} memory module {{$labels.memoryDeviceIndex}} abnormal"
      - alert: ProcessorStatus
        expr: processorDeviceStatusStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall CPU status is abnormal, please check promptly!"
          description: "{{$labels.instance}} CPU {{$labels.processorDeviceStatusIndex}} abnormal"
      - alert: NetworkDeviceStatus
        expr: networkDeviceStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          description: "{{$labels.instance}} NIC {{$labels.networkDeviceIndex}} abnormal"
      - alert: PowerSupplyStatus
        expr: powerSupplyStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall power supply status is abnormal, please check promptly!"
          description: "{{$labels.instance}} power supply {{$labels.powerSupplyIndex}} status abnormal"
      - alert: StorageControllerStatus
        expr: globalStorageStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Storage controller status is abnormal, please check promptly!"
          description: "{{$labels.instance}} storage controller abnormal"
      - alert: SystemStatus
        expr: globalSystemStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall physical system component status is abnormal, please check promptly!"
          description: "{{$labels.instance}} physical system components abnormal"
      - alert: PhysicalDiskState
        expr: physicalDiskState != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Physical disk state is abnormal, please check promptly!"
          description: "{{$labels.instance}} physical disk {{$labels.physicalDiskNumber}} abnormal"
      - alert: VirtualDiskState
        expr: virtualDiskState != 2
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Virtual disk state is abnormal, please check promptly!"
          description: "{{$labels.instance}} virtual disk {{$labels.virtualDiskNumber}} abnormal"
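Most of the rules above compare a status metric against 3. As far as I recall, the *Status objects in the iDRAC MIB share one status enumeration in which 3 means ok; the sketch below maps those codes to names so that a raw value seen in Prometheus is easier to read. Both the helper name object_status_name and the code table are assumptions to verify against your MIB (for example with snmptranslate -Td). physicalDiskState and virtualDiskState use their own enumerations, which is why the virtual disk rule compares against 2 instead.

```shell
# object_status_name CODE
# Prints the symbolic name for a Dell status code.
# (Assumed mapping, recalled from the iDRAC MIB's status enumeration; verify locally.)
object_status_name() {
    case $1 in
        1) echo other ;;
        2) echo unknown ;;
        3) echo ok ;;
        4) echo nonCritical ;;
        5) echo critical ;;
        6) echo nonRecoverable ;;
        *) echo "unrecognized($1)" ;;
    esac
}
```

For instance, if powerSupplyStatus shows 5 for one power supply, object_status_name 5 would print critical under this mapping.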
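Reload Prometheus
    # xxxx is the Prometheus IP; this endpoint only works when Prometheus runs with --web.enable-lifecycle
    curl -X POST http://xxxx:9090/-/reload
To actually receive alerts you also need to configure the alerting component Alertmanager and the DingTalk plugin prometheus-webhook-dingtalk, and add a bot to your DingTalk group. The alerting flow is not demonstrated here.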
Supplement
Published by: 全栈程序员-站长. When reposting, please cite the source: https://javaforall.net/215986.html Original link: https://javaforall.net
