Hotspare copyback


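During a recent hardware inspection we found bad blocks on some of the disks, and a colleague helped coordinate the replacement. Later that day the on-site engineer called to ask whether the disks could be swapped. A quick check confirmed they belonged to a RAID 5 array, so they had to be replaced one at a time.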

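The first disk to replace was in slot 1, that is, the second disk. After a brief confirmation the engineer said the swap was done, but when I checked with MegaCli the result looked odd.
The output is below: one disk's firmware state shows Unconfigured and another shows Rebuild.
# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL|grep "Firmware state"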
Firmware state: Online, Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild
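Thinking in RAID 10 terms, the rebuilding disk had to be the one in slot 11, so my first reaction was that the wrong disk had been swapped. I told my colleague, in a rather stern tone, to double-check whether a mistake had been made.
He then sent a photo of the machine on site; I stared at it for quite a while and still could not tell.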
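Counting lines in the grep output to work out which slot is which is error-prone. A minimal sketch, assuming the full `-PDList` output contains `Slot Number:` and `Firmware state:` lines in the form shown elsewhere in this post, pairs each slot with its state. The function name `pd_states` is made up for illustration, and the sample input below is a trimmed stand-in for real MegaCli output.

```bash
#!/usr/bin/env bash
# pd_states: read "MegaCli64 -PDList" style text on stdin and print
# one "slot N: <firmware state>" line per physical drive.
# (pd_states is a hypothetical helper name, not part of MegaCli.)
pd_states() {
  awk '
    /^Slot Number:/    { slot = $3 }                 # remember the current slot
    /^Firmware state:/ {
      sub(/^Firmware state: */, "")                  # keep only the state text
      print "slot " slot ": " $0
    }'
}

# Demo on a trimmed sample; on a real host the input would come from
#   /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | pd_states
pd_states <<'EOF'
Enclosure Device ID: 32
Slot Number: 1
Firmware state: Unconfigured(good), Spun Up
Enclosure Device ID: 32
Slot Number: 11
Firmware state: Rebuild
EOF
```

With the slot printed next to each state, it is immediately visible that slot 1 is the unconfigured disk and slot 11 is the one rebuilding.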
Checking with MegaCli confirmed that it really was the disk in slot 11 that was rebuilding.
# /opt/MegaRAID/MegaCli/MegaCli64 -pdrbld -showprog -physdrv[32:11] -aALL                                    
Rebuild Progress on Device at Enclosure 32, Slot 11 Completed 31% in 40 Minutes.
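To give the on-site engineer a rough idea of how long the rebuild still needs, the progress line can be turned into an estimate. This is only a sketch: it assumes the line has exactly the `Completed N% in M Minutes` shape shown above and that progress is linear, which real rebuilds may not be. The name `rebuild_eta` is invented for this example.

```bash
#!/usr/bin/env bash
# rebuild_eta: read a "Rebuild Progress ... Completed N% in M Minutes." line
# on stdin and print a linear estimate of the remaining minutes.
# (rebuild_eta is a hypothetical helper name.)
rebuild_eta() {
  awk '/Rebuild Progress/ {
    for (i = 1; i <= NF; i++) {
      if ($i == "Completed") {
        pct  = $(i + 1) + 0     # "31%" -> 31 (awk keeps the numeric prefix)
        mins = $(i + 3) + 0     # "40"  -> 40
      }
    }
    if (pct > 0)
      printf "%d%% done, about %d minutes remaining\n", pct, mins * (100 - pct) / pct
  }'
}

# Demo with the line from this post:
echo 'Rebuild Progress on Device at Enclosure 32, Slot 11 Completed 31% in 40 Minutes.' | rebuild_eta
# prints: 31% done, about 89 minutes remaining
```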
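If the wrong disk had been swapped, we would have to wait for the rebuild to finish and then redo the replacement. While discussing with him what to do about the extra disk, I kept wondering what the Unconfigured firmware state meant.
I sent the state to a colleague on the systems team. He said that a disk's state changes after a RAID 0 is deleted, and that this state only indicates the hot spare is healthy. As for how to fix it, this operation is rarely done; the disk can also be added dynamically from the command line, but none of us felt confident about it.
Another colleague took a look and, once he understood the whole story, said this was normal. The disk in slot 11 is a hot spare, so once the rebuild completes, a copyback starts and copies the data to the disk in slot 1. When that sync finishes, the slot 11 disk returns to the hotspare state.
So the on-site colleague had not swapped the wrong disk after all, and I felt rather guilty. I waited a little and checked again; sure enough, the state had changed.
# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL|grep "Firmware state"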
Firmware state: Online, Spun Up
Firmware state: Copyback
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

While the copyback was running, the details of the disk in slot 1 were as follows:
Enclosure Device ID: 32
Slot Number: 1
Enclosure position: N/A
Device Id: 1
WWN: 5000C50088A47D70
Sequence Number: 8
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Copyback
Device Firmware Level: ES66
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c50088a47d71
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE STSS     ES666SLA5XVH            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :54C (129.20 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No
At this point MegaCli can show the rebuild progress, but the copyback progress is much harder to get. After some searching, I found that an expert had already written a small script for exactly this.
https://fritshoogland.wordpress.com/2013/05/03/watching-the-copyback-progress-of-a-new-disk-on-an-exadata-compute-node/
It can be used as is.
while true; do /opt/MegaRAID/MegaCli/MegaCli64 adpeventlog getlatest 200 -f ~/adpeventlog.txt a0; awk '/^Time/{TIME=$0};/Seconds/{SECS=$5};/^Event Desc/{printf("%25.25s %5.5s %s\n",TIME,SECS,$0);TIME=" ";SECS=""}' ~/adpeventlog.txt|grep -v fan|tac; sleep 5; done
The monitoring output looks like this, which made it easy to give the on-site colleague timely progress updates.
Time: Tue Nov 24 13:42:30       Event Description: CopyBack progress on PD 01(e0x20/s1) is 35.98%(1948s)
Time: Tue Nov 24 13:43:31       Event Description: CopyBack progress on PD 01(e0x20/s1) is 36.98%(2009s)
Time: Tue Nov 24 13:44:26       Event Description: CopyBack progress on PD 01(e0x20/s1) is 37.97%(2064s)
Time: Tue Nov 24 13:45:21       Event Description: CopyBack progress on PD 01(e0x20/s1) is 38.97%(2119s)
Time: Tue Nov 24 13:46:16       Event Description: CopyBack progress on PD 01(e0x20/s1) is 39.97%(2174s)
Time: Tue Nov 24 13:47:13       Event Description: CopyBack progress on PD 01(e0x20/s1) is 40.97%(2231s)
Time: Tue Nov 24 13:48:10       Event Description: CopyBack progress on PD 01(e0x20/s1) is 41.97%(2288s)
Time: Tue Nov 24 13:49:05       Event Description: CopyBack progress on PD 01(e0x20/s1) is 42.97%(2343s)
Time: Tue Nov 24 13:50:00       Event Description: CopyBack progress on PD 01(e0x20/s1) is 43.97%(2398s)
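Each progress event carries both a percentage and the elapsed seconds, so a rough remaining time can be derived from the most recent line. The sketch below assumes the event text keeps the `CopyBack progress on PD ... is N%(Ss)` shape shown above and that the newest event is the last line of the input; `copyback_eta` is a made-up name. The estimate is linear, so treat it as a rough guide only.

```bash
#!/usr/bin/env bash
# copyback_eta: read monitoring output on stdin, take the last
# "CopyBack progress" event and print a linear estimate of the time left.
# (copyback_eta is a hypothetical helper name.)
copyback_eta() {
  # Extract "<percent> <elapsed seconds>" from each progress event.
  sed -n 's/.*CopyBack progress on PD .* is \([0-9.]*\)%(\([0-9]*\)s).*/\1 \2/p' |
    tail -n 1 |
    awk '$1 > 0 {
      printf "%.2f%% done, about %d seconds remaining\n", $1, $2 * (100 - $1) / $1
    }'
}

# Demo with two events copied from the output above:
copyback_eta <<'EOF'
Time: Tue Nov 24 13:49:05       Event Description: CopyBack progress on PD 01(e0x20/s1) is 42.97%(2343s)
Time: Tue Nov 24 13:50:00       Event Description: CopyBack progress on PD 01(e0x20/s1) is 43.97%(2398s)
EOF
```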
After it finished, I checked the firmware state again: the disk in slot 11 was now in the hotspare state.
# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL|grep "Firmware state"
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Hotspare, Spun Up
Running the copyback monitoring script again shows more detail.
Time: Tue Nov 24 14:47:59       Event Description: CopyBack progress on PD 01(e0x20/s1) is 99.94%(5877s)
Time: Tue Nov 24 14:48:02       Event Description: CopyBack complete on PD 01(e0x20/s1) from PD 0b(e0x20/s11)
Time: Tue Nov 24 14:48:03       Event Description: State change on PD 01(e0x20/s1) from COPYBACK(20) to ONLINE(18)
Time: Tue Nov 24 14:48:03       Event Description: Dedicated Hot Spare created on PD 0b(e0x20/s11) (ded,rev,ea,ac=1)
Time: Tue Nov 24 14:48:03       Event Description: State change on PD 0b(e0x20/s11) from ONLINE(18) to HOT SPARE(2)
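The state-change events are the quickest way to confirm that the disks ended up where expected. As an illustration only, assuming the event text keeps the `State change on PD xx(e0x20/sN) from A(n) to B(m)` shape shown above, the transitions can be condensed to one short line per slot. `pd_transitions` is a hypothetical name.

```bash
#!/usr/bin/env bash
# pd_transitions: read event-log lines on stdin and print
# "slot N: OLD -> NEW" for every "State change on PD" event.
# (pd_transitions is a hypothetical helper name.)
pd_transitions() {
  # In a basic regular expression, plain ( and ) are literal characters,
  # while \( \) mark the captured groups: slot, old state, new state.
  sed -n 's|.*State change on PD [0-9a-f]*(e0x[0-9a-f]*/s\([0-9]*\)) from \(.*\)([0-9]*) to \(.*\)([0-9]*).*|slot \1: \2 -> \3|p'
}

# Demo with the two state-change events from the log above:
pd_transitions <<'EOF'
Time: Tue Nov 24 14:48:03       Event Description: State change on PD 01(e0x20/s1) from COPYBACK(20) to ONLINE(18)
Time: Tue Nov 24 14:48:03       Event Description: State change on PD 0b(e0x20/s11) from ONLINE(18) to HOT SPARE(2)
EOF
# prints:
#   slot 1: COPYBACK -> ONLINE
#   slot 11: ONLINE -> HOT SPARE
```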

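With copyback configured, when a new, healthy drive is inserted into slot 5, the hot spare in slot 12 syncs its data to slot 5. Importantly, while this sync is running, any data written to the VD is applied to both slot 5 and slot 12. Once the copyback completes, the drive in slot 12 can continue to be used. The original array layout and the DG remain unchanged.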

In my colleague's words, RAID really is an intricate, complex little system with a great deal inside it.

Source: "ITPUB Blog", link: http://blog.itpub.net//viewspace-/. If you repost this article, please credit the source; otherwise legal action will be pursued.