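At around 10:00 on July 30, the site reported that a set of S12508 devices had rebooted abnormally at around 5:00 that morning. Monitoring showed two sharp drops in traffic. After the boards rebooted automatically and recovered, services returned to normal.
The diagnostic information and logfile collected after recovery show that the fault was caused by the main processing unit (MPU) in chassis 1 slot 0. The fault sequence was as follows:
In the diagnostic records, the MPU in chassis 1 slot 0 stopped producing any records at around Jul 30 05:47, which indicates that its CPU failed at that time. The MPU in chassis 1 slot 1 also logged inter-board communication failures with chassis 1 slot 0, which corroborates this: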
%@3600^Jul 30 05:47:51:855 2023 DC1-SRV-3D03-S12508-D-01 DIAG/3/ERR: -Chassis=1-Slot=1.1; f18830f [316]: Cioctl failed!, p1=2, p2=0, p3=648, p4=0x10027bc4, p5=1073807361
%@3601^Jul 30 05:47:54:914 2023 DC1-SRV-3D03-S12508-D-01 DEVD/3/DRV_DEV_DIAG_ERR_INFO: 0xf110101 [650]: DEVD: connect socket failed! maybe server not ready. code=-112, port=17017, lip=2048
%@3602^Jul 30 05:47:54:914 2023 DC1-SRV-3D03-S12508-D-01 DEVD/3/DRV_DEV_DIAG_ERR_INFO: 0xf110101 [933]: DEVD: LIPC connection invalid, chassis=1, slot=0, cpu=0, dstslot=0.
%@3603^Jul 30 05:47:54:914 2023 DC1-SRV-3D03-S12508-D-01 DEVD/3/DRV_DEV_DIAG_ERR_INFO: 0xf110602 [544]: IPC send fail. p1=0, p2=428, p4=1073807361
%@3604^Jul 30 05:47:54:914 2023 DC1-SRV-3D03-S12508-D-01 DEVD/3/DRV_DEV_DIAG_ERR_INFO: 0xf110601 [1431]: Failed to backup information.
%@3605^Jul 30 05:47:54:914 2023 DC1-SRV-3D03-S12508-D-01 DEVD/3/DRV_DEV_DIAG_ERR_INFO: f118505 [1316]: Failed to backup power info! p1=1073807361
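Log lines in this format can be split into fields so they can be filtered by module or time. The sketch below is a minimal parser whose pattern is inferred only from the samples quoted in this case (both `^` and `%` appear after the sequence number). `LOG_RE` and `parse_log_line` are hypothetical names, not part of any vendor tool.

```python
import re
from datetime import datetime

# Pattern inferred from the log samples in this case, not from a format specification.
LOG_RE = re.compile(
    r"^%@(?P<seq>\d+)[\^%]"                              # sequence number, then '^' or '%'
    r"(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d:\d{3} \d{4}) "    # e.g. Jul 30 05:47:51:855 2023
    r"(?P<host>\S+) "
    r"(?P<module>\w+)/(?P<severity>\d)/(?P<mnemonic>\w+): "
    r"(?P<text>.*)$"
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["seq"] = int(rec["seq"])
    rec["severity"] = int(rec["severity"])
    # %f accepts the three-digit millisecond field
    rec["ts"] = datetime.strptime(rec["ts"], "%b %d %H:%M:%S:%f %Y")
    return rec

sample = ("%@3602^Jul 30 05:47:54:914 2023 DC1-SRV-3D03-S12508-D-01 "
          "DEVD/3/DRV_DEV_DIAG_ERR_INFO: 0xf110101 [933]: DEVD: LIPC connection invalid, "
          "chassis=1, slot=0, cpu=0, dstslot=0.")
rec = parse_log_line(sample)
print(rec["module"], rec["severity"], rec["ts"].isoformat())
```

Lines that do not match, such as the `display kernel reboot` output, return `None` and can be skipped.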
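Because the CPU in chassis 1 slot 0 had failed, it could not send or receive IRF hello packets or any other control packets. After 20 s, chassis 2 timed out waiting for the IRF handshake packets from the master in chassis 1, which triggered an IRF split: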
%@1965%Jul 30 05:47:51:777 2023 DC1-SRV-3D03-S12508-D-01 STM/2/STM_LINK_STATUS_TIMEOUT: IRF port 1 is down because heartbeat timed out.
%@1966%Jul 30 05:47:51:777 2023 DC1-SRV-3D03-S12508-D-01 STM/3/STM_LINK_STATUS_DOWN: IRF port 1 is down.
===============display kernel reboot 20 verbose chassis 1 slot 1===============
--------------------- Reboot record 1 ---------------------
Recorded at : 2023-07-30 05:52:27.123444
Occurred at : 2023-07-30 05:47:31.261599
Reason : 0x10a26
Thread : swapper (TID: 0)
Context : irq context
Chassis : 1
Target Chassis : 1
Slot : 0
Target Slot : 0
Cpu : 1
VCPU ID : 0
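As a quick check of the roughly 20 s timeout, the gap between the `Occurred at` time in reboot record 1 and the heartbeat-timeout log entry can be computed directly. This assumes that both timestamps come from clocks that agree and that `Occurred at` marks the moment the slot 0 CPU failed. The variable names are illustrative.

```python
from datetime import datetime

# "Occurred at" from reboot record 1 (chassis 1 slot 0)
fault = datetime.strptime("2023-07-30 05:47:31.261599", "%Y-%m-%d %H:%M:%S.%f")
# Timestamp of the STM_LINK_STATUS_TIMEOUT log entry
timeout = datetime.strptime("Jul 30 05:47:51:777 2023", "%b %d %H:%M:%S:%f %Y")

gap = (timeout - fault).total_seconds()
print(f"heartbeat timeout fired {gap:.3f} s after the fault")
```

The gap comes to about 20.5 s, which is consistent with the 20 s timeout described above.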
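At the same time, the other boards in chassis 1 also stopped receiving handshake packets from slot 0, so all boards in chassis 1 rebooted on handshake timeout. About 1 minute later an active/standby switchover occurred, and chassis 1 slot 1 became the global master: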
%@1969%Jul 30 05:48:14:153 2023 DC1-SRV-3D03-S12508-D-01 DEV/5/BOARD_REBOOT: -Chassis=1-Slot=9; Board is rebooting on chassis 1 slot 9.
%@1970%Jul 30 05:48:13:678 2023 DC1-SRV-3D03-S12508-D-01 DEV/5/BOARD_REBOOT: -Chassis=1-Slot=2; Board is rebooting on chassis 1 slot 2.
%@1971%Jul 30 05:48:39:861 2023 DC1-SRV-3D03-S12508-D-01 HA/5/HA_STANDBY_TO_MASTER: Standby board in chassis 1 slot 1 changed to master.
%@1972%Jul 30 05:48:39:869 2023 DC1-SRV-3D03-S12508-D-01 STM/3/STM_LINK_STATUS_DOWN: IRF port 1 is down.
--------------------- Reboot record 2 ---------------------
Recorded at : 2023-07-30 05:52:17.348169
Occurred at : 2023-07-30 05:48:15.729168
Reason : 0x5000311
Thread : devd (TID: 142)
Context : thread context
Chassis : 1
Target Chassis : 1
Slot : 5
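At around 05:51:28 the IRF board finished starting up and the IRF port came up. Communication with chassis 2 was normal from that point. Chassis 2 rebooted and joined chassis 1 as the standby, after which everything returned to normal.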
%@2478%Jul 30 05:51:28:238 2023 DC1-SRV-3D03-S12508-D-01 STM/6/STM_LINK_STATUS_UP: IRF port 1 is up.
%@2479%Jul 30 05:51:28:847 2023 DC1-SRV-3D03-S12508-D-01 STM/4/STM_LINK_RECOVERY: Merge occurs.
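The recovery sequence can be laid out as offsets from the fault. The sketch below uses only timestamps quoted in this case and takes the `Occurred at` time of reboot record 1 as the starting point, which is an assumption about when the slot 0 CPU failed.

```python
from datetime import datetime

LOG_FMT = "%b %d %H:%M:%S:%f %Y"

# Assumed fault time: "Occurred at" in reboot record 1
fault = datetime(2023, 7, 30, 5, 47, 31, 261599)

# (event, timestamp as printed in the log entries)
events = [
    ("slot 1 becomes master", "Jul 30 05:48:39:861 2023"),
    ("IRF port 1 up",         "Jul 30 05:51:28:238 2023"),
    ("IRF merge",             "Jul 30 05:51:28:847 2023"),
]

# Seconds elapsed from the assumed fault time to each event
offsets = {
    name: (datetime.strptime(ts, LOG_FMT) - fault).total_seconds()
    for name, ts in events
}
for name, secs in offsets.items():
    print(f"{name:<22} +{secs:7.1f} s")
```

Under that assumption, the switchover comes a little over a minute after the fault and the merge just under four minutes after it.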
--------------------- Reboot record 3 ---------------------
Recorded at : 2023-07-30 05:58:52.693057
Occurred at : 2023-07-30 05:51:32.794750
Reason : 0x22000a11
Thread : PTMR (TID: 58)
Context : thread context
Chassis : 2
Target Chassis : 2
Slot : 5
Target Slot : 5
Cpu : 0
VCPU ID : 0
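To line the reboot records up against the log, the `display kernel reboot` output can be parsed into dictionaries and sorted by `Occurred at`. This is a minimal sketch based only on the record layout shown here. `parse_reboot_records` is a hypothetical helper, and the sample is abbreviated to a few fields per record.

```python
import re
from datetime import datetime

HEADER_RE = re.compile(r"^-+ Reboot record (\d+) -+$")
TS_FMT = "%Y-%m-%d %H:%M:%S.%f"

def parse_reboot_records(text):
    """Collect 'Key : Value' lines into one dict per reboot record."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        m = HEADER_RE.match(line)
        if m:
            records.append({"record": int(m.group(1))})
        elif records and " : " in line:
            key, _, value = line.partition(" : ")
            if key in ("Recorded at", "Occurred at"):
                value = datetime.strptime(value, TS_FMT)
            records[-1][key] = value
    return records

# Abbreviated copy of records 1 and 3 from this case
sample = """\
--------------------- Reboot record 1 ---------------------
Recorded at : 2023-07-30 05:52:27.123444
Occurred at : 2023-07-30 05:47:31.261599
Reason : 0x10a26
Thread : swapper (TID: 0)
Chassis : 1
Slot : 0
--------------------- Reboot record 3 ---------------------
Recorded at : 2023-07-30 05:58:52.693057
Occurred at : 2023-07-30 05:51:32.794750
Reason : 0x22000a11
Thread : PTMR (TID: 58)
Chassis : 2
Slot : 5
"""

records = sorted(parse_reboot_records(sample), key=lambda r: r["Occurred at"])
for rec in records:
    print(rec["Occurred at"].time(), f"chassis {rec['Chassis']} slot {rec['Slot']}",
          rec["Thread"], rec["Reason"])
```

Values other than the two timestamps are kept as strings, so a record that is cut short in the output still parses with whatever fields it has.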
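In summary, a CPU hardware fault on the MPU in chassis 1 slot 0 caused the abnormal IRF split and merge, which disrupted services. We recommend replacing the board in chassis 1 slot 0.
Return the chassis 1 slot 0 board for repair.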