Handling common Ceph ERROR alerts

1. Category: scrub errors, pg inconsistent

# Symptom:
root@ceph01:~# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 5.33 is active+clean+inconsistent, acting [79,80,34]

# Cause:
    Most likely an earlier failure left the replicas inconsistent. The alert is raised once data on the affected OSD is touched again; as long as nothing lands there, the inconsistency goes unnoticed. Repair it promptly once it is detected.
    Ceph also runs periodic deep scrubs; if the inconsistency cannot be self-healed, the error is raised.
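
# Optional pre-check:
    The commands below are only a sketch of how one might inspect the damage before repairing; the pg id 5.33 is taken from the health output above, and rados list-inconsistent-obj returns data only for a pg that has recently been deep-scrubbed.
    # list the objects/shards flagged by the last deep-scrub
    root@ceph01:~# rados list-inconsistent-obj 5.33 --format=json-pretty
    # show the acting set and the scrub timestamps of the pg
    root@ceph01:~# ceph pg 5.33 query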

# Check the log of the corresponding OSD
root@hkhdd001:~# systemctl status ceph-osd@126.service 
● ceph-osd@126.service - Ceph object storage daemon osd.126
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Wed 2023-02-22 18:22:11 HKT; 2 months 18 days ago
    Process: 4145337 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 126 (code=exited, status=0/SUCCESS)
   Main PID: 4145346 (ceph-osd)
      Tasks: 61
     Memory: 11.4G
        CPU: 3d 2h 55min 2.691s
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@126.service
             └─4145346 /usr/bin/ceph-osd -f --cluster ceph --id 126 --setuser ceph --setgroup ceph

May 12 08:22:46 hkhdd001 sudo[2570878]: pam_unix(sudo:session): session closed for user root
May 12 08:22:47 hkhdd001 sudo[2570898]:     ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl -x --json=o /dev/sdf
May 12 08:22:47 hkhdd001 sudo[2570898]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64045)
May 12 08:22:48 hkhdd001 sudo[2570898]: pam_unix(sudo:session): session closed for user root
May 12 08:22:49 hkhdd001 sudo[2570901]:     ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/nvme st10000ne000-3ap101 smart-log-add --json /dev/sdf
May 12 08:22:49 hkhdd001 sudo[2570901]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64045)
May 12 08:22:49 hkhdd001 sudo[2570901]: pam_unix(sudo:session): session closed for user root
May 12 13:33:05 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:33:05.831+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc shard 146 soid 5:33551cd7:::rbd_data.2e64f3c9858639.000000000003a562:head : candidate >
May 12 13:47:29 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:47:29.097+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc deep-scrub 0 missing, 1 inconsistent objects
May 12 13:47:29 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:47:29.097+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc deep-scrub 1 errors
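
# If systemctl status only shows the tail of the log, the full OSD journal can be searched as well (a sketch; the osd id and date are just the ones from this example):
    root@hkhdd001:~# journalctl -u ceph-osd@126.service --since "2023-05-12" | grep 'log \[ERR\]'
    # on a monitor node, the cluster log usually carries the same entries
    root@ceph01:~# grep '\[ERR\]' /var/log/ceph/ceph.log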

# Fix:
    root@ceph01:~# ceph pg repair 5.33
    instructing pg 5.33 on osd.79 to repair
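
# The repair runs asynchronously; a sketch of how one might watch it finish:
    # the pg shows a ...+repair state while working, then returns to active+clean
    root@ceph01:~# ceph pg ls inconsistent
    # once no inconsistent pg is left, the cluster goes back to HEALTH_OK
    root@ceph01:~# ceph health detail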

# Alternative approach:
    systemctl restart ceph-osd@79.service
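
    After the restart, a fresh deep-scrub can be triggered so the pg re-verifies its replicas (a sketch, reusing pg 5.33 from above):
    root@ceph01:~# ceph pg deep-scrub 5.33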

2. Category: Module 'devicehealth' has failed: disk I/O error

# Symptom:
root@A1:~# ceph -s
  cluster:
    id:     c9732c40-e843-4865-8f73-9e61551c993d
    health: HEALTH_ERR
            Module 'devicehealth' has failed: disk I/O error

root@A1:~# ceph health detail
HEALTH_ERR Module 'devicehealth' has failed: disk I/O error; 1 mgr modules have recently crashed
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed: disk I/O error
    Module 'devicehealth' has failed: disk I/O error

# Fix:
    Create a new mgr and let PVE automatically create the .mgr pool; once the pool exists, the ERR alert disappears. A sketch of the commands follows below.
    Note: the name of this auto-created health-metrics pool differs between Ceph versions (older releases use device_health_metrics, newer ones use .mgr).
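
# A sketch of the corresponding commands (assumes the PVE pveceph tooling; create the mgr on a node that does not already host one):
    root@A1:~# pveceph mgr create
    # verify that the health-metrics pool now exists
    root@A1:~# ceph osd pool ls
    # after health recovers, archive the old crash report so "1 mgr modules have recently crashed" clears as well
    root@A1:~# ceph crash ls
    root@A1:~# ceph crash archive-all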


3. Category: osd full

Reference article: How to recover a Ceph cluster whose service stopped because of full OSDs
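
# First-response commands (a sketch, not taken from the referenced article; raising the full ratio is only a temporary measure while data is deleted or capacity is added):
    # find the full/near-full OSDs
    root@ceph01:~# ceph osd df tree
    # temporarily raise the full threshold (default 0.95) so client I/O can resume
    root@ceph01:~# ceph osd set-full-ratio 0.97
    # push data away from the most utilised OSDs
    root@ceph01:~# ceph osd reweight-by-utilization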

