Common Ceph ERROR alerts and how to handle them

1. Category: scrub errors, pg inconsistent

# Symptom:
root@ceph01:~# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 5.33 is active+clean+inconsistent, acting [79,80,34]

# Cause:
    Most likely a previous failure left one replica of the data inconsistent. The error is only raised when data on the affected OSD is touched (new writes or a scrub); until then the inconsistency goes unnoticed. Repair it promptly once it is detected.
    Ceph also runs deep scrubs periodically; if the inconsistency cannot heal on its own, this alert is raised.
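
# Optional: locate the inconsistent objects before repairing. A minimal sketch using the standard rados subcommands; the pool name below is a placeholder, use the pool that pg 5.33 belongs to:
    # list PGs with scrub inconsistencies in the given pool
    rados list-inconsistent-pg <pool-name>
    # show which objects/shards inside the PG are inconsistent
    rados list-inconsistent-obj 5.33 --format=json-pretty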

# Check the log of the corresponding OSD
root@hkhdd001:~# systemctl status ceph-osd@126.service 
● ceph-osd@126.service - Ceph object storage daemon osd.126
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Wed 2023-02-22 18:22:11 HKT; 2 months 18 days ago
    Process: 4145337 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 126 (code=exited, status=0/SUCCESS)
   Main PID: 4145346 (ceph-osd)
      Tasks: 61
     Memory: 11.4G
        CPU: 3d 2h 55min 2.691s
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@126.service
             └─4145346 /usr/bin/ceph-osd -f --cluster ceph --id 126 --setuser ceph --setgroup ceph

May 12 08:22:46 hkhdd001 sudo[2570878]: pam_unix(sudo:session): session closed for user root
May 12 08:22:47 hkhdd001 sudo[2570898]:     ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl -x --json=o /dev/sdf
May 12 08:22:47 hkhdd001 sudo[2570898]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64045)
May 12 08:22:48 hkhdd001 sudo[2570898]: pam_unix(sudo:session): session closed for user root
May 12 08:22:49 hkhdd001 sudo[2570901]:     ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/nvme st10000ne000-3ap101 smart-log-add --json /dev/sdf
May 12 08:22:49 hkhdd001 sudo[2570901]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64045)
May 12 08:22:49 hkhdd001 sudo[2570901]: pam_unix(sudo:session): session closed for user root
May 12 13:33:05 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:33:05.831+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc shard 146 soid 5:33551cd7:::rbd_data.2e64f3c9858639.000000000003a562:head : candidate >
May 12 13:47:29 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:47:29.097+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc deep-scrub 0 missing, 1 inconsistent objects
May 12 13:47:29 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:47:29.097+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc deep-scrub 1 errors

# Fix:
    root@ceph01:~# ceph pg repair 5.33
    instructing pg 5.33 on osd.79 to repair
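
# After issuing the repair, a minimal way to verify that the PG returns to active+clean:
    # follow cluster log messages for this PG (Ctrl-C to stop)
    ceph -w | grep 5.33
    # or query the PG state directly
    ceph pg 5.33 query | grep '"state"'
    # health should return to HEALTH_OK once the repair finishes
    ceph health detail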

# Alternative approach:
    systemctl restart ceph-osd@79.service
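
# If you are not sure which OSD is the primary for the PG (and therefore which ceph-osd service to restart), look it up first:
    # show the up/acting set and the primary OSD of pg 5.33
    ceph pg map 5.33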

2. Category: Module 'devicehealth' has failed: disk I/O error

# Symptom:
root@A1:~# ceph -s
  cluster:
    id:     c9732c40-e843-4865-8f73-9e61551c993d
    health: HEALTH_ERR
            Module 'devicehealth' has failed: disk I/O error

root@A1:~# ceph health detail
HEALTH_ERR Module 'devicehealth' has failed: disk I/O error; 1 mgr modules have recently crashed
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed: disk I/O error
    Module 'devicehealth' has failed: disk I/O error

# Fix:
    Create a new mgr so that PVE automatically creates the .mgr pool; once that pool exists, the ERR alert disappears.
    Note: the name of this automatically created health-metrics pool (.mgr) differs between Ceph versions.
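
# A minimal sketch of the PVE-side steps, assuming a Proxmox VE node with the pveceph tooling installed:
    # create an additional mgr on the local node
    pveceph mgr create
    # after the new mgr is active, the health-metrics pool should exist (.mgr or device_health_metrics depending on the Ceph version)
    ceph osd pool ls | grep -E '^\.mgr$|^device_health_metrics$'
    ceph health detail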


3. Category: osd full

Reference article: how to recover a Ceph cluster whose services stopped because of full OSDs
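
# The referenced article covers the full recovery procedure; as a quick first check, per-OSD utilization shows which OSDs are near or over the full ratio:
    ceph osd df tree
    # current nearfull / backfillfull / full ratios
    ceph osd dump | grep ratio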

4. Category: Module 'balancer' has failed: ('3.1ef1',)

Background: a pool had too many PGs; pg_num was reduced, and this error appeared while waiting for the PG count to shrink automatically.
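
# For reference, a minimal sketch of the pg_num change behind this situation (512 is the target named in the warning below):
    ceph osd pool set volume-hdd pg_num 512
    # verify the new target; the actual PG count shrinks gradually in the background
    ceph osd pool get volume-hdd pg_num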

Alert symptom:
root@QS0706:~# ceph -s
  cluster:
    id:     5b41e869-7012-492b-9445-ee462338f47c
    health: HEALTH_ERR
            Module 'balancer' has failed: ('3.1ef1',)
            1 pools have too many placement groups

  services:
    mon: 5 daemons, quorum HA1614,HA1615,HA1616,HA1617,HA1618 (age 11w)
    mgr: HA1120(active, since 2M), standbys: HA1614, HA1618, HA1617, HA1616, HA1615
    osd: 296 osds: 296 up (since 7h), 296 in (since 5h); 815 remapped pgs

  task status:

  data:
    pools:   3 pools, 9893 pgs
    objects: 28.45M objects, 107 TiB
    usage:   332 TiB used, 243 TiB / 575 TiB avail
    pgs:     2405834/85344147 objects misplaced (2.819%)
             9078 active+clean
             659  active+clean+remapped
             137  active+remapped+backfill_wait
             19   active+remapped+backfilling

  io:
    client:   406 MiB/s rd, 240 MiB/s wr, 17.71k op/s rd, 13.22k op/s wr
    recovery: 929 MiB/s, 0 keys/s, 236 objects/s

root@QS0706:~# ceph health  detail 
HEALTH_ERR Module 'balancer' has failed: ('3.1ef1',); 1 pools have too many placement groups
[ERR] MGR_MODULE_ERROR: Module 'balancer' has failed: ('3.1ef1',)
    Module 'balancer' has failed: ('3.1ef1',)
[WRN] POOL_TOO_MANY_PGS: 1 pools have too many placement groups
    Pool volume-hdd has 2048 placement groups, should have 512

Alert log:
root@HA1120:~# systemctl status ceph-mgr@HA1120.service 
● ceph-mgr@HA1120.service - Ceph cluster manager daemon
   Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/ceph-mgr@.service.d
           └─ceph-after-pve-cluster.conf
   Active: active (running) since Thu 2023-04-20 22:32:32 CST; 3 months 7 days ago
 Main PID: 1861 (ceph-mgr)
    Tasks: 70 (limit: 7372)
   Memory: 1.8G
   CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@HA1120.service
           └─1861 /usr/bin/ceph-mgr -f --cluster ceph --id HA1120 --setuser ceph --setgroup ceph

Jul 28 16:53:15 HA1120 ceph-mgr[1861]: 2023-07-28T16:53:15.431+0800 7fe3149c4700 -1 Traceback (most recent call last):
Jul 28 16:53:15 HA1120 ceph-mgr[1861]:   File "/usr/share/ceph/mgr/balancer/module.py", line 686, in serve
Jul 28 16:53:15 HA1120 ceph-mgr[1861]:     r, detail = self.optimize(plan)
Jul 28 16:53:15 HA1120 ceph-mgr[1861]:   File "/usr/share/ceph/mgr/balancer/module.py", line 966, in optimize
Jul 28 16:53:15 HA1120 ceph-mgr[1861]:     return self.do_crush_compat(plan)
Jul 28 16:53:15 HA1120 ceph-mgr[1861]:   File "/usr/share/ceph/mgr/balancer/module.py", line 1056, in do_crush_compat
Jul 28 16:53:15 HA1120 ceph-mgr[1861]:     pe = self.calc_eval(ms, plan.pools)
Jul 28 16:53:15 HA1120 ceph-mgr[1861]:   File "/usr/share/ceph/mgr/balancer/module.py", line 812, in calc_eval
Jul 28 16:53:15 HA1120 ceph-mgr[1861]:     objects_by_osd[osd] += ms.pg_stat[pgid]['num_objects']
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: KeyError: '3.1ef1'

root@HA1120:~# cat /var/log/ceph/ceph-mgr.HA1120.log | grep 3.1ef1
2023-07-28T16:53:15.431+0800 7fe3149c4700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.HA1120: ('3.1ef1',)
KeyError: '3.1ef1'

Fix: the balancer is an mgr module; after searching around, no direct fix was found, and in the end restarting the mgr service on HA1120 cleared the error.
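
# A minimal sketch of that restart; either restart the active mgr's service directly or fail over to a standby mgr:
    # restart the active mgr daemon on HA1120
    systemctl restart ceph-mgr@HA1120.service
    # alternative: trigger a failover to a standby mgr instead
    ceph mgr fail HA1120
    # the balancer module error should clear once an mgr is active again
    ceph health detail
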
Declaration: this is an original article by 辣条①号. When reposting, please keep this declaration and the article link: https://boke.wsfnk.com/archives/1074.html