1. Category: scrub errors, pg inconsistent
# Symptom:
root@ceph01:~# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 5.33 is active+clean+inconsistent, acting [79,80,34]
# Cause:
Most likely an earlier fault left the replicas of some object inconsistent. The alert only fires when new data lands on the affected OSD or a scrub reads the damaged object; as long as nothing touches that spot, the inconsistency stays undetected. Once it is detected, repair it promptly.
Ceph also runs periodic deep scrubs; if the inconsistency cannot heal itself, this alert is raised.
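Before repairing, it helps to confirm which PG and which objects are actually inconsistent. A minimal sketch (the PG ID 5.33 comes from the example above; <pool-name> is a placeholder for the pool that PG belongs to):

rados list-inconsistent-pg <pool-name>                   # list PGs flagged inconsistent in that pool
rados list-inconsistent-obj 5.33 --format=json-pretty    # show the objects/shards that failed the deep scrub
ceph health detail                                       # cross-check the cluster-level view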
# Check the log of the corresponding OSD (the excerpt below is from another occurrence of the same error: osd.126, pg 5.2cc)
root@hkhdd001:~# systemctl status ceph-osd@126.service
● ceph-osd@126.service - Ceph object storage daemon osd.126
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Wed 2023-02-22 18:22:11 HKT; 2 months 18 days ago
Process: 4145337 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 126 (code=exited, status=0/SUCCESS)
Main PID: 4145346 (ceph-osd)
Tasks: 61
Memory: 11.4G
CPU: 3d 2h 55min 2.691s
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@126.service
└─4145346 /usr/bin/ceph-osd -f --cluster ceph --id 126 --setuser ceph --setgroup ceph
May 12 08:22:46 hkhdd001 sudo[2570878]: pam_unix(sudo:session): session closed for user root
May 12 08:22:47 hkhdd001 sudo[2570898]: ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl -x --json=o /dev/sdf
May 12 08:22:47 hkhdd001 sudo[2570898]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64045)
May 12 08:22:48 hkhdd001 sudo[2570898]: pam_unix(sudo:session): session closed for user root
May 12 08:22:49 hkhdd001 sudo[2570901]: ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/nvme st10000ne000-3ap101 smart-log-add --json /dev/sdf
May 12 08:22:49 hkhdd001 sudo[2570901]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64045)
May 12 08:22:49 hkhdd001 sudo[2570901]: pam_unix(sudo:session): session closed for user root
May 12 13:33:05 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:33:05.831+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc shard 146 soid 5:33551cd7:::rbd_data.2e64f3c9858639.000000000003a562:head : candidate >
May 12 13:47:29 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:47:29.097+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc deep-scrub 0 missing, 1 inconsistent objects
May 12 13:47:29 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:47:29.097+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc deep-scrub 1 errors
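If the systemctl output is truncated, the same scrub errors can be read directly from the OSD log file or the journal (a sketch, assuming the default /var/log/ceph layout and osd.126 from this example):

grep '\[ERR\]' /var/log/ceph/ceph-osd.126.log            # full scrub error lines, untruncated
journalctl -u ceph-osd@126.service | grep ERR            # same information via the journal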
# Fix:
root@ceph01:~# ceph pg repair 5.33
instructing pg 5.33 on osd.79 to repair
# Another approach:
systemctl restart ceph-osd@79.service
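Either way, a quick verification sketch once the repair or restart is done (the scrub may take a while to be scheduled):

ceph pg deep-scrub 5.33              # force a fresh deep scrub of the repaired PG
ceph pg 5.33 query | grep -i scrub   # check the last_scrub / last_deep_scrub timestamps
ceph health detail                   # the error clears once the scrub reports 0 errors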
2. Category: Module 'devicehealth' has failed: disk I/O error
# Symptom:
root@A1:~# ceph -s
cluster:
id: c9732c40-e843-4865-8f73-9e61551c993d
health: HEALTH_ERR
Module 'devicehealth' has failed: disk I/O error
root@A1:~# ceph health detail
HEALTH_ERR Module 'devicehealth' has failed: disk I/O error; 1 mgr modules have recently crashed
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed: disk I/O error
Module 'devicehealth' has failed: disk I/O error
# Fix:
Create a new mgr and let PVE automatically create the .mgr pool; once the pool exists, the ERR alert disappears.
Note: different Ceph releases give this automatically created health-metrics pool different names.
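A sketch of the surrounding checks (standard Ceph CLI commands; on PVE the new mgr can be created with pveceph mgr create or through the GUI, and the pool is named device_health_metrics on Pacific but .mgr from Quincy onward):

ceph mgr module ls | grep -i devicehealth                    # confirm the module's state
ceph osd pool ls | grep -E '^\.mgr$|device_health_metrics'   # does the health-metrics pool exist?
ceph crash ls                                                # inspect the recent mgr crash
ceph crash archive-all                                       # clear the "recently crashed" warning once handled
ceph mgr fail <active-mgr>                                   # fail over so a fresh mgr re-initializes the module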
3. Category: osd full
4. Category: Module 'balancer' has failed: ('3.1ef1',)
Background: the pool had far too many PGs; after lowering pg_num, this error appeared while waiting for the PG count to shrink automatically.
Symptom:
root@QS0706:~# ceph -s
cluster:
id: 5b41e869-7012-492b-9445-ee462338f47c
health: HEALTH_ERR
Module 'balancer' has failed: ('3.1ef1',)
1 pools have too many placement groups
services:
mon: 5 daemons, quorum HA1614,HA1615,HA1616,HA1617,HA1618 (age 11w)
mgr: HA1120(active, since 2M), standbys: HA1614, HA1618, HA1617, HA1616, HA1615
osd: 296 osds: 296 up (since 7h), 296 in (since 5h); 815 remapped pgs
task status:
data:
pools: 3 pools, 9893 pgs
objects: 28.45M objects, 107 TiB
usage: 332 TiB used, 243 TiB / 575 TiB avail
pgs: 2405834/85344147 objects misplaced (2.819%)
9078 active+clean
659 active+clean+remapped
137 active+remapped+backfill_wait
19 active+remapped+backfilling
io:
client: 406 MiB/s rd, 240 MiB/s wr, 17.71k op/s rd, 13.22k op/s wr
recovery: 929 MiB/s, 0 keys/s, 236 objects/s
root@QS0706:~# ceph health detail
HEALTH_ERR Module 'balancer' has failed: ('3.1ef1',); 1 pools have too many placement groups
[ERR] MGR_MODULE_ERROR: Module 'balancer' has failed: ('3.1ef1',)
Module 'balancer' has failed: ('3.1ef1',)
[WRN] POOL_TOO_MANY_PGS: 1 pools have too many placement groups
Pool volume-hdd has 2048 placement groups, should have 512
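To see how far the PG merge has progressed and what the autoscaler is targeting, something like (pool name taken from the warning above):

ceph osd pool autoscale-status         # current PG_NUM vs. the NEW PG_NUM target per pool
ceph osd pool get volume-hdd pg_num    # decreases step by step as PGs are merged
ceph osd pool get volume-hdd pgp_num
ceph health detail                     # POOL_TOO_MANY_PGS clears once pg_num reaches the target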
Alert log:
root@HA1120:~# systemctl status ceph-mgr@HA1120.service
● ceph-mgr@HA1120.service - Ceph cluster manager daemon
Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-mgr@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Thu 2023-04-20 22:32:32 CST; 3 months 7 days ago
Main PID: 1861 (ceph-mgr)
Tasks: 70 (limit: 7372)
Memory: 1.8G
CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@HA1120.service
└─1861 /usr/bin/ceph-mgr -f --cluster ceph --id HA1120 --setuser ceph --setgroup ceph
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: 2023-07-28T16:53:15.431+0800 7fe3149c4700 -1 Traceback (most recent call last):
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: File "/usr/share/ceph/mgr/balancer/module.py", line 686, in serve
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: r, detail = self.optimize(plan)
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: File "/usr/share/ceph/mgr/balancer/module.py", line 966, in optimize
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: return self.do_crush_compat(plan)
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: File "/usr/share/ceph/mgr/balancer/module.py", line 1056, in do_crush_compat
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: pe = self.calc_eval(ms, plan.pools)
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: File "/usr/share/ceph/mgr/balancer/module.py", line 812, in calc_eval
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: objects_by_osd[osd] += ms.pg_stat[pgid]['num_objects']
Jul 28 16:53:15 HA1120 ceph-mgr[1861]: KeyError: '3.1ef1'
root@HA1120:~# cat /var/log/ceph/ceph-mgr.HA1120.log | grep 3.1ef1
2023-07-28T16:53:15.431+0800 7fe3149c4700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'balancer' while running on mgr.HA1120: ('3.1ef1',)
KeyError: '3.1ef1'
Fix: the balancer is an mgr module, and after searching around I could not find a targeted way to clear the error; in the end I restarted the mgr service on HA1120 and the problem went away.
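A sketch of that restart plus verification (either failing over the active mgr or restarting its systemd unit on the node works):

ceph mgr fail HA1120                       # hand the active role over to a standby mgr
# or, on HA1120 itself:
systemctl restart ceph-mgr@HA1120.service
ceph balancer status                       # the balancer module should be back and running
ceph health detail                         # MGR_MODULE_ERROR should be gone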