1. Category: scrub errors, pg inconsistent
#Symptom:
root@ceph01:~# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 5.33 is active+clean+inconsistent, acting [79,80,34]
#Cause:
Most likely an earlier fault left this PG with inconsistent data across its replicas. The alert only fires once new data lands on the affected OSD; as long as nothing touches that spot, the inconsistency goes unnoticed. Once it is detected, repair it promptly.
Ceph also runs periodic deep scrubs; if the inconsistency cannot heal on its own, an alert is raised.
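Before repairing, it can help to see exactly which object and which shard the deep scrub flagged. A minimal sketch, using the PG id 5.33 from the output above:
rados list-inconsistent-obj 5.33 --format=json-pretty   # show the damaged object(s) and which shard failed verification
ceph pg ls inconsistent                                  # list all PGs currently in the inconsistent state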
#Check the log of the corresponding OSD
root@hkhdd001:~# systemctl status ceph-osd@126.service
● ceph-osd@126.service - Ceph object storage daemon osd.126
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
└─ceph-after-pve-cluster.conf
Active: active (running) since Wed 2023-02-22 18:22:11 HKT; 2 months 18 days ago
Process: 4145337 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 126 (code=exited, status=0/SUCCESS)
Main PID: 4145346 (ceph-osd)
Tasks: 61
Memory: 11.4G
CPU: 3d 2h 55min 2.691s
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@126.service
└─4145346 /usr/bin/ceph-osd -f --cluster ceph --id 126 --setuser ceph --setgroup ceph
May 12 08:22:46 hkhdd001 sudo[2570878]: pam_unix(sudo:session): session closed for user root
May 12 08:22:47 hkhdd001 sudo[2570898]: ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl -x --json=o /dev/sdf
May 12 08:22:47 hkhdd001 sudo[2570898]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64045)
May 12 08:22:48 hkhdd001 sudo[2570898]: pam_unix(sudo:session): session closed for user root
May 12 08:22:49 hkhdd001 sudo[2570901]: ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/nvme st10000ne000-3ap101 smart-log-add --json /dev/sdf
May 12 08:22:49 hkhdd001 sudo[2570901]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64045)
May 12 08:22:49 hkhdd001 sudo[2570901]: pam_unix(sudo:session): session closed for user root
May 12 13:33:05 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:33:05.831+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc shard 146 soid 5:33551cd7:::rbd_data.2e64f3c9858639.000000000003a562:head : candidate >
May 12 13:47:29 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:47:29.097+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc deep-scrub 0 missing, 1 inconsistent objects
May 12 13:47:29 hkhdd001 ceph-osd[4145346]: 2023-05-12T13:47:29.097+0800 7f83d871a700 -1 log_channel(cluster) log [ERR] : 5.2cc deep-scrub 1 errors
#Fix:
root@ceph01:~# ceph pg repair 5.33
instructing pg 5.33 on osd.79 to repair
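After issuing the repair, it is worth confirming that the PG actually returns to active+clean. A minimal sketch (commands only, output omitted):
ceph -w | grep 5.33        # watch the cluster log for this PG's repair/scrub messages
ceph pg deep-scrub 5.33    # optionally trigger a fresh deep scrub to re-verify
ceph health detail         # the scrub error should be gone once the repair completes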
#Alternative approach:
systemctl restart ceph-osd@79.service
2. Category: Module 'devicehealth' has failed: disk I/O error
#Symptom:
root@A1:~# ceph -s
cluster:
id: c9732c40-e843-4865-8f73-9e61551c993d
health: HEALTH_ERR
Module 'devicehealth' has failed: disk I/O error
root@A1:~# ceph health detail
HEALTH_ERR Module 'devicehealth' has failed: disk I/O error; 1 mgr modules have recently crashed
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed: disk I/O error
Module 'devicehealth' has failed: disk I/O error
#Solution:
Create a new mgr and let PVE automatically create the .mgr pool; once it exists, the ERR alert disappears.
Note: the name of this automatically created device-health pool differs between Ceph versions (device_health_metrics in older releases, .mgr in newer ones).
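A possible command sequence on a Proxmox VE node, assuming the pveceph tooling is available (a sketch, not the only way; adjust to your cluster):
pveceph mgr create            # create an additional mgr on this node; the fresh mgr re-creates the device-health pool
ceph osd pool ls              # confirm the .mgr / device_health_metrics pool now exists
ceph crash archive-all        # acknowledge the "recently crashed" entries once the module is healthy again
ceph health detail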
3. Category: osd full