How to handle common Ceph WARN alerts

1. Category: Too many repaired reads

Fix: restart the affected OSD.
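
A minimal sketch of the restart, assuming a systemd-managed, package-based OSD (osd.62 is an example id; under a cephadm deployment use ceph orch instead):
    systemctl restart ceph-osd@62
    #ceph orch daemon restart osd.62        #cephadm-managed clusters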

2. Category: recently crashed


#show details of new (unarchived) crashes
[root@node14 ~]# ceph crash ls-new 
ID                                                                ENTITY  NEW  
2023-01-12T09:23:41.905887Z_45bae4d7-2197-4050-bd28-b3b371650af8  osd.62   *

#Fix: archive the crash reports
    #ceph crash archive <crash-id>        #archive them one at a time
    ceph crash archive-all                #recommended: archive everything so the warning stops showing
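
Before archiving, a single crash can be inspected in detail (the ID below is the example from the listing above):
    ceph crash info 2023-01-12T09:23:41.905887Z_45bae4d7-2197-4050-bd28-b3b371650af8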

3. Category: allowing insecure global_id reclaim


#Fix:
root@ceph01:~# ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED
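
Note that muting only silences the warning; the permanent fix is category 4 below. A TTL can also be given so the mute expires on its own (1w here is an example):
    ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w
    #ceph health unmute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED        #lift the mute manually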

4. Category: client is using insecure global_id reclaim

Issue: if msgr v2 is not enabled, the monitors allow clients to use insecure authentication (mons are allowing insecure global_id reclaim).

#Fix: run on any cluster node
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
ceph config set mon auth_expose_insecure_global_id_reclaim false        #reportedly absent from the official docs; if your version lacks it, skip it (it simply won't appear in the config database)
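
To verify the settings took effect (ceph config get with a specific option works on recent releases; treat this as a sketch):
    ceph config get mon auth_allow_insecure_global_id_reclaim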

5. Category: pgs not deep-scrubbed in time

Alert symptoms and details:
[root@node14 ~]# ceph -s
  cluster:
    id:     3a9af753-3f48-43d8-b0e3-b6a3189f41e7
    health: HEALTH_WARN
            3 pgs not deep-scrubbed in time

[root@node14 ~]# ceph health detail 
HEALTH_WARN 3 pgs not deep-scrubbed in time
[WRN] PG_NOT_DEEP_SCRUBBED: 3 pgs not deep-scrubbed in time
    pg 5.316 not deep-scrubbed since 2023-03-28T10:31:44.693734-0700
    pg 5.d not deep-scrubbed since 2023-03-28T12:26:42.875821-0700
    pg 5.125 not deep-scrubbed since 2023-03-28T12:35:33.828058-0700

Cause
    Some PGs have not been deep-scrubbed within the configured interval; manually deep-scrubbing the flagged PGs clears the warning.

Fix:
    ceph pg deep-scrub 5.316
    ceph pg deep-scrub 5.d 
    ceph pg deep-scrub 5.125
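
Instead of listing PGs by hand, every flagged PG can be fed to deep-scrub in one pass (a sketch that parses ceph health detail; confirm the field position on your version first):
    ceph health detail | grep 'not deep-scrubbed since' | awk '{print $2}' | while read pg; do ceph pg deep-scrub $pg; done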

6. Category: mons down


Fix (more than half of the mons must remain alive for the cluster to stay usable):
    Option 1: start the mon that is down

    Option 2: the old mon has failed and cannot come back online; remove it and add a new one (removal shown below, followed by a sketch of adding the replacement)
root@cunchu4:~# ceph mon stat
e4: 4 mons at {cunchu1=[v2:192.168.19.8:3300/0,v1:192.168.19.8:6789/0],cunchu2=[v2:192.168.19.9:3300/0,v1:192.168.19.9:6789/0],cunchu3=[v2:192.168.19.10:3300/0,v1:192.168.19.10:6789/0],cunchu4=[v2:192.168.19.11:3300/0,v1:192.168.19.11:6789/0]}, election epoch 84, leader 1 cunchu2, quorum 1,2,3 cunchu2,cunchu3,cunchu4

root@cunchu4:~# ceph mon remove cunchu1
removing mon.cunchu1 at [v2:192.168.19.8:3300/0,v1:192.168.19.8:6789/0], there will be 3 monitors
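
A sketch of adding the replacement mon afterwards (hostname and IP are examples; this only registers it in the monmap, the mon daemon itself still has to be deployed and started on that host, e.g. via ceph orch under cephadm):
    ceph mon add cunchu1 192.168.19.8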

7. Category: nearfull osd / backfillfull osd / pool nearfull

#Alert symptoms:
HEALTH_WARN 1 nearfull osd(s); Degraded data redundancy: 1 pg undersized; 1 pool(s) nearfull
[WRN] OSD_NEARFULL: 1 nearfull osd(s)
    osd.2 is near full
[WRN] PG_DEGRADED: Degraded data redundancy: 1 pg undersized
    pg 12.51 is stuck undersized for 3m, current state active+recovering+undersized+remapped, last acting [11,0]
[WRN] POOL_NEARFULL: 1 pool(s) nearfull
    pool 'volumes' is nearfull

#Explanation:
    When pool nearfull appears, osd nearfull has usually appeared as well: by default, a pool only becomes nearfull after one of its OSDs does.
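
To see which OSDs and pools are approaching the limits:
    ceph osd df        #per-OSD utilization and reweight
    ceph df            #per-pool usage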

#Temporary workarounds
    Option 1 (Ceph 16 and earlier):
        #raise the ratios on an individual OSD (the defaults are full 0.95, backfillfull 0.90, nearfull 0.85)
        #with many OSDs, consider setting maintenance flags first, then adjusting the ratios, and finally unsetting the flags
            #ceph osd set noout
            #ceph osd set noscrub
            #ceph osd set nodeep-scrub
        ceph tell osd.11 injectargs '--mon_osd_full_ratio 0.97 --mon_osd_backfillfull_ratio 0.95 --mon_osd_nearfull_ratio 0.9'        #takes effect immediately

        #or apply to every OSD on the current node
        for osd in $(ceph osd ls-tree $HOSTNAME); do ceph tell osd.$osd injectargs '--mon_osd_full_ratio 0.97 --mon_osd_backfillfull_ratio 0.95 --mon_osd_nearfull_ratio 0.9'; done 

    Option 2 (Ceph 16 and earlier):
        #same as option 1, except the OSD must be restarted for it to take effect
        ceph daemon osd.11 config set mon_osd_full_ratio 0.97
        ceph daemon osd.11 config set mon_osd_backfillfull_ratio 0.95
        ceph daemon osd.11 config set mon_osd_nearfull_ratio 0.9
        systemctl restart ceph-osd@11

    Option 3 (all Ceph versions):
        #lower the reweight of the flagged OSD (or raise other OSDs' reweight) so its data migrates elsewhere
        ceph osd reweight 11 0.9        #every OSD defaults to reweight 1; the command takes the numeric OSD id
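
    Option 4 (Luminous and later): adjust the cluster-wide ratios in the OSD map directly; a sketch, the values are examples:
        ceph osd set-nearfull-ratio 0.9
        ceph osd set-backfillfull-ratio 0.92
        ceph osd set-full-ratio 0.97
        ceph osd dump | grep ratio        #verify the new values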

#Permanent fix: add capacity (expand the cluster)

8. Category: OSD(s) experiencing BlueFS spillover

(DB partition spillover; typically seen when an SSD serves as the DB device and an HDD holds the data)

Alert symptoms:
[root@node14 ~]# ceph -s
  cluster:
    id:     3a9af753-3f48-43d8-b0e3-b6a3189f41e7
    health: HEALTH_WARN
            6 OSD(s) experiencing BlueFS spillover

[root@node14 ~]# ceph health detail 
HEALTH_WARN 6 OSD(s) experiencing BlueFS spillover
[WRN] BLUEFS_SPILLOVER: 6 OSD(s) experiencing BlueFS spillover
     osd.7 spilled over 1.0 GiB metadata from 'db' device (8.0 GiB used of 30 GiB) to slow device
     osd.17 spilled over 1.6 GiB metadata from 'db' device (7.0 GiB used of 30 GiB) to slow device
     osd.59 spilled over 1.3 GiB metadata from 'db' device (7.9 GiB used of 30 GiB) to slow device
     osd.65 spilled over 1.4 GiB metadata from 'db' device (7.8 GiB used of 30 GiB) to slow device
     osd.71 spilled over 972 MiB metadata from 'db' device (7.9 GiB used of 30 GiB) to slow device
     osd.77 spilled over 1.2 GiB metadata from 'db' device (8.9 GiB used of 30 GiB) to slow device

On the affected OSD's host, check actual WAL and DB usage:
root@lahost001:/var/lib/ceph/osd# ceph daemon osd.7 perf dump |grep bluefs -A 10
    "bluefs": {
        "gift_bytes": 0,
        "reclaim_bytes": 0,
        "db_total_bytes": 32212246528,
        "db_used_bytes": 8571052032,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
        "slow_total_bytes": 160031375360,
        "slow_used_bytes": 1121583104,        #注意这里
        "num_files": 195,
        "log_bytes": 6475776,
--
        "bluefs_bytes": 78768,
        "bluefs_items": 2266,
        "bluefs_file_reader_bytes": 25757952,
        "bluefs_file_reader_items": 374,
        "bluefs_file_writer_bytes": 896,
        "bluefs_file_writer_items": 4,
        "buffer_anon_bytes": 3670925,
        "buffer_anon_items": 5855,
        "buffer_meta_bytes": 978824,
        "buffer_meta_items": 11123,
        "osd_bytes": 1629936,
        "osd_items": 126,
        "osd_mapbl_bytes": 0,
        "osd_mapbl_items": 0,
        "osd_pglog_bytes": 1990084680,
        "osd_pglog_items": 3989621,

Fix (compaction)

Compaction command:
ceph daemon osd.{id} compact        #compacting is a temporary fix (this command can take quite a while to run)

Check the result after compaction:
root@lahost001:/var/lib/ceph/osd# ceph daemon osd.7 compact
{
    "elapsed_time": 456.12800944000003
}
root@lahost001:/var/lib/ceph/osd# ceph daemon osd.7 perf dump |grep bluefs -A 10
    "bluefs": {
        "gift_bytes": 0,
        "reclaim_bytes": 0,
        "db_total_bytes": 32212246528,
        "db_used_bytes": 4235190272,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
        "slow_total_bytes": 160031375360,
        "slow_used_bytes": 0,
        "num_files": 80,
        "log_bytes": 10592256,
--
        "bluefs_bytes": 36680,
        "bluefs_items": 1327,
        "bluefs_file_reader_bytes": 7851648,
        "bluefs_file_reader_items": 154,
        "bluefs_file_writer_bytes": 896,
        "bluefs_file_writer_items": 4,
        "buffer_anon_bytes": 692095,
        "buffer_anon_items": 6970,
        "buffer_meta_bytes": 1295712,
        "buffer_meta_items": 14724,
        "osd_bytes": 1629936,
        "osd_items": 126,
        "osd_mapbl_bytes": 0,
        "osd_mapbl_items": 0,
        "osd_pglog_bytes": 1989875000,
        "osd_pglog_items": 3989208,

Note: compaction shrinks the data in the DB partition and the warning will likely clear, but it may not help if too much has spilled over or the available DB capacity is simply too small. To stop hitting this warning for good, there are two permanent fixes:

Reference: https://www.cnweed.com/archives/4328/

1: Size the block.db partition appropriately (see below)
    https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing

2: Migrate the DB to a larger partition (a sketch follows)
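
A rough sketch of option 2 using ceph-bluestore-tool (the OSD id, paths, and target LV are examples; stop the OSD first and verify the exact flags for your release):
    systemctl stop ceph-osd@7
    ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-7 --devs-source /var/lib/ceph/osd/ceph-7/block.db --dev-target /dev/vg_ssd/lv_db_new
    systemctl start ceph-osd@7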

9. Category: mon store is getting too big

Cause: the monitor's LevelDB store has grown too large (similar to DB spillover)

Fix:
    ceph tell mon.pve-ceph01 compact
    #ceph daemon mon.pve-ceph01 compact
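
The store size can be checked before and after compaction (the path below assumes the default layout for a mon named pve-ceph01; adjust to your deployment):
    du -sh /var/lib/ceph/mon/ceph-pve-ceph01/store.db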

10. Category: MON_DISK_LOW

MON_DISK_LOW: mon pve-ceph01 is low on available space
Most likely the disk holding this mon's data is running out of space; freeing up space clears the warning.
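
A quick check of the mon's filesystem usage (default data path shown; adjust to your deployment):
    df -h /var/lib/ceph/mon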

11. Category: no active mgr

This alert can appear when clocks across the cluster nodes are out of sync.
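
A sketch of the usual checks (assuming chrony for timekeeping; the mgr unit name depends on your deployment):
    chronyc sources                        #verify time sync on every node
    ceph time-sync-status                  #the monitors' view of clock skew
    systemctl restart ceph-mgr@$HOSTNAME   #then restart the mgr daemon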

12. Category: osd full (the best fix is to add capacity)


If the cluster cannot be expanded right away, raise the ratios for the affected OSD as a stopgap:
ceph tell osd.4 injectargs '--mon_osd_full_ratio 0.95 --mon_osd_nearfull_ratio 0.95 --mon_osd_backfillfull_ratio 0.95'