Common Ceph faults: OSD fails to start

After a system reboot, the OSD fails to start because the LVM information was lost

Others have run into this problem as well; see the linked report for details.

Manually starting the OSD for debugging (fails)

root@node16072:~# /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
2023-05-04T21:39:39.250+0800 7f8d9aae2240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-0/keyring: (2) No such file or directory
2023-05-04T21:39:39.250+0800 7f8d9aae2240 -1 AuthRegistry(0x56400f1ac140) no keyring found at /var/lib/ceph/osd/ceph-0/keyring, disabling cephx
2023-05-04T21:39:39.250+0800 7f8d9aae2240 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-0/keyring: (2) No such file or directory
2023-05-04T21:39:39.250+0800 7f8d9aae2240 -1 AuthRegistry(0x7ffe9af34500) no keyring found at /var/lib/ceph/osd/ceph-0/keyring, disabling cephx
failed to fetch mon config (--no-mon-config to skip)
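
The missing keyring usually just means /var/lib/ceph/osd/ceph-0 was never mounted and populated: for a BlueStore OSD deployed with ceph-volume lvm, that directory is typically a small tmpfs that ceph-volume fills in during activation, which cannot happen if the underlying LVM volume is not visible. A quick sanity check (a sketch; device names and the OSD id will differ on your node):

    # Is the OSD data directory mounted and populated?
    findmnt /var/lib/ceph/osd/ceph-0
    ls -l /var/lib/ceph/osd/ceph-0/

    # Does LVM still see the OSD's PV/VG/LV? Empty output here points to lost LVM metadata.
    pvs && vgs && lvs
    ceph-volume lvm list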

Check the system logs

root@node16072:~# dmesg -T |grep ceph 
[Thu May  4 21:32:58 2023] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[Thu May  4 21:32:58 2023] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[Thu May  4 21:32:58 2023] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[Thu May  4 21:32:58 2023] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[Thu May  4 21:32:58 2023] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[Thu May  4 21:32:58 2023] systemd[1]: remote-fs-pre.target: Found dependency on ceph.target/start
[Thu May  4 21:32:58 2023] systemd[1]: remote-fs-pre.target: Found dependency on ceph-mds.target/start
[Thu May  4 21:32:58 2023] systemd[1]: remote-fs-pre.target: Found dependency on ceph-mon.target/start
[Thu May  4 21:32:58 2023] systemd[1]: remote-fs-pre.target: Found dependency on ceph-mon@node16072.service/start
[Thu May  4 21:32:58 2023] systemd[1]: remote-fs-pre.target: Found ordering cycle on ceph-mon@node16072.service/start
[Thu May  4 21:32:58 2023] systemd[1]: remote-fs-pre.target: Job ceph-mon@node16072.service/start deleted to break ordering cycle starting with remote-fs-pre.target/start
[Thu May  4 21:32:58 2023] systemd[1]: ceph-mgr@node16072.service: Found ordering cycle on pve-cluster.service/start
[Thu May  4 21:32:58 2023] systemd[1]: ceph-mgr@node16072.service: Found dependency on rrdcached.service/start
[Thu May  4 21:32:58 2023] systemd[1]: ceph-mgr@node16072.service: Found dependency on remote-fs.target/start
[Thu May  4 21:32:58 2023] systemd[1]: ceph-mgr@node16072.service: Found dependency on remote-fs-pre.target/start
[Thu May  4 21:32:58 2023] systemd[1]: ceph-mgr@node16072.service: Found dependency on ceph-mgr@node16072.service/start
[Thu May  4 21:32:58 2023] systemd[1]: ceph-mgr@node16072.service: Job pve-cluster.service/start deleted to break ordering cycle starting with ceph-mgr@node16072.service/start
[Thu May  4 21:32:58 2023] systemd[1]: Created slice system-ceph\x2dmgr.slice.
[Thu May  4 21:32:58 2023] systemd[1]: Created slice system-ceph\x2dmon.slice.
[Thu May  4 21:32:58 2023] systemd[1]: Created slice system-ceph\x2dvolume.slice.
[Thu May  4 21:32:58 2023] systemd[1]: Reached target ceph target allowing to start/stop all ceph-fuse@.service instances at once.
[Thu May  4 21:32:58 2023] systemd[1]: Reached target ceph target allowing to start/stop all ceph-mon@.service instances at once.
[Thu May  4 21:32:58 2023] systemd[1]: Reached target ceph target allowing to start/stop all ceph-mds@.service instances at once.
[Thu May  4 21:32:58 2023] systemd[1]: Reached target ceph target allowing to start/stop all ceph-osd@.service instances at once.
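
Most of the dmesg output above is systemd noise (KillMode warnings and ordering cycles); the more useful question is whether the ceph-volume activation units actually ran at boot. One way to check (unit names are examples, yours will carry your own OSD id and fsid):

    # Did any ceph-volume activation unit run, and did it fail?
    systemctl list-units --all 'ceph-volume@*'

    # Follow up on the OSD itself with the journal
    journalctl -b -u ceph-osd@0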

Possible causes

# 1. When an OSD is created, lvm2-lvmetad.service and lvm2-lvmetad.socket must be running;
#    otherwise the LVM information on the disks may be lost after the machine reboots.

# 2. The best practice is to enable lvm2-lvmetad.service and then reboot the machine:
    systemctl start lvm2-lvmetad.service
    systemctl enable lvm2-lvmetad.service

# 3. If lvm2-lvmetad was not running when the OSD was created, the OSD log will contain a WARNING about it.
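
To verify whether lvmetad is present and enabled on the node, something like the following should work (note: newer lvm2 releases, roughly 2.03 and later, dropped lvmetad entirely, so these units may simply not exist there):

    systemctl status lvm2-lvmetad.socket lvm2-lvmetad.service
    systemctl enable --now lvm2-lvmetad.socket lvm2-lvmetad.service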

Solution

# Fix: manually start the ceph-volume activation unit for the affected OSD
    systemctl start ceph-volume@lvm-1-fb045fd1-ce5b-4503-a37e-1c63061058ab.service

    # If that does not help, trigger ceph-volume directly
    /usr/sbin/ceph-volume lvm trigger 1-fb045fd1-ce5b-4503-a37e-1c63061058ab
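
The instance name in the unit above follows the pattern lvm-<osd id>-<osd fsid>. If you do not know the fsid, ceph-volume can look it up, or simply reactivate every OSD it finds (a generic fallback, not specific to this node):

    # Show all OSDs ceph-volume knows about, including each osd fsid
    ceph-volume lvm list

    # Activate one OSD explicitly, or all of them at once
    ceph-volume lvm activate 1 fb045fd1-ce5b-4503-a37e-1c63061058ab
    ceph-volume lvm activate --all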

After modifying the OSD settings in ceph.conf, the OSD fails to start with start-limit-hit

The following error is reported:
    root@fuse01:~# systemctl status ceph-osd@3.service 
    ● ceph-osd@3.service - Ceph object storage daemon osd.3
        Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
        Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
                └─ceph-after-pve-cluster.conf
        Active: failed (Result: start-limit-hit) since Fri 2023-12-08 13:39:54 CST; 5min ago
        Process: 2686756 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 3 (code=exited, status=0/SUCCESS)
        Process: 2686760 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 3 --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
    Main PID: 2686760 (code=exited, status=0/SUCCESS)
            CPU: 23.794s

    Dec 08 13:41:11 fuse01 systemd[1]: Failed to start Ceph object storage daemon osd.3.
    Dec 08 13:41:33 fuse01 systemd[1]: ceph-osd@3.service: Start request repeated too quickly.
    Dec 08 13:41:33 fuse01 systemd[1]: ceph-osd@3.service: Failed with result 'start-limit-hit'.

Solution
    root@fuse01:~# systemctl daemon-reload
    root@fuse01:~# systemctl restart ceph-osd@3
    root@fuse01:~# systemctl restart ceph-osd@2
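
start-limit-hit means systemd has stopped retrying because the unit failed too many times within a short window; daemon-reload plus restart usually clears it. If systemd still refuses to start the unit, resetting its failed state first should help (shown for osd.3 as in the example above):

    systemctl reset-failed ceph-osd@3.service
    systemctl restart ceph-osd@3.service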