Ceph MDS Pod Anti-Affinity Troubleshooting

During operations such as a Kubernetes node upgrade, alerts may appear in the Ceph status.
This article summarizes how to identify and resolve this condition.

Alert observed after draining worker node 1:

# Check Ceph status
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     d874b4ea-8deb-4aa3-a3ac-e750180a6a5b
    health: HEALTH_WARN
            4 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 10h)
    mgr: b(active, since 5M), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 10h), 3 in (since 18M)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 20.17k objects, 882 MiB
    usage:   7.6 GiB used, 82 GiB / 90 GiB avail
    pgs:     113 active+clean

  io:
    client:   852 B/s rd, 2 op/s rd, 0 op/s wr

# Check Ceph pod status
test@test-master-01:~$ kubectl -n rook-ceph get pod -o wide | egrep 'mgr|mds|mon|osd' 
rook-ceph-mds-myfs-a-77d484dc4-jddf9 2/2 Running 4 537d 172.16.118.75 test-worker-02 <none> <none> 
rook-ceph-mds-myfs-b-bd6ddc59b-l2b4t 2/2 Running 4 537d 172.16.118.72 test-worker-02 <none> <none> 
rook-ceph-mgr-a-7595f6b7d8-v2ww6 3/3 Running 8 546d 172.16.7.148 test-worker-03 <none> <none> 
rook-ceph-mgr-b-7cdf75cdb6-bmmgq 3/3 Running 0 171d 172.16.36.215 test-worker-01 <none> <none> 
rook-ceph-mon-a-54db4674f4-9z847 2/2 Running 6 546d 172.16.118.101 test-worker-02 <none> <none> 
rook-ceph-mon-b-54788d658b-wd658 2/2 Running 4 546d 172.16.36.230 test-worker-01 <none> <none> 
rook-ceph-mon-c-84f87b7c5-9z6ck 2/2 Running 4 546d 172.16.7.153 test-worker-03 <none> <none> 
rook-ceph-osd-0-788c4889ff-5gvcm 2/2 Running 0 171d 172.16.36.226 test-worker-01 <none> <none> 
rook-ceph-osd-1-7795c9dc4c-hzvqv 2/2 Running 0 171d 172.16.118.106 test-worker-02 <none> <none> 
rook-ceph-osd-2-6db8dc77dc-f8ct9 2/2 Running 0 171d 172.16.7.160 test-worker-03 <none> <none> 
rook-ceph-osd-prepare-test-worker-01-hldb7 0/1 Completed 0 314d <none> test-worker-01 <none> <none> 
rook-ceph-osd-prepare-test-worker-02-wv5rc 0/1 Completed 0 314d 172.16.118.84 test-worker-02 <none> <none> 
rook-ceph-osd-prepare-test-worker-03-pbnbb 0/1 Completed 0 314d 172.16.7.172 test-worker-03 <none> <none>

# Check which nodes the MDS pods are on
test@test-master-01:~$ kubectl -n rook-ceph get pod -o wide | egrep 'mds'
rook-ceph-mds-myfs-a-77d484dc4-jddf9                      2/2     Running     0                18s    172.16.118.75    test-worker-02   <none>           <none>
rook-ceph-mds-myfs-b-bd6ddc59b-l2b4t                      2/2     Running     0                18s    172.16.118.72    test-worker-02   <none>           <none>

# Check CephFS status
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
myfs - 2 clients
====
RANK      STATE        MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      myfs-b  Reqs:    0 /s  35.9k  18.0k  4301      2
0-s   standby-replay  myfs-a  Evts:    0 /s  35.9k  18.0k  4301      0
      POOL         TYPE     USED  AVAIL
 myfs-metadata   metadata   629M  25.8G
myfs-replicated    data    12.0k  25.8G
   myfs-data0      data     982M  25.8G
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)

Two MDS pods are running (one active, one standby-replay), but both are concentrated on worker node 2.
CephFS is currently serving without issues, but this layout is a performance and availability risk going forward, so we will configure pod anti-affinity to prevent the MDS pods from being concentrated on a single worker node.
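The required anti-affinity rule we are about to apply can be sketched as a simple predicate: a node is eligible for a new MDS pod only if no pod matching the label selector already runs there. A minimal illustration (the node names and labels mirror this cluster; the function itself is a hypothetical sketch, not Kubernetes code):

```python
def eligible_nodes(nodes, scheduled, labels_match):
    """Nodes where a new pod may land under required pod anti-affinity
    with topologyKey kubernetes.io/hostname (simplified sketch)."""
    # A node is blocked if any already-scheduled pod there matches the selector.
    blocked = {node for node, pod in scheduled if labels_match(pod)}
    return [n for n in nodes if n not in blocked]

nodes = ["worker-01", "worker-02", "worker-03"]
# myfs-a already runs on worker-02 and matches the MDS selector
scheduled = [("worker-02", {"app": "rook-ceph-mds", "rook_file_system": "myfs"})]
match = lambda pod: (pod.get("app") == "rook-ceph-mds"
                     and pod.get("rook_file_system") == "myfs")

print(eligible_nodes(nodes, scheduled, match))  # → ['worker-01', 'worker-03']
```

With the rule in place, the second MDS pod can no longer be scheduled onto worker node 2.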

# Apply pod anti-affinity
test@test-master-01:~$ kubectl -n rook-ceph patch cephfilesystem myfs --type='merge' -p '
spec:
  metadataServer:
    placement:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["rook-ceph-mds"]
            - key: rook_file_system
              operator: In
              values: ["myfs"]
          topologyKey: kubernetes.io/hostname
'
cephfilesystem.ceph.rook.io/myfs patched
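One caveat: with `requiredDuringSchedulingIgnoredDuringExecution` and only three worker nodes, draining a node can leave an MDS pod Pending if the remaining nodes are already occupied by matching pods. If that trade-off is unacceptable, a softer variant (a sketch using the same selector, expressed as a preference rather than a hard rule) looks like this:

```shell
# Preferred (soft) anti-affinity: the scheduler tries to spread MDS pods,
# but will still co-locate them when no other node is available.
kubectl -n rook-ceph patch cephfilesystem myfs --type='merge' -p '
spec:
  metadataServer:
    placement:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["rook-ceph-mds"]
              - key: rook_file_system
                operator: In
                values: ["myfs"]
            topologyKey: kubernetes.io/hostname
'
```

For this cluster the hard rule above is used, since two healthy worker nodes are always expected to be available.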

# Verify the change
test@test-master-01:~$ kubectl -n rook-ceph get cephfilesystem myfs -o yaml | sed -n '1,260p'
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ceph.rook.io/v1","kind":"CephFilesystem","metadata":{"annotations":{},"name":"myfs","namespace":"rook-ceph"},"spec":{"dataPools":[{"replicated":{"size":3}}],"metadataPool":{"replicated":{"size":3}},"metadataServer":{"activeCount":1,"activeStandby":true},"preserveFilesystemOnDelete":true}}
  creationTimestamp: "2024-07-07T15:23:38Z"
  finalizers:
  - cephfilesystem.ceph.rook.io
  generation: 3
  name: myfs
  namespace: rook-ceph
  resourceVersion: "115648910"
  uid: 92cd0904-f6e1-4b15-853d-165b87be04d5
spec:
  dataPools:
  - application: ""
    erasureCoded:
      codingChunks: 0
      dataChunks: 0
    mirroring: {}
    quotas: {}
    replicated:
      size: 3
    statusCheck:
      mirror: {}
  metadataPool:
    application: ""
    erasureCoded:
      codingChunks: 0
      dataChunks: 0
    mirroring: {}
    quotas: {}
    replicated:
      size: 3
    statusCheck:
      mirror: {}
  metadataServer:
    activeCount: 1
    activeStandby: true
    placement:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-mds
            - key: rook_file_system
              operator: In
              values:
              - myfs
          topologyKey: kubernetes.io/hostname
    resources: {}
  preserveFilesystemOnDelete: true
  statusCheck:
    mirror: {}
status:
  observedGeneration: 2
  phase: Ready
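Note that in the output above `metadata.generation` is 3 while `status.observedGeneration` is still 2, suggesting the operator had not yet reconciled the latest spec at the time of this capture. One way to check that the operator has caught up (a sketch):

```shell
# Compare desired vs. reconciled generation; equal values mean reconciliation is done
kubectl -n rook-ceph get cephfilesystem myfs \
  -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'
```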

# After uncordoning worker node 1 and draining worker node 2, the MDS pods land on worker nodes 1 and 3
test@test-master-01:~$ kubectl -n rook-ceph get pod -o wide | egrep 'mds'
rook-ceph-mds-myfs-a-58846844d6-nd5mk                     2/2     Running     0             53s     172.16.36.216    test-worker-01   <none>           <none>
rook-ceph-mds-myfs-b-6b4d9476cb-q6b6p                     2/2     Running     0             38s     172.16.7.190     test-worker-03   <none>           <none>

# Check CephFS status
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
myfs - 2 clients
====
RANK      STATE        MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      myfs-a  Reqs:    0 /s  35.9k  18.0k  4301      2
0-s   standby-replay  myfs-b  Evts:    0 /s  35.9k  18.0k  4301      0
      POOL         TYPE     USED  AVAIL
 myfs-metadata   metadata   621M  25.8G
myfs-replicated    data    12.0k  25.8G
   myfs-data0      data     982M  25.8G
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)

The MDS pods are now distributed across nodes as intended, but ceph status still reports HEALTH_WARN.
This alarm refers to past mgr pod crash events and is unrelated to the MDS issue we just addressed; we will review the events and then clear them.
As long as the mgr "available" status is true, the mgr itself is healthy.

# Check the mgr alarm reported by ceph health
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     d874b4ea-8deb-4aa3-a3ac-e750180a6a5b
    health: HEALTH_WARN
            4 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 2m)
    mgr: b(active, since 10m), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 110s), 3 in (since 18M)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 20.17k objects, 881 MiB
    usage:   6.3 GiB used, 84 GiB / 90 GiB avail
    pgs:     113 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

# Confirm the mgr is healthy
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mgr stat
{
    "epoch": 476,
    "available": true,
    "active_name": "b",
    "num_standby": 1
}
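For scripting this check, the JSON output can be tested directly; assuming `jq` is installed on the host, the following exits non-zero when no active mgr is available (a sketch):

```shell
# Exit 0 only when ceph reports an available active mgr
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mgr stat \
  | jq -e '.available == true'
```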

# List recorded mgr crashes
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph crash ls
ID                                                                ENTITY  NEW
2025-12-26T09:01:17.354121Z_c76c6eaf-4bf7-4cf9-a9ec-f646fe857b76  mgr.b    *
2025-12-26T09:01:32.345473Z_4dfd271c-3d5b-4c89-88cf-13ba096f327b  mgr.b    *
2025-12-26T09:01:47.357321Z_0f938fb6-4c50-4b58-815d-5990fbe4bbb7  mgr.b    *
2025-12-26T09:02:02.329492Z_43d344a7-b71f-442e-a664-1852dda3a3f3  mgr.b    *
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph crash stat
4 crashes recorded
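Before archiving, individual entries can be inspected for their backtrace and affected module with `ceph crash info`, using an ID from the listing above:

```shell
# Show the full crash report (entity, backtrace, ceph version) for one entry
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- \
  ceph crash info 2025-12-26T09:01:17.354121Z_c76c6eaf-4bf7-4cf9-a9ec-f646fe857b76
```

If the backtrace points to a recurring mgr module failure rather than a one-off event, investigate further before clearing the history.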

# Archive the mgr crash history, then re-check the health alarm
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph crash archive-all
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     d874b4ea-8deb-4aa3-a3ac-e750180a6a5b
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 6m)
    mgr: b(active, since 14m), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5m), 3 in (since 18M)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 20.17k objects, 881 MiB
    usage:   6.3 GiB used, 84 GiB / 90 GiB avail
    pgs:     113 active+clean

  io:
    client:   922 B/s rd, 1 op/s rd, 0 op/s wr

Additionally, actions such as node drain and uncordon can cause transient rebalancing and quorum changes, as in the output below; if you re-check after some time, the cluster returns to HEALTH_OK.

test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     d874b4ea-8deb-4aa3-a3ac-e750180a6a5b
    health: HEALTH_WARN
            1/3 mons down, quorum a,c
            4 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum a,c (age 0.275988s), out of quorum: b
    mgr: b(active, since 7m), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 8m), 3 in (since 18M)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 20.17k objects, 881 MiB
    usage:   6.3 GiB used, 84 GiB / 90 GiB avail
    pgs:     113 active+clean

  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
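To follow the recovery without repeatedly re-running the full status command, `ceph health` can be polled until it settles (a sketch; the 10-second interval is arbitrary):

```shell
# Poll overall health; HEALTH_WARN clears once mon quorum and PGs recover
watch -n 10 "kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health"
```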
