Ceph MDS Pod Anti-Affinity Troubleshoot

During operations such as a Kubernetes node upgrade, alerts sometimes appear in the Ceph status.
This post summarizes how to check for the condition and how to resolve it.

Alert observed after draining worker node 1

# Check Ceph status
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     d874b4ea-8deb-4aa3-a3ac-e750180a6a5b
    health: HEALTH_WARN
            4 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 10h)
    mgr: b(active, since 5M), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 10h), 3 in (since 18M)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 20.17k objects, 882 MiB
    usage:   7.6 GiB used, 82 GiB / 90 GiB avail
    pgs:     113 active+clean

  io:
    client:   852 B/s rd, 2 op/s rd, 0 op/s wr

# Check Ceph pod status
test@test-master-01:~$ kubectl -n rook-ceph get pod -o wide | egrep 'mgr|mds|mon|osd'
rook-ceph-mds-myfs-a-77d484dc4-jddf9        2/2   Running     4   537d   172.16.118.75    test-worker-02   <none>   <none>
rook-ceph-mds-myfs-b-bd6ddc59b-l2b4t        2/2   Running     4   537d   172.16.118.72    test-worker-02   <none>   <none>
rook-ceph-mgr-a-7595f6b7d8-v2ww6            3/3   Running     8   546d   172.16.7.148     test-worker-03   <none>   <none>
rook-ceph-mgr-b-7cdf75cdb6-bmmgq            3/3   Running     0   171d   172.16.36.215    test-worker-01   <none>   <none>
rook-ceph-mon-a-54db4674f4-9z847            2/2   Running     6   546d   172.16.118.101   test-worker-02   <none>   <none>
rook-ceph-mon-b-54788d658b-wd658            2/2   Running     4   546d   172.16.36.230    test-worker-01   <none>   <none>
rook-ceph-mon-c-84f87b7c5-9z6ck             2/2   Running     4   546d   172.16.7.153     test-worker-03   <none>   <none>
rook-ceph-osd-0-788c4889ff-5gvcm            2/2   Running     0   171d   172.16.36.226    test-worker-01   <none>   <none>
rook-ceph-osd-1-7795c9dc4c-hzvqv            2/2   Running     0   171d   172.16.118.106   test-worker-02   <none>   <none>
rook-ceph-osd-2-6db8dc77dc-f8ct9            2/2   Running     0   171d   172.16.7.160     test-worker-03   <none>   <none>
rook-ceph-osd-prepare-test-worker-01-hldb7  0/1   Completed   0   314d   <none>           test-worker-01   <none>   <none>
rook-ceph-osd-prepare-test-worker-02-wv5rc  0/1   Completed   0   314d   172.16.118.84    test-worker-02   <none>   <none>
rook-ceph-osd-prepare-test-worker-03-pbnbb  0/1   Completed   0   314d   172.16.7.172     test-worker-03   <none>   <none>

# Check which node the MDS pods are running on
test@test-master-01:~$ kubectl -n rook-ceph get pod -o wide | egrep 'mds'
rook-ceph-mds-myfs-a-77d484dc4-jddf9                      2/2     Running     0                18s    172.16.118.75    test-worker-02   <none>           <none>
rook-ceph-mds-myfs-b-bd6ddc59b-l2b4t                      2/2     Running     0                18s    172.16.118.72    test-worker-02   <none>           <none>

# Check CephFS status
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
myfs - 2 clients
====
RANK      STATE        MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      myfs-b  Reqs:    0 /s  35.9k  18.0k  4301      2
0-s   standby-replay  myfs-a  Evts:    0 /s  35.9k  18.0k  4301      0
      POOL         TYPE     USED  AVAIL
 myfs-metadata   metadata   629M  25.8G
myfs-replicated    data    12.0k  25.8G
   myfs-data0      data     982M  25.8G
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)

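Both MDS pods are running (one active, one standby-replay), but both are scheduled on worker node 2.
CephFS is currently serving without problems, but this placement is a risk to service performance and availability, because losing that one node would take down the active MDS and its standby together. We will configure pod anti-affinity so that the MDS pods cannot be concentrated on a single worker node.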
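
The co-location check above can be scripted. The following is a minimal sketch, not part of kubectl or Rook: `colocated_mds_nodes` is a hypothetical helper name, and it assumes its input is two columns (pod name, node name), which avoids depending on the column layout of `-o wide`.

```shell
# Report any node that hosts more than one MDS pod.
# Input on stdin: one "POD_NAME NODE_NAME" pair per line.
# colocated_mds_nodes is a hypothetical helper name used for illustration.
colocated_mds_nodes() {
  awk '$1 ~ /^rook-ceph-mds-/ { count[$2]++ }
       END { for (node in count) if (count[node] > 1) print node, count[node] }'
}

# Usage (assumes the rook-ceph namespace used in this post):
#   kubectl -n rook-ceph get pod -l app=rook-ceph-mds --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | colocated_mds_nodes
```

An empty result means no node carries more than one MDS pod; any printed line is a node name followed by the number of MDS pods on it.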

# Apply pod anti-affinity
test@test-master-01:~$ kubectl -n rook-ceph patch cephfilesystem myfs --type='merge' -p '
spec:
  metadataServer:
    placement:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["rook-ceph-mds"]
            - key: rook_file_system
              operator: In
              values: ["myfs"]
          topologyKey: kubernetes.io/hostname
'
cephfilesystem.ceph.rook.io/myfs patched
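
If more than one filesystem needs the same rule, the patch body can be generated per filesystem name. This is only a sketch: `mds_anti_affinity_patch` is a hypothetical helper, and the YAML it prints simply mirrors the patch used above with the filesystem name substituted.

```shell
# Print the anti-affinity merge patch for a given CephFilesystem name.
# mds_anti_affinity_patch is a hypothetical helper; $1 is the filesystem name.
mds_anti_affinity_patch() {
  fs_name="$1"
  cat <<EOF
spec:
  metadataServer:
    placement:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["rook-ceph-mds"]
            - key: rook_file_system
              operator: In
              values: ["${fs_name}"]
          topologyKey: kubernetes.io/hostname
EOF
}

# Usage:
#   kubectl -n rook-ceph patch cephfilesystem myfs --type=merge \
#     -p "$(mds_anti_affinity_patch myfs)"
```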

# Verify the change
test@test-master-01:~$ kubectl -n rook-ceph get cephfilesystem myfs -o yaml | sed -n '1,260p'
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ceph.rook.io/v1","kind":"CephFilesystem","metadata":{"annotations":{},"name":"myfs","namespace":"rook-ceph"},"spec":{"dataPools":[{"replicated":{"size":3}}],"metadataPool":{"replicated":{"size":3}},"metadataServer":{"activeCount":1,"activeStandby":true},"preserveFilesystemOnDelete":true}}
  creationTimestamp: "2024-07-07T15:23:38Z"
  finalizers:
  - cephfilesystem.ceph.rook.io
  generation: 3
  name: myfs
  namespace: rook-ceph
  resourceVersion: "115648910"
  uid: 92cd0904-f6e1-4b15-853d-165b87be04d5
spec:
  dataPools:
  - application: ""
    erasureCoded:
      codingChunks: 0
      dataChunks: 0
    mirroring: {}
    quotas: {}
    replicated:
      size: 3
    statusCheck:
      mirror: {}
  metadataPool:
    application: ""
    erasureCoded:
      codingChunks: 0
      dataChunks: 0
    mirroring: {}
    quotas: {}
    replicated:
      size: 3
    statusCheck:
      mirror: {}
  metadataServer:
    activeCount: 1
    activeStandby: true
    placement:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - rook-ceph-mds
            - key: rook_file_system
              operator: In
              values:
              - myfs
          topologyKey: kubernetes.io/hostname
    resources: {}
  preserveFilesystemOnDelete: true
  statusCheck:
    mirror: {}
status:
  observedGeneration: 2
  phase: Ready
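
In the output above, `generation` is 3 while `status.observedGeneration` is 2, which may simply mean the output was captured before the operator finished reconciling the patched spec. A rough way to compare the two values is sketched below; `generation_synced` is a hypothetical helper that reads the YAML shown above on stdin and assumes each key appears once at the start of a line.

```shell
# Succeed only when metadata.generation equals status.observedGeneration.
# Input on stdin: the output of `kubectl get cephfilesystem <name> -o yaml`.
# generation_synced is a hypothetical helper name used for illustration.
generation_synced() {
  awk '$1 == "generation:"         { gen = $2 }
       $1 == "observedGeneration:" { obs = $2 }
       END { exit !(gen != "" && gen == obs) }'
}

# Usage:
#   kubectl -n rook-ceph get cephfilesystem myfs -o yaml | generation_synced \
#     && echo "spec reconciled" || echo "operator has not caught up yet"
```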

# After uncordoning worker node 1 and draining worker node 2, confirm the MDS pods are on worker nodes 1 and 3
test@test-master-01:~$ kubectl -n rook-ceph get pod -o wide | egrep 'mds'
rook-ceph-mds-myfs-a-58846844d6-nd5mk                     2/2     Running     0             53s     172.16.36.216    test-worker-01   <none>           <none>
rook-ceph-mds-myfs-b-6b4d9476cb-q6b6p                     2/2     Running     0             38s     172.16.7.190     test-worker-03   <none>           <none>

# Check CephFS status
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
myfs - 2 clients
====
RANK      STATE        MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      myfs-a  Reqs:    0 /s  35.9k  18.0k  4301      2
0-s   standby-replay  myfs-b  Evts:    0 /s  35.9k  18.0k  4301      0
      POOL         TYPE     USED  AVAIL
 myfs-metadata   metadata   621M  25.8G
myfs-replicated    data    12.0k  25.8G
   myfs-data0      data     982M  25.8G
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)

The MDS pods are now distributed across nodes as intended, but ceph status still reports HEALTH_WARN.
This alert records mgr pod crash events and is unrelated to the MDS fix, but we will review the events and then clear them.
If the mgr `available` status is true, the mgr is healthy.

# Check the mgr alert in the Ceph health output
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     d874b4ea-8deb-4aa3-a3ac-e750180a6a5b
    health: HEALTH_WARN
            4 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 2m)
    mgr: b(active, since 10m), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 110s), 3 in (since 18M)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 20.17k objects, 881 MiB
    usage:   6.3 GiB used, 84 GiB / 90 GiB avail
    pgs:     113 active+clean

  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

# Confirm the mgr is healthy
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mgr stat
{
    "epoch": 476,
    "available": true,
    "active_name": "b",
    "num_standby": 1
}
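
The `available` check can be turned into an exit code for use in scripts. This is a small sketch under the assumption that the JSON looks like the output above; `mgr_is_available` is a hypothetical helper, and it uses grep rather than a JSON parser so it does not assume `jq` is installed.

```shell
# Succeed only if the `ceph mgr stat` JSON on stdin reports "available": true.
# mgr_is_available is a hypothetical helper name used for illustration.
mgr_is_available() {
  grep -Eq '"available"[[:space:]]*:[[:space:]]*true'
}

# Usage:
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mgr stat | mgr_is_available \
#     && echo "mgr OK" || echo "mgr not available"
```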

# List mgr crashes
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph crash ls
ID                                                                ENTITY  NEW
2025-12-26T09:01:17.354121Z_c76c6eaf-4bf7-4cf9-a9ec-f646fe857b76  mgr.b    *
2025-12-26T09:01:32.345473Z_4dfd271c-3d5b-4c89-88cf-13ba096f327b  mgr.b    *
2025-12-26T09:01:47.357321Z_0f938fb6-4c50-4b58-815d-5990fbe4bbb7  mgr.b    *
2025-12-26T09:02:02.329492Z_43d344a7-b71f-442e-a664-1852dda3a3f3  mgr.b    *
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph crash stat
4 crashes recorded
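
Before archiving everything, it can help to see which daemons the crash reports belong to. The sketch below summarizes `ceph crash ls` output per entity; `crashes_per_entity` is a hypothetical helper and assumes the header line and column order shown above. Individual reports can be inspected with `ceph crash info <ID>` before they are archived.

```shell
# Count crash reports per entity from `ceph crash ls` output on stdin.
# Skips the header line; column 2 is the ENTITY (for example mgr.b).
# crashes_per_entity is a hypothetical helper name used for illustration.
crashes_per_entity() {
  awk 'NR > 1 && NF >= 2 { count[$2]++ }
       END { for (entity in count) print entity, count[entity] }'
}

# Usage:
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph crash ls | crashes_per_entity
```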

# Archive the mgr crash history, then check the health alert
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph crash archive-all
test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     d874b4ea-8deb-4aa3-a3ac-e750180a6a5b
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 6m)
    mgr: b(active, since 14m), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5m), 3 in (since 18M)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 20.17k objects, 881 MiB
    usage:   6.3 GiB used, 84 GiB / 90 GiB avail
    pgs:     113 active+clean

  io:
    client:   922 B/s rd, 1 op/s rd, 0 op/s wr

In addition, actions such as node drain and uncordon can trigger rebalancing and transient warnings like the one below, but if you check again after some time the status should return to HEALTH_OK.

test@test-master-01:~$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     d874b4ea-8deb-4aa3-a3ac-e750180a6a5b
    health: HEALTH_WARN
            1/3 mons down, quorum a,c
            4 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum a,c (age 0.275988s), out of quorum: b
    mgr: b(active, since 7m), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 8m), 3 in (since 18M)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 20.17k objects, 881 MiB
    usage:   6.3 GiB used, 84 GiB / 90 GiB avail
    pgs:     113 active+clean

  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
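
The "check again after some time" step can be automated with a simple polling loop. This is a sketch only: `wait_for_health_ok` is a hypothetical helper, the attempt count and interval are arbitrary, and it assumes the command passed to it prints the health state the way `ceph health` does.

```shell
# Poll a health command until it prints HEALTH_OK or the attempts run out.
# wait_for_health_ok is a hypothetical helper:
#   $1 = maximum attempts, $2 = seconds between attempts, rest = command to run.
wait_for_health_ok() {
  attempts="$1"; interval="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" 2>/dev/null | grep -q 'HEALTH_OK'; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    if [ "$i" -le "$attempts" ]; then sleep "$interval"; fi
  done
  return 1
}

# Usage (30 attempts, 10 seconds apart):
#   wait_for_health_ok 30 10 \
#     kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health
```

A non-zero return after the last attempt means the cluster stayed in a warning or error state and needs a manual look at `ceph status`.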

Ceph Dashboard OSD Alarm Troubleshoot

# Handle the OSD alert shown on the dashboard

bash-4.4$ ceph status
  cluster:
    id:     d874b4ea-8deb-4aa3-a3ac-e750180a6a5b
    health: HEALTH_WARN
            mon b is low on available space

  services:
    mon: 3 daemons, quorum a,b,c (age 4M)
    mgr: b(active, since 4M), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 4 osds: 3 up (since 4M), 3 in (since 13M)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 113 pgs
    objects: 20.17k objects, 871 MiB
    usage:   7.4 GiB used, 83 GiB / 90 GiB avail
    pgs:     113 active+clean

  io:
    client:   853 B/s rd, 2 op/s rd, 0 op/s wr

bash-4.4$ ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    90 GiB  83 GiB  7.4 GiB   7.4 GiB       8.22
TOTAL  90 GiB  83 GiB  7.4 GiB   7.4 GiB       8.22

--- POOLS ---
POOL             ID  PGS   STORED  OBJECTS      USED  %USED  MAX AVAIL
.mgr              1    1  449 KiB        2   1.3 MiB      0     26 GiB
replicapool       3   32  338 MiB      126  1013 MiB   1.25     26 GiB
myfs-metadata     4   16  209 MiB      390   628 MiB   0.78     26 GiB
myfs-replicated   5   32    158 B    3.00k    12 KiB      0     26 GiB
myfs-data0        6   32  286 MiB   16.65k   983 MiB   1.21     26 GiB

bash-4.4$ ceph osd status
ID  HOST            USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  k8s-worker-01  2515M  27.5G      0        0       1       90   exists,up
 1  k8s-worker-02  2511M  27.5G      0        0       0        0   exists,up
 2  k8s-worker-03  2544M  27.5G      0        0       1       16   exists,up
 3                     0      0      0        0       0        0   autoout,exists,new

bash-4.4$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
-1         0.08789  root default
-3         0.02930      host k8s-worker-01
 0    ssd  0.02930          osd.0               up   1.00000  1.00000
-5         0.02930      host k8s-worker-02
 1    ssd  0.02930          osd.1               up   1.00000  1.00000
-7         0.02930      host k8s-worker-03
 2    ssd  0.02930          osd.2               up   1.00000  1.00000
 3               0  osd.3                     down         0  1.00000

# Remove the unused OSD
bash-4.4$ ceph osd crush remove osd.3
device 'osd.3' does not appear in the crush map
bash-4.4$ ceph auth del osd.3
bash-4.4$ ceph osd rm 3
removed osd.3
bash-4.4$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
-1         0.08789  root default
-3         0.02930      host k8s-worker-01
 0    ssd  0.02930          osd.0               up   1.00000  1.00000
-5         0.02930      host k8s-worker-02
 1    ssd  0.02930          osd.1               up   1.00000  1.00000
-7         0.02930      host k8s-worker-03
 2    ssd  0.02930          osd.2               up   1.00000  1.00000
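
To spot removal candidates without reading the tree by eye, the `ceph osd tree` output can be filtered for OSDs whose status is `down`. This is a sketch: `down_osds` is a hypothetical helper, and a down OSD is not necessarily unused, so each result should still be confirmed (as osd.3 was above) before anything is removed.

```shell
# Print the names of OSDs whose STATUS column is "down".
# Input on stdin: the output of `ceph osd tree`.
# down_osds is a hypothetical helper name used for illustration.
down_osds() {
  awk '{ for (i = 1; i < NF; i++)
           if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i }'
}

# Usage (inside the toolbox pod):
#   ceph osd tree | down_osds
```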

Creating WordPress and MySQL pods with the Ceph storage class

A test that creates LAMP pods backed by persistent distributed volumes

# Create the Ceph storage class

kubectl apply -f csi/rbd/storageclass.yaml

# Deploy the mysql deployment together with its service and PVC

kubectl create -f mysql.yaml
service/wordpress-mysql created
persistentvolumeclaim/mysql-pv-claim created
deployment.apps/wordpress-mysql created
 
kubectl get pvc
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
mysql-pv-claim   Bound    pvc-6d458ff1-54bf-4f34-b010-a0f4b2a0966e   20Gi       RWO            rook-ceph-block   94s
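
The Bound check can also be scripted so that a stuck claim is caught before the next deployment step. This is a minimal sketch: `unbound_pvcs` is a hypothetical helper and assumes the default `kubectl get pvc` column order shown above, with the header suppressed.

```shell
# List PVCs whose STATUS is anything other than Bound.
# Input on stdin: the output of `kubectl get pvc --no-headers`.
# unbound_pvcs is a hypothetical helper name used for illustration.
unbound_pvcs() {
  awk 'NF >= 2 && $2 != "Bound" { print $1, $2 }'
}

# Usage:
#   kubectl get pvc --no-headers | unbound_pvcs
```

No output means every claim in the namespace is Bound.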
 

# Deploy the wordpress deployment together with its service and PVC
kubectl create -f wordpress.yaml
service/wordpress created
persistentvolumeclaim/wp-pv-claim created
deployment.apps/wordpress created
 

kubectl get pod
NAME                               READY   STATUS    RESTARTS   AGE
wordpress-7cf5c5c8b-5cgqk          1/1     Running   0          42s
wordpress-mysql-6f99c59595-9vs7z   1/1     Running   0          4m8s
 

# After installing MetalLB

kubectl get svc  wordpress
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
wordpress         LoadBalancer   10.96.232.248   10.1.4.160    80:31494/TCP   27s