All tasks on a specific node restarting in a swarm cluster environment

Created 2022-02-07, Last modified 2022-02-14

Problem

In a swarm cluster with three manager nodes, all service tasks deployed on one particular node were found to have been redeployed at the same time. To narrow this down, two things were checked: whether the same thing happened on other nodes, and whether only tasks deployed as services were redeployed.

  • Did it happen only on one node? => The full node list of the swarm cluster that the node belongs to is shown below, and the issue was confirmed to have occurred only on that node.

$ docker node ls
ID                            HOSTNAME        STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
lo9isvanmlq88qq23ns6e1eur *   fs-store-repo   Ready     Active         Leader           20.10.5
4xb1pzaf14ronrzcggfgv70pn     fs-testbed-60   Ready     Active         Reachable        20.10.5
q2555giglkrp82c8ny7l8d26x     txstgfin        Ready     Active         Reachable        20.10.5
  • Were only service tasks redeployed, or standalone containers as well? => At that time, every service task deployed on that node in the swarm cluster was redeployed, and only on that node; standalone containers on the same node were not affected.

$ docker stack ps -f "node=fs-store-repo" monitoring_stack_server 
ID             NAME                                       IMAGE                                                     NODE            DESIRED STATE   CURRENT STATE           ERROR                              PORTS
y61hla6heibv   monitoring_stack_server_cassandra.1        harbor.danal.co.kr/docker_hub/library/cassandra:3.11.11   fs-store-repo   Running         Running 26 hours ago                                       
7jcjqojhei3w    \_ monitoring_stack_server_cassandra.1    harbor.danal.co.kr/docker_hub/library/cassandra:3.11.11   fs-store-repo   Shutdown        Shutdown 29 hours ago                                                                          
bhc880yzmsbh   monitoring_stack_server_consul.1           harbor.danal.co.kr/docker_hub/library/consul:1.10.4       fs-store-repo   Shutdown        Rejected 23 hours ago   "No such image: harbor.danal.c…"   
nveu6wvxiwgm    \_ monitoring_stack_server_consul.1       harbor.danal.co.kr/docker_hub/library/consul:1.10.4       fs-store-repo   Shutdown        Rejected 26 hours ago   "No such image: harbor.danal.c…"   
ku0w3xxzt6k5    \_ monitoring_stack_server_consul.1       harbor.danal.co.kr/docker_hub/library/consul:1.10.4       fs-store-repo   Shutdown        Rejected 26 hours ago   "No such image: harbor.danal.c…"   
1sxww9yru4su   monitoring_stack_server_cortex.2           harbor.danal.co.kr/quay_io/cortexproject/cortex:v1.11.0   fs-store-repo   Running         Running 26 hours ago                                       
h4c02q3uqe46    \_ monitoring_stack_server_cortex.2       harbor.danal.co.kr/quay_io/cortexproject/cortex:v1.11.0   fs-store-repo   Shutdown        Failed 29 hours ago     "task: non-zero exit (1)"          
q5czcqbq8bk4    \_ monitoring_stack_server_cortex.2       harbor.danal.co.kr/quay_io/cortexproject/cortex:v1.11.0   fs-store-repo   Shutdown        Failed 29 hours ago     "task: non-zero exit (1)"          
men9glmx7jtl    \_ monitoring_stack_server_cortex.2       harbor.danal.co.kr/quay_io/cortexproject/cortex:v1.11.0   fs-store-repo   Shutdown        Shutdown 29 hours ago                                      
k4c9ddaie5z8   monitoring_stack_server_grafana.1          harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   fs-store-repo   Shutdown        Failed 29 hours ago     "task: non-zero exit (1)"          
nny6zj8pox5b    \_ monitoring_stack_server_grafana.1      harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   fs-store-repo   Shutdown        Failed 29 hours ago     "task: non-zero exit (1)"          
ma6q08supcs1    \_ monitoring_stack_server_grafana.1      harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   fs-store-repo   Shutdown        Failed 29 hours ago     "task: non-zero exit (1)"          
q0tmv7upvjwo    \_ monitoring_stack_server_grafana.1      harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   fs-store-repo   Shutdown        Failed 29 hours ago     "task: non-zero exit (1)"          
lmszbxi5p6tj    \_ monitoring_stack_server_grafana.1      harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   fs-store-repo   Shutdown        Failed 29 hours ago     "task: non-zero exit (1)"          
vf3hm2cjy0a3    \_ monitoring_stack_server_grafana.1      harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   fs-store-repo   Shutdown        Shutdown 29 hours ago                                      
5yrv28cz9ws4   monitoring_stack_server_prometheus.2       harbor.danal.co.kr/docker_hub/prom/prometheus:v2.32.0     fs-store-repo   Running         Running 26 hours ago                                       
kvoenmvugw0h    \_ monitoring_stack_server_prometheus.2   harbor.danal.co.kr/docker_hub/prom/prometheus:v2.32.0     fs-store-repo   Shutdown        Shutdown 29 hours ago
$ docker ps -a -f "name=monitoring_stack_server"
CONTAINER ID   IMAGE                                                     COMMAND                  CREATED        STATUS                      PORTS                                         NAMES
acec0d3c49f9   harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   "/run.sh"                29 hours ago   Exited (1) 29 hours ago                                                   monitoring_stack_server_grafana.1.k4c9ddaie5z8909fkof4y3n16
c9f5e96b7bd2   harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   "/run.sh"                29 hours ago   Exited (1) 29 hours ago                                                   monitoring_stack_server_grafana.1.nny6zj8pox5b1fsobh8dt7279
4782a2484581   harbor.danal.co.kr/quay_io/cortexproject/cortex:v1.11.0   "/bin/cortex -config…"   29 hours ago   Up 29 hours                 80/tcp                                        monitoring_stack_server_cortex.2.1sxww9yru4suzung344w07skv
35888cd3025d   harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   "/run.sh"                29 hours ago   Exited (1) 29 hours ago                                                   monitoring_stack_server_grafana.1.ma6q08supcs1gullyjy9dn2y3
d25c66d7037a   harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   "/run.sh"                29 hours ago   Exited (1) 29 hours ago                                                   monitoring_stack_server_grafana.1.q0tmv7upvjwot49mwn58ennkd
b8b00aaaf420   harbor.danal.co.kr/quay_io/cortexproject/cortex:v1.11.0   "/bin/cortex -config…"   29 hours ago   Exited (1) 29 hours ago                                                   monitoring_stack_server_cortex.2.h4c02q3uqe465d29j5x8cdalv
ed1493f11979   harbor.danal.co.kr/docker_hub/prom/prometheus:v2.32.0     "/bin/prometheus --c…"   29 hours ago   Up 29 hours                 0.0.0.0:9090->9090/tcp                        monitoring_stack_server_prometheus.2.5yrv28cz9ws4qjkygrraxln34
fa1ad77cebf4   harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   "/run.sh"                29 hours ago   Exited (1) 29 hours ago                                                   monitoring_stack_server_grafana.1.lmszbxi5p6tjuuxuoicjtlb9w
ad0faf8d0463   harbor.danal.co.kr/quay_io/cortexproject/cortex:v1.11.0   "/bin/cortex -config…"   29 hours ago   Exited (1) 29 hours ago                                                   monitoring_stack_server_cortex.2.q5czcqbq8bk44y0j0sbittt9w
c6b6fdf455f6   harbor.danal.co.kr/docker_hub/library/cassandra:3.11.11   "./cassandra-init.sh"    29 hours ago   Up 29 hours                 7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp   monitoring_stack_server_cassandra.1.y61hla6heibv6219t3uusetop
9bfc9bba6959   harbor.danal.co.kr/docker_hub/grafana/grafana-oss:8.3.3   "/run.sh"                4 days ago     Exited (0) 29 hours ago                                                   monitoring_stack_server_grafana.1.vf3hm2cjy0a3lthq7v6wh41l2
cffbd9a7c80d   harbor.danal.co.kr/docker_hub/library/cassandra:3.11.11   "./cassandra-init.sh"    4 days ago     Exited (143) 29 hours ago                                                 monitoring_stack_server_cassandra.1.7jcjqojhei3w8akqg8j2m59ng
f9104ce653ee   harbor.danal.co.kr/docker_hub/prom/prometheus:v2.32.0     "/bin/prometheus --c…"   4 days ago     Exited (0) 29 hours ago                                                   monitoring_stack_server_prometheus.2.kvoenmvugw0he383y7bki36nk
7299ba281c4a   harbor.danal.co.kr/quay_io/cortexproject/cortex:v1.11.0   "/bin/cortex -config…"   3 weeks ago    Exited (137) 29 hours ago   

Conclusion: in a swarm cluster composed of three manager nodes, at a specific point in time every service task on one node (192.168.136.61, fs-store-repo) was redeployed, while the standalone containers on that node were not.
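As an additional cross-check on the affected node, the local docker event stream can pin down exactly when the swarm agent stopped and recreated the containers. A minimal sketch, assuming the time window seen in the daemon logs further below (note that the daemon only keeps a limited in-memory buffer of past events, so this has to be run reasonably soon after the incident):

$ docker events --since "2022-02-06T04:55:00" --until "2022-02-06T05:05:00" \
    --filter "type=container" --filter "event=die" --filter "event=start"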

Analysis

To find out why every service task on one node in the swarm was redeployed, rather than just a specific container, the docker daemon logs of each node were examined.

  • Node where the problem did not occur (192.168.136.10, txstgfin)

# journalctl -u docker -S "2022-02-06 04:00" -U "2022-02-06 05:00"
...(omitted)...
Feb 06 04:55:42 txstgfin dockerd[349591]: time="2022-02-06T04:55:42.880256768+09:00" level=info msg="NetworkDB stats txstgfin(43496003523c) - netID:0e9wx1tt26ur9uaidj5bas2cf leaving:false netPeers:3 entries:12 Queue qLen:0 netMsg/s:0"
Feb 06 04:55:42 txstgfin dockerd[349591]: time="2022-02-06T04:55:42.880288291+09:00" level=info msg="NetworkDB stats txstgfin(43496003523c) - netID:sodw73b5hbh2y0dfyenvuux7u leaving:false netPeers:3 entries:11 Queue qLen:0 netMsg/s:0"
Feb 06 04:58:00 txstgfin dockerd[349591]: time="2022-02-06T04:58:00.952318073+09:00" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Feb 06 04:58:02 txstgfin dockerd[349591]: time="2022-02-06T04:58:02.480257999+09:00" level=info msg="memberlist: Suspect b8a14f25127c has failed, no acks received"
Feb 06 04:58:02 txstgfin dockerd[349591]: time="2022-02-06T04:58:02.952694244+09:00" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Feb 06 04:58:04 txstgfin dockerd[349591]: time="2022-02-06T04:58:04.481218489+09:00" level=info msg="memberlist: Suspect b8a14f25127c has failed, no acks received"
Feb 06 04:58:04 txstgfin dockerd[349591]: time="2022-02-06T04:58:04.627705970+09:00" level=info msg="memberlist: Marking b8a14f25127c as failed, suspect timeout reached (0 peer confirmations)"
Feb 06 04:58:04 txstgfin dockerd[349591]: time="2022-02-06T04:58:04.630085505+09:00" level=info msg="Node b8a14f25127c/192.168.136.61, left gossip cluster"
Feb 06 04:58:04 txstgfin dockerd[349591]: time="2022-02-06T04:58:04.634415617+09:00" level=info msg="Node b8a14f25127c change state NodeActive --> NodeFailed"
Feb 06 04:58:04 txstgfin dockerd[349591]: time="2022-02-06T04:58:04.659908328+09:00" level=info msg="Node b8a14f25127c/192.168.136.61, added to failed nodes list"
Feb 06 04:58:04 txstgfin dockerd[349591]: time="2022-02-06T04:58:04.953035169+09:00" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Feb 06 04:58:06 txstgfin dockerd[349591]: time="2022-02-06T04:58:06.991076338+09:00" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Feb 06 04:58:08 txstgfin dockerd[349591]: time="2022-02-06T04:58:08.991374466+09:00" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Feb 06 04:58:09 txstgfin dockerd[349591]: time="2022-02-06T04:58:09.907004228+09:00" level=error msg="error while reading from stream" error="rpc error: code = Canceled desc = context canceled"
Feb 06 04:58:10 txstgfin dockerd[349591]: time="2022-02-06T04:58:10.227326496+09:00" level=warning msg="memberlist: Refuting a suspect message (from: b7a3bb9ca95d)"
Feb 06 04:58:10 txstgfin dockerd[349591]: time="2022-02-06T04:58:10.228009303+09:00" level=info msg="Node b8a14f25127c/192.168.136.61, joined gossip cluster"
Feb 06 04:58:10 txstgfin dockerd[349591]: time="2022-02-06T04:58:10.228077498+09:00" level=info msg="Node b8a14f25127c change state NodeFailed --> NodeActive"
Feb 06 04:58:10 txstgfin dockerd[349591]: time="2022-02-06T04:58:10.991829405+09:00" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Feb 06 04:58:12 txstgfin dockerd[349591]: time="2022-02-06T04:58:12.339077733+09:00" level=info msg="worker lo9isvanmlq88qq23ns6e1eur was successfully registered" method="(*Dispatcher).register"
Feb 06 04:58:12 txstgfin dockerd[349591]: time="2022-02-06T04:58:12.992106444+09:00" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Feb 06 04:58:14 txstgfin dockerd[349591]: time="2022-02-06T04:58:14.992428810+09:00" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Feb 06 04:58:16 txstgfin dockerd[349591]: time="2022-02-06T04:58:16.697882426+09:00" level=warning msg="rmServiceBinding 48c15a6819341f57c4c22b037b412873a4ddb596e9a2cc33c1ca02cbd136d2fe possible transient state ok:false entries:0 set:false "
Feb 06 04:58:16 txstgfin dockerd[349591]: time="2022-02-06T04:58:16.895395886+09:00" level=warning msg="rmServiceBinding a23b02259714bb72024ead4e13b650b908d47f4e8363839acd59882b72d9d87f possible transient state ok:false entries:0 set:false "
Feb 06 04:58:17 txstgfin dockerd[349591]: time="2022-02-06T04:58:17.698577541+09:00" level=warning msg="rmServiceBinding 9343adfc22e0e131a59ed4ece8a5f9d247a6b2bf48c7b4e6769c4fafa394ec5f possible transient state ok:false entries:0 set:false "
Feb 06 04:58:17 txstgfin dockerd[349591]: time="2022-02-06T04:58:17.895610042+09:00" level=warning msg="rmServiceBinding ea678c9ebc72740fe510f81a3075a523d38815712893d81a3a3e1c82ef862d83 possible transient state ok:false entries:0 set:false "
Feb 06 04:58:20 txstgfin dockerd[349591]: time="2022-02-06T04:58:20.094486586+09:00" level=warning msg="rmServiceBinding d6380d6a985ace4a5f33586ff98e90fabaf13cab325c151879093b22779cec8e possible transient state ok:false entries:0 set:false "
Feb 06 04:58:21 txstgfin dockerd[349591]: time="2022-02-06T04:58:21.096368212+09:00" level=warning msg="rmServiceBinding a2a45edb725f811c037888bc57ad4c8b3cae7ae970d069258f419c1ab4e4c62d possible transient state ok:false entries:0 set:false "
Feb 06 04:58:26 txstgfin dockerd[349591]: time="2022-02-06T04:58:26.293903524+09:00" level=warning msg="rmServiceBinding 954f50fe9c02726006300504e5220dd17719b191a0ce2f652473e0b657dc0268 possible transient state ok:false entries:0 set:false "
Feb 06 04:59:13 txstgfin dockerd[349591]: time="2022-02-06T04:59:13.121483635+09:00" level=warning msg="underweighting node lo9isvanmlq88qq23ns6e1eur for service u5f5n764qwwr1x8dexanrjmoi because it experienced 5 failures or rejections within 5m0s" module=node node.id=q2555giglkrp82c8ny7l8d26x
Feb 06 05:00:42 txstgfin dockerd[349591]: time="2022-02-06T05:00:42.883442420+09:00" level=info msg="NetworkDB stats txstgfin(43496003523c) - netID:z9jn6yve4s5h49lw92anpjnhs leaving:false netPeers:3 entries:19 Queue qLen:0 netMsg/s:0"
Feb 06 05:00:42 txstgfin dockerd[349591]: time="2022-02-06T05:00:42.883561671+09:00" level=info msg="NetworkDB stats txstgfin(43496003523c) - netID:0e9wx1tt26ur9uaidj5bas2cf leaving:false netPeers:3 entries:22 Queue qLen:0 netMsg/s:0"
...(omitted)...

04:58:00 ~ 04:58:04 (log lines 5~8): heartbeats to one member of the cluster (b8a14f25127c, 192.168.136.61) fail twice, so that node is marked as suspect.

04:58:04 (log lines 9~12): the member that failed its heartbeats (b8a14f25127c) is judged to no longer be part of the gossip cluster; its state is changed (NodeActive --> NodeFailed) and it is added to the failed nodes list.

04:58:04 ~ 04:58:09 (log lines 13~16): repeated gRPC errors (DeadlineExceeded / context canceled) while the node is unreachable; not analyzed further.

04:58:10 (log lines 17~19): a suspect message from member b7a3bb9ca95d (192.168.136.60) is refuted, i.e. member b8a14f25127c has rejoined the gossip cluster, so its state is changed back (NodeFailed --> NodeActive).

04:58:10 ~ 04:58:14 (log lines 20~23): the worker lo9isvanmlq88qq23ns6e1eur (fs-store-repo) re-registers with the dispatcher, with further DeadlineExceeded errors in between; not analyzed further.

04:58:16 ~ 04:59:13 (log lines 24~31): logs related to the redeployment of the service tasks on member b8a14f25127c.

Conclusion: due to a problem that has not yet been identified, heartbeats between one node (192.168.136.61, fs-store-repo) and the other nodes failed several times, so that node was treated as having split off from the cluster (04:58:00 ~ 04:58:10). Once heartbeats succeeded again, the node rejoined the cluster, and the redeployment is presumed to have happened as part of updating its tasks.
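The same membership churn should also be visible in the daemon log of the affected node itself (fs-store-repo). A minimal sketch of that check, assuming roughly the same time window as above:

# journalctl -u docker -S "2022-02-06 04:50" -U "2022-02-06 05:10" | grep -Ei "memberlist|gossip|heartbeat"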

Resolution

The exact root cause has not been identified, but searching the issue list on GitHub turns up reports of the same symptom.

There are cases where the problem was resolved by tuning the heartbeat period used for node health checks in the swarm cluster (the --dispatcher-heartbeat flag of docker swarm update). We decided to increase the period with this flag and monitor whether the same problem recurs.

$ docker swarm update --dispatcher-heartbeat 20s
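
To verify that the new period is in effect, the Swarm section of docker info on any manager node reports the dispatcher heartbeat period (the exact output format may differ between Docker versions):

$ docker info | grep -A 1 Dispatcher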

Conclusion: after raising --dispatcher-heartbeat in the swarm cluster from the default 5s to 20s as above, the same issue was confirmed not to recur over one week of monitoring.
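
As a simple way to keep watching for a recurrence, the number of times the daemon saw a node drop out of the gossip cluster over the monitoring window can be counted from the same log message observed above (a sketch; run on each node):

# journalctl -u docker --since "7 days ago" | grep -c "left gossip cluster"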

Etc

  • In the docker daemon logs, the swarm-related lines look related to consul (they appear to come from the same memberlist gossip library that Consul uses); for a more accurate analysis, the detailed implementation of consul and swarm needs to be examined.

  • In addition to the docker daemon logs, the docker raft logs (swarm logs) should also be checked.

  • We know the situation was triggered by a heartbeat timeout, but we could not determine exactly why the heartbeat timed out; this needs to be followed up at the network level (see the sketch below).
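
The memberlist heartbeats behind these logs travel over the swarm node-communication port, 7946/tcp and 7946/udp, between nodes. A minimal sketch for a network-level follow-up, assuming the interface name eth0 and reusing the affected node's address from above: a reachability check plus a packet capture during a suspect window.

$ nc -zv 192.168.136.61 7946
# tcpdump -ni eth0 host 192.168.136.61 and port 7946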
