Commit 20258e8 ("Copy edits and other clean up for post"), parent 357e3e6 — 2 files changed, +49 −20 lines
---
layout: post
title: "Automatically recycling EKS worker nodes"
author: Kuntalb
tags:
- eks
- kubernetes
- lambda
- step function
- terraform
- featured
team: Core Platform
---

A few months ago, we faced a problem: we needed to upgrade our Kubernetes
version in AWS EKS without downtime. Getting the control plane upgraded
without downtime was relatively easy, manual but easy. The bigger challenge
was getting the physical worker nodes updated. We had to manually complete each of the following steps:

1. Create a new worker node with the latest configuration.
2. Put the old node in standby mode.
3. Taint the old node as unschedulable.
4. Wait for all existing pods to die gracefully. In our case, we had some really long-running pods, some of which took 20 hours or more to actually finish!
5. Detach and kill the old node.
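Under the hood, these steps map onto a handful of AWS and Kubernetes API calls. As a rough sketch (the `recycle_plan` helper and its action names are invented for this post, not the module's actual code), the sequence can be written down as an ordered plan:

```python
# Sketch: the manual recycling sequence expressed as an ordered plan.
# The action names and the recycle_plan helper are illustrative only;
# the real module implements these steps as separate Lambda functions.

def recycle_plan(instance_id: str, asg_name: str) -> list:
    """Return the ordered steps needed to recycle one EKS worker node."""
    return [
        # 1. The ASG launches a replacement node with the latest configuration.
        ("scale_up", {"AutoScalingGroupName": asg_name}),
        # 2. Move the old instance to standby (autoscaling EnterStandby).
        ("enter_standby", {"InstanceIds": [instance_id],
                           "AutoScalingGroupName": asg_name,
                           "ShouldDecrementDesiredCapacity": False}),
        # 3. Taint the node so the scheduler stops placing new pods on it.
        ("taint_node", {"effect": "NoSchedule"}),
        # 4. Poll until running pods matching the label selector reach zero.
        ("wait_for_pods", {"label_selector": "app=long-running"}),
        # 5. Detach and stop the old instance.
        ("detach_and_stop", {"InstanceIds": [instance_id]}),
    ]
```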

While doing this, we started thinking: what if an automated module could perform all of these steps at the click of a button? We are pleased to open source and share our [terraform-aws-recycle-eks module](https://github.com/scribd/terraform-aws-recycle-eks), which does all of these steps for us!

## What Problem Does It Solve

1. Periodic recycling of old worker nodes. In fact, we can create a lifecycle hook when creating a node and integrate that hook with this module. That way, periodic recycling is fully automated via the lifecycle hook, with zero downtime via this module and no need for manual intervention at all.
2. Minimal manual intervention while recycling a worker node.
3. It can be integrated with SNS/CloudWatch events, so that if there is a CPU spike in the middle of the night, this Step Function can step up and create a new node while allowing the old node to die gracefully. All new incoming tasks can then be handled by the new node, reducing pressure on the existing node while we investigate the root cause and continue to be in service. There are plenty more use cases like this.
4. It can make upgrading/patching of Kubernetes and EKS worker nodes much easier.
5. The module also takes a custom label selector as an input, which lets the user wait only for the pods that matter. Everything else is ignored while waiting for pods to gracefully finish.
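To make the event-driven use case concrete, here is a sketch of how a CloudWatch (EventBridge) rule could be wired up to start the recycling workflow with Terraform. The resource names and the `module.recycle_eks.step_function_arn` output are assumptions for illustration, not the module's actual interface:

```hcl
# Illustrative only: wiring a CloudWatch (EventBridge) rule to trigger the
# recycling Step Function. Resource names and the module output referenced
# below are invented for this post; consult the module's README for real inputs.

resource "aws_cloudwatch_event_rule" "cpu_spike" {
  name        = "eks-worker-cpu-spike"
  description = "Fires when a CloudWatch alarm on worker CPU enters ALARM"
  event_pattern = jsonencode({
    source        = ["aws.cloudwatch"]
    "detail-type" = ["CloudWatch Alarm State Change"]
    detail        = { state = { value = ["ALARM"] } }
  })
}

resource "aws_cloudwatch_event_target" "recycle" {
  rule     = aws_cloudwatch_event_rule.cpu_spike.name
  arn      = module.recycle_eks.step_function_arn # hypothetical module output
  role_arn = aws_iam_role.events_to_sfn.arn       # role allowing states:StartExecution
}
```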

## Components

### Terraform

[Terraform](https://terraform.io) has always been our tool of choice for managing infrastructure, and using Terraform for this module also gives us the opportunity to integrate it seamlessly with all of our other existing infrastructure.

### Lambdas and Step Functions

[Orchestrating Amazon Kubernetes Service (EKS)](https://medium.com/@alejandro.millan.frias/managing-kubernetes-from-aws-lambda-7922c3546249) from [AWS Lambda and Amazon EKS Node Drainer](https://github.com/aws-samples/amazon-k8s-node-drainer) has already set a precedent that Lambdas can be a great tool for managing EKS clusters. However, Lambdas have one notable limitation: they are very short-lived. If we ran all the steps through a single Lambda function, it would eventually time out while waiting for all existing pods to complete. So we need to split the workflow into multiple Lambdas and manage their lifecycles through a workflow manager. This is where [Step Functions](https://aws.amazon.com/step-functions/?step-functions.sort-by=item.additionalFields.postDateTime&step-functions.sort-order=desc) enter the picture. Using a Step Function not only solves the problem of Lambda timeouts but also gives us an opportunity to extend this module to be triggered automatically based on events.
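As a rough sketch of what such a state machine can look like in Amazon States Language, here is an illustrative skeleton. The state names, placeholder ARNs, and the `$.running_pods` field are invented for this post; the module ships its own definition:

```json
{
  "Comment": "Illustrative skeleton only; state names and ARNs are invented.",
  "StartAt": "PutNodeToStandby",
  "States": {
    "PutNodeToStandby": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:putNodesToStandby",
      "Next": "TaintNode"
    },
    "TaintNode": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:taintNodes",
      "Next": "CheckRunningPods"
    },
    "CheckRunningPods": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:checkNodesForRunningPods",
      "Next": "PodsDone"
    },
    "PodsDone": {
      "Type": "Choice",
      "Choices": [
        {"Variable": "$.running_pods", "NumericEquals": 0, "Next": "DetachAndTerminate"}
      ],
      "Default": "WaitAndRetry"
    },
    "WaitAndRetry": {"Type": "Wait", "Seconds": 300, "Next": "CheckRunningPods"},
    "DetachAndTerminate": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:detachAndTerminateNode",
      "End": true
    }
  }
}
```

The `Choice` plus `Wait` loop is what lets the workflow outlive any single Lambda invocation while pods drain for hours.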

## Design

1. Create a [Step Function](https://github.com/scribd/terraform-aws-recycle-eks/blob/main/step-function.json) consisting of 4 Lambdas. This Step Function handles the transfer of inputs across the Lambda functions.
2. The [first Lambda](https://github.com/scribd/terraform-aws-recycle-eks/blob/main/lambdas/putNodesToStandby.py) takes an instance id as input and puts the instance into standby state, using the Auto Scaling API to automatically add a new instance to the group while putting the old instance into standby. The old instance only enters the "Standby" state once the new instance is fully "InService".
3. Taint this "Standby" node in EKS using the Kubernetes API in a [Lambda](https://github.com/scribd/terraform-aws-recycle-eks/blob/main/lambdas/taintNodes.py) to prevent new pods from being scheduled onto it.
4. Another [Lambda](https://github.com/scribd/terraform-aws-recycle-eks/blob/main/lambdas/checkNodesForRunningPods.py) periodically uses the Kubernetes API to check the status of "stateful" pods on that node, based on the label selector provided.
5. Once all stateful pods on the node have completed, i.e. the number of running pods reaches 0, shut down that standby instance using the AWS SDK via a [Lambda](https://github.com/scribd/terraform-aws-recycle-eks/blob/main/lambdas/detachAndTerminateNode.py).
6. We are not terminating the node, only shutting it down, just in case. In future releases, we will start terminating the nodes.
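Steps 3 and 4 boil down to two small pieces of logic: a taint patch for the node object, and a running-pod count filtered by the label selector. A minimal sketch (these helper names are invented for this post; the real Lambdas use the Kubernetes API directly):

```python
# Sketch of the core logic behind design steps 3 and 4. Illustrative only;
# not the module's actual code.

def build_taint_patch(key="recycling", value="true"):
    """Body for a PATCH on /api/v1/nodes/{name}: adds a NoSchedule taint so
    the scheduler stops placing new pods on the node (existing pods keep
    running until they finish)."""
    return {"spec": {"taints": [{"key": key, "value": value,
                                 "effect": "NoSchedule"}]}}

def count_running_pods(pods, node_name, label_selector):
    """Count pods still running on `node_name` whose labels match a simple
    key=value label selector; the node is safe to stop when this hits 0."""
    key, _, value = label_selector.partition("=")
    return sum(
        1
        for pod in pods
        if pod["spec"].get("nodeName") == node_name
        and pod["status"]["phase"] == "Running"
        and pod["metadata"].get("labels", {}).get(key) == value
    )
```

Filtering by the label selector is what lets the module ignore daemonset and system pods and wait only on the workloads that matter.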

## Sample Execution

![Sample execution output of the Step Function](/post-images/2020-12-recycle-eks-worker/Step-Function-sample-output.png)
<font size="3"><center><i>Sample execution output of the Step Function</i></center></font>

## Future Enhancements

1. Right now, the first Lambda includes a 300 second sleep just to ensure that the new node is in the *InService* state before putting the old node into standby. We should verify this programmatically rather than relying on an arbitrary 300 second sleep.
2. Refactor the code into a common module for getting the access token.
3. Better logging and exception handling.
4. Make use of the namespace input while selecting pods. Currently it checks for pods in all namespaces.
5. Find a Terraform way to edit configmap/aws-auth; this step is still manual to make this module work.
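For the first enhancement, one option is to poll the Auto Scaling API until a replacement instance reports `InService` instead of sleeping a fixed 300 seconds. A sketch under that assumption (the helper names are invented for this post; `client` would be a boto3 `autoscaling` client in a real Lambda):

```python
import time

def new_instance_in_service(instances, exclude_instance_id):
    """Return True once some instance other than the one being recycled
    reports the InService lifecycle state. `instances` is a list of dicts
    shaped like boto3's DescribeAutoScalingGroups Instances entries."""
    return any(
        inst["LifecycleState"] == "InService"
        and inst["InstanceId"] != exclude_instance_id
        for inst in instances
    )

def wait_for_replacement(client, old_instance_id, asg_name,
                         poll_seconds=15, timeout_seconds=600):
    """Poll a boto3 autoscaling client until a replacement is InService,
    raising TimeoutError rather than sleeping an arbitrary 300 seconds."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        group = client.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
        if new_instance_in_service(group["Instances"], old_instance_id):
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"no replacement for {old_instance_id} in {asg_name}")
```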

---

Within Scribd's Platform Engineering group we have a *lot* more services than people, so we're always trying to find new ways to automate our infrastructure. If you're interested in helping to build out a scalable data platform to help change the way the world reads, [come join us!](/careers/#open-positions)