Commit b107b92

nakulpathak3 authored and rtyler committed
Fargate deploy post updates
1 parent e6ef06c commit b107b92

File tree

1 file changed

_posts/2021-03-18-faster-fargate-deploys.md

Lines changed: 12 additions & 10 deletions
@@ -9,7 +9,7 @@ tags:
team: Internal Tools
---

Scribd moved its monolith to AWS in April 2020 and as part of the migration, we had to design and implement a deployment pipeline for our new (and *shiny*) [ECS Fargate](https://aws.amazon.com/fargate/) infrastructure. In this post, we'll share how we improved our deployment speeds from ~40 minutes to less than 20 minutes.

### Original Implementation

@@ -21,38 +21,40 @@ Our starting implementation involved a few steps:
### Improvements

#### Fargate Service Updates
By far, the slowest part of our deployment was waiting for ECS services to finish updating. We use the default rolling deployment, which stops and starts tasks to trigger a re-pull of the freshly-uploaded [ECR](https://aws.amazon.com/ecr/) image. Here are some changes we implemented -

* **Docker Image Size Reduction** - The first thing everyone thinks of when considering ECS Fargate speedups is how to reduce the image pull time, since Fargate (unlike EC2) [has no image caching](https://github.com/aws/containers-roadmap/issues/696). However, unless you can drastically reduce your image size (think 1GB to 100MB), this will not lead to significant time reductions. We reduced our compressed image size from ~900MB to ~700MB and it led to **little to no improvement**. It did lead to a cleaner image, but that wasn't our initial goal.

* [**Deregistration Delay**](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#deregistration-delay) - This is a property on a load balancer's target group that dictates how long a task stays in the *Draining* state after it stops receiving requests. We looked at Datadog APM for the p99 latency of our longest-running requests and set the delay to 17s from the **default of 300s** (see the CLI sketch after this list). This reduced service refreshes to ~22 minutes.

* **ECS Throttling** - During deployments, we investigated the "Events" tab of our main web ECS service. There were events with the following messages -
- *"service production-web operations are being throttled on elb. Will try again later."*
- *"service production-web operations are being throttled. Will try again later."*

Due to Scribd's high Fargate task volume, the number of start and stop requests we were making during rolling deploys was too high for AWS' default limits. We opened support tickets with the ELB and Fargate teams to get those limits increased. This further reduced service deploy time to 16-18 minutes.

* **Network Load Balancer Health Checks** - From testing in staging, we noticed that reducing our network load balancer's [health-check intervals and thresholds](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html) helped reduce staging deploy time from ~9 to ~6 minutes. However, it only translated to 1-2 minutes saved in production, which runs a much higher number of ECS tasks. You do want to be careful with these values to avoid false-positive health checks, and keep in mind that updating them requires re-creation of the ECS service the load balancer points to.
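
As a rough illustration of the two load-balancer tweaks above, here is a minimal AWS CLI sketch. The target group ARN is a placeholder and the health-check numbers are examples only, not a recommendation - pick values from your own latency and deploy-time measurements.

```bash
# Placeholder ARN - substitute the target group your ECS service registers with.
TG_ARN="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/production-web/0123456789abcdef"

# Drop the deregistration delay from the 300s default to just above the
# p99 latency of your longest-running requests (17s in our case).
aws elbv2 modify-target-group-attributes \
  --target-group-arn "$TG_ARN" \
  --attributes Key=deregistration_delay.timeout_seconds,Value=17

# Tighten health-check interval and thresholds so tasks are marked healthy
# (or pulled out) sooner during a rolling deploy. Example values only;
# overly aggressive settings risk false-positive health-check failures.
aws elbv2 modify-target-group \
  --target-group-arn "$TG_ARN" \
  --health-check-interval-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2
```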

#### Asset Deployment Improvements
Our asset deployments were run using Capistrano. The job `ssh`-ed onto our asset servers and ran a series of [Rake tasks](https://guides.rubyonrails.org/v4.2/command_line.html#rake) to download, unzip, and correctly place assets. There were some issues with this approach -
* The dependency on the Capistrano gem forced us to use the monolith Docker image as the job's base image
* Running Rake tasks required loading the application, which added time to the job
* Our ECS service refresh job runs `docker push/pull` tasks to upload the latest image to ECR. This forced us to have separate jobs for asset and service deployments to avoid adding a Docker dependency to the monolith image.

To resolve these issues, we decided to remove Capistrano & Rake as dependencies and wrote pure Ruby and Bash code to perform the same tasks. This unified the two jobs and brought asset deploy time down from 2.5 minutes to 30s.
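
The replacement script itself isn't shown here, but conceptually it boils down to a few shell commands. The sketch below is illustrative only - the bucket, archive name, and asset root are hypothetical, not our actual paths.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical locations - substitute your own bucket, archive, and asset root.
ASSET_BUCKET="s3://example-asset-builds"
ASSET_ROOT="/var/www/assets"
RELEASE="${CI_COMMIT_SHA}"

# Download the pre-built asset archive for this release.
aws s3 cp "${ASSET_BUCKET}/assets-${RELEASE}.zip" "/tmp/assets-${RELEASE}.zip"

# Unzip into a release-specific directory.
mkdir -p "${ASSET_ROOT}/releases/${RELEASE}"
unzip -q "/tmp/assets-${RELEASE}.zip" -d "${ASSET_ROOT}/releases/${RELEASE}"

# Atomically point the "current" symlink at the new release.
ln -sfn "${ASSET_ROOT}/releases/${RELEASE}" "${ASSET_ROOT}/current"
```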

#### Database Migration
In our case, running a database migration task in Fargate involved starting a new task instance of our `database_migration` task family. Due to Fargate startup slowness, this task would take 3 minutes to run a simple `bundle exec rails db:migrate`.
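
For context, launching such a one-off task looks roughly like the following. The cluster name, network settings, and container name are placeholders; `database_migration` is the task family mentioned above.

```bash
# Start a one-off Fargate task from the database_migration task family.
aws ecs run-task \
  --cluster production \
  --launch-type FARGATE \
  --task-definition database_migration \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=DISABLED}' \
  --overrides '{"containerOverrides":[{"name":"app","command":["bundle","exec","rails","db:migrate"]}]}'
```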

To resolve this, we used `git` and [Gitlab environments](https://docs.gitlab.com/ee/api/environments.html#get-a-specific-environment) to look for modified files in the `db/migrate` directory. If none were found, we would skip running the migration task. Since the majority of our deployments don't run database migrations, this shaved off 3 minutes from most jobs.
```bash
env_json=$(curl --silent --header "PRIVATE-TOKEN: <your_access_token>" "<gitlab-repository-path>/environments/<id>")
last_deployment_sha=$(echo $env_json | jq -r '.last_deployment.sha')
git diff --name-only $CI_COMMIT_SHA $last_deployment_sha | grep db/migrate
```
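
Wrapped in the deploy job, that check becomes a simple branch. In this sketch, `run_migration_task` is a stand-in for however you actually start and wait on the Fargate migration task.

```bash
# grep exits non-zero when nothing under db/migrate changed between the
# last successful deployment and the commit being deployed.
if git diff --name-only "$CI_COMMIT_SHA" "$last_deployment_sha" | grep -q 'db/migrate'; then
  echo "Migrations detected - running database_migration task"
  run_migration_task   # placeholder: starts the Fargate migration task and waits for it
else
  echo "No changes under db/migrate - skipping migration step"
fi
```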

#### Other things to look for
If you run sidecar containers like Datadog, make sure you're providing enough memory and CPU to those containers so you aren't left waiting on a sidecar to become ready after your main container has already started.
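
A quick way to sanity-check how CPU and memory are currently split between the application container and its sidecars is to inspect the task definition; the family name below is a placeholder.

```bash
# Print per-container CPU/memory for the latest revision of a task definition.
aws ecs describe-task-definition --task-definition production-web \
  | jq -r '.taskDefinition.containerDefinitions[] | "\(.name)\tcpu=\(.cpu)\tmemory=\(.memory // .memoryReservation)"'
```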

We hope this helps you speed up your deployments and gain greater efficiency!
