---
layout: post
title: "Breaking the mono Airflow DAG repo"
author: qphou
tags:
- airflow
- featured
- datapipe
team: Core Platform
---

Hopefully you are using some kind of version control system to manage all your
Airflow DAGs. If you are, then it’s very likely that all of your DAGs are managed
in a single repository.

A mono DAG repo is simple and easy to operate with at first, but it doesn’t
scale well as the number of DAGs and engineers grows.

In this post, I would like to share what Scribd’s Core Platform team did to
make multi-repo DAG development a reality for Airflow.

## Delivering DAGs

Every Airflow component expects all the DAGs to be present in a local DAG folder
through a filesystem interface. There are three common approaches to meet this
requirement:

* Bake DAGs into the Airflow Docker image
* Sync DAGs to a networked file system like NFS and mount it as Airflow’s DAG
folder
* Pull DAGs from the network into a local file system before starting Airflow
components

Each of these approaches comes with its own trade-offs. Baking DAGs into
Airflow images makes DAG deployment slow, because you need to rebuild and
release a new Airflow image for every DAG change. Managing a networked
file system for DAG sync seems like overkill from a performance and operational
point of view, given that Airflow only requires read access.

We decided to go with the pulling model. By leveraging AWS S3, we get a
high uptime guarantee for our DAG store. We also made the whole process
transparent to Airflow by running
[objinsync](https://github.com/scribd/objinsync) [^1], a stateless DAG sync daemon,
as a sidecar container. From Airflow’s point of view, the DAG folder is just a
magical local folder that always contains the up-to-date DAG definitions
assembled from multiple Git repos.

## The full CI/CD pipeline

To demonstrate how the whole setup works end to end, I think it’s best to walk
through the life cycle of a DAG file.

As shown below, the S3 DAG bucket acts as the bridge between Git repos and
Airflow:

![Using S3 as the bridge](/post-images/2020-03-airflow/s3-as-bridge.png)

All CI/CD pipelines publish their own DAG files into the same S3 bucket,
namespaced by repo name. On the other end, objinsync[^1], deployed with each
Airflow component, pulls the latest DAG files into the local file system for
Airflow to consume.
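
For illustration, the resulting bucket layout might look something like the sketch below, with each repo’s pipeline publishing under its own prefix (the repo and DAG file names here are made up):

```
s3://<S3_DAG_BUCKET>/airflow_home/dags/
    repo_one/daily_etl.py
    repo_one/backfill.py
    repo_two/reporting.py
```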

Zooming out to the full picture, all DAG updates go through a multi-stage
testing and promotion process before they eventually land in the production DAG
S3 bucket:

![DAG release pipeline](/post-images/2020-03-airflow/dag-release-pipeline.png)

This gives our engineers the ability to test out changes without worrying about
impacting production data pipelines.

Lastly, to simplify CI/CD setup, we built a small command line tool in Go to
automate the DAG-to-S3 release process. For any repo’s CI/CD pipeline, all an
engineer has to do is add a single step that runs the following command:

```
airflow-dag-push [dev,staging,production]
```

The `airflow-dag-push` tool automatically scans for DAG files in a special
folder named `workflow` under the root of the source tree and uploads them to the
right S3 bucket, with the right key prefix, based on the provided environment
name and environment variables injected by the CI/CD system.
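
Under the hood, the effect is similar to syncing that folder into an environment-specific bucket under a repo-scoped prefix. A rough sketch of the equivalent command (the bucket, prefix, and repo name are placeholders, not the tool’s actual values):

```
# conceptually what airflow-dag-push does for one environment (placeholder names)
aws s3 sync ./workflow "s3://<S3_DAG_BUCKET_FOR_ENV>/airflow_home/dags/<REPO_NAME>/"
```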

## Implementation details

Our Airflow clusters are orchestrated using both ECS Fargate and EKS. ECS is
used to run the Airflow web server and scheduler, while EKS is what’s powering
Airflow’s Kubernetes executor. Due to differences between Airflow
components, we need to run the objinsync[^1] binary on two container orchestration
platforms with slightly different setups.

For daemon Airflow components like the web server and scheduler, we run
objinsync[^1] in a continuous sync mode, where it pulls incremental updates from
S3 to the local filesystem every 5 seconds. This is implemented using the sidecar
container pattern: the DAG folder is mounted as a shared volume between the
Airflow web/scheduler container and the objinsync[^1] container. The sidecar
objinsync container is set up to run the following command:

```
/bin/objinsync pull s3://<S3_DAG_BUCKET>/airflow_home/dags <YOUR_AIRFLOW_HOME>/dags
```
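
For the ECS-hosted web server and scheduler, the sidecar wiring might look roughly like the abbreviated task definition fragment below; the container names, paths, and bucket are placeholders rather than our actual configuration:

```
{
  "volumes": [{ "name": "airflow-dags" }],
  "containerDefinitions": [
    {
      "name": "airflow-webserver",
      "mountPoints": [
        { "sourceVolume": "airflow-dags", "containerPath": "<YOUR_AIRFLOW_HOME>/dags" }
      ]
    },
    {
      "name": "objinsync",
      "command": [
        "/bin/objinsync", "pull",
        "s3://<S3_DAG_BUCKET>/airflow_home/dags", "<YOUR_AIRFLOW_HOME>/dags"
      ],
      "mountPoints": [
        { "sourceVolume": "airflow-dags", "containerPath": "<YOUR_AIRFLOW_HOME>/dags" }
      ]
    }
  ]
}
```

Both containers mount the same volume, so from the scheduler’s perspective the DAG folder is just a regular local directory.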

For other components, like task instance pods that run to completion, we run
objinsync[^1] in pull-once mode, where it only pulls the required DAG from S3 once
before the Airflow component starts. This is implemented using the Airflow K8S
executor’s built-in git sync container feature. We are effectively replacing the git
invocation with objinsync[^1] in this case.

```
Environment variables for Airflow scheduler:

AIRFLOW__KUBERNETES__GIT_REPO=s3://<S3_DAG_BUCKET>/airflow_home/dags
AIRFLOW__KUBERNETES__GIT_SYNC_DEST=<YOUR_AIRFLOW_HOME>/dags
AIRFLOW__KUBERNETES__GIT_SYNC_ROOT=<YOUR_AIRFLOW_HOME>/dags
AIRFLOW__KUBERNETES__GIT_SYNC_CONTAINER_REPOSITORY=<DOCKER_REPO_FOR_GIT_SYNC_CONTAINER>
AIRFLOW__KUBERNETES__GIT_SYNC_CONTAINER_TAG=<DOCKER_TAG_FOR_GIT_SYNC_CONTAINER>
AIRFLOW__KUBERNETES__GIT_SYNC_RUN_AS_USER=0
AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT=<YOUR_AIRFLOW_HOME>/dags
# dummy branch value to stop Airflow from complaining
AIRFLOW__KUBERNETES__GIT_BRANCH=dummy


Entry point for the git sync container image:

/bin/objinsync pull --once "${GIT_SYNC_REPO}" "${GIT_SYNC_DEST}"
```

Objinsync[^1] is implemented in Go to keep its memory footprint very low. It also means
the synchronization code can leverage the powerful parallelism and concurrency
primitives of the Go runtime for better performance.

## Takeaway

Engineering is all about making the right trade-offs. I won’t claim what we have
is the perfect solution for everyone, but I do believe it strikes a good
balance between productivity, operability, and availability. If you have any
questions regarding the setup, I am available in Airflow’s
[#airflow-creative](https://apache-airflow.slack.com/messages/airflow-creative)
Slack channel under the handle QP. You can get access to the Slack workspace
through [this link](https://apache-airflow-slack.herokuapp.com/).

This is the second post in our series of [data pipeline
migration](https://tech.scribd.com/blog/2020/modernizing-an-old-data-pipeline.html)
blog posts. If this sounds interesting to you, the Core Platform team is
hiring!

Come join us if you love building scalable data/ML platforms using open source
technologies. :)


[^1]: Objinsync is a generic S3-to-local-filesystem sync daemon [open sourced](https://github.com/scribd/objinsync) by Scribd.