
Commit 4bb31aa

Copy-editing and re-titling the post
1 parent 48c652b commit 4bb31aa

2 files changed: +175 −157 lines

_posts/2020-02-27-breaking-the-mono-airflow-dag-repo.md

Lines changed: 0 additions & 157 deletions
This file was deleted.
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
---
layout: post
title: "Breaking up the Airflow DAG monorepo"
author: qphou
tags:
- airflow
- featured
- airflow-series
- datapipe
team: Core Platform
---

Creating a monorepo to store
[DAGs](https://airflow.apache.org/docs/stable/concepts.html#dags) is simple
and easy to get started with, but it is unlikely to scale as DAGs and the number of
developers working with them grow. In the Core Platform team, we're [bringing massive pre-existing DAGs
into Airflow](/blog/2020/modernizing-an-old-data-pipeline.html) and want to
support multiple teams and their orchestration needs in a manner that is simple
and easy to operate.

In this post, I'll share the approach we have taken to bring multi-repo DAG
development to [Apache Airflow](https://airflow.apache.org) and a new open
source daemon I have written to make it possible.

## Delivering DAGs

Every Airflow component expects the DAGs to be present in a local DAG folder,
accessed through a filesystem interface. There are three common approaches to meeting
this requirement:

* Bake DAGs into the Airflow Docker image
* Push DAGs to a networked filesystem like NFS and mount it as Airflow’s DAG
  folder
* Pull DAGs from the network into a local filesystem before starting Airflow
  components

Since Airflow expects the DAGs to be located in a single local folder,
these approaches are all commonly implemented with a monorepo that is baked,
pushed, or pulled.

Each of these approaches comes with its own set of trade-offs:

* Baking DAGs into Airflow images makes DAG deployment slow, because you need to
  rebuild and release a new Airflow image for every DAG change.
* Managing a networked filesystem for DAG sync can be overkill from a performance and
  operations point of view, considering that Airflow only requires read access.
* Pulling DAGs requires some deployment or other coordination to ensure that the
  local filesystem has been populated with the appropriate changes before
  starting Airflow.

We decided to go with the "pull" model, with AWS S3 as our "DAG source of
truth." S3 provides a highly available and easily managed location for our DAG
store, but in order to support our desired multi-repo approach to DAGs, we needed
to build our own tooling to coordinate synchronizing the local DAG store with
S3 objects from the multiple DAG repositories.

The tool we built, [objinsync](https://github.com/scribd/objinsync)[^1], is a
stateless DAG sync daemon, which is deployed as a sidecar container. From
Airflow’s point of view, the DAG folder is just a magical local folder that
always contains the up-to-date DAG definitions assembled from multiple Git
repos.

## The full CI/CD pipeline

To demonstrate how the whole setup works end to end, I think it’s best to walk
through the life cycle of a DAG file.

As shown below, the S3 DAG bucket acts as the bridge between Git repos and
Airflow:

![Using S3 as the bridge](/post-images/2020-03-airflow/s3-as-bridge.png)

All CI/CD pipelines publish their own DAG files into the same S3 bucket,
namespaced by repo name. On the other end, an objinsync instance deployed for each
Airflow component pulls the latest DAG files into the local filesystem for
Airflow to consume.
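
For illustration, the resulting bucket layout might look something like this (the repo and file names here are hypothetical, not our actual ones):

```
s3://<S3_DAG_BUCKET>/airflow_home/dags/
├── repo-a/
│   ├── ingest_dag.py
│   └── backfill_dag.py
└── repo-b/
    └── reporting_dag.py
```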

Zooming out to the full picture, all DAG updates go through a multi-stage
testing and promotion process before they eventually land in the production DAG
S3 bucket:

![DAG release pipeline](/post-images/2020-03-airflow/dag-release-pipeline.png)

This gives our engineers the ability to test out changes without worrying about
impacting production data pipelines.

Lastly, to simplify CI/CD setup, we built a small command line tool in Go to
automate the DAG-to-S3 release process. For any repo’s CI/CD pipeline, all an
engineer has to do is add a single step that runs the following command:

```
airflow-dag-push [dev,staging,production]
```

The airflow-dag-push tool automatically scans for DAG files in a special
folder named `workflow` under the root of the source tree and uploads them to the right
S3 bucket with the right key prefix, based on the provided environment name and
environment variables injected by the CI/CD system.
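
As an illustration of what such a push step involves, here is a minimal Go sketch. It is not the actual airflow-dag-push implementation; the per-environment bucket naming scheme and the `REPO_NAME` variable are assumptions made for the example:

```
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: airflow-dag-push [dev|staging|production]")
	}
	env := os.Args[1]

	// Hypothetical naming scheme: one bucket per environment, keys prefixed
	// by the repo name injected by the CI/CD system.
	bucket := fmt.Sprintf("example-airflow-dags-%s", env)
	repo := os.Getenv("REPO_NAME")
	if repo == "" {
		log.Fatal("REPO_NAME must be set by the CI/CD system")
	}

	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess)

	// Scan the `workflow` folder under the repo root and upload every DAG file.
	err := filepath.Walk("workflow", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".py") {
			return err
		}
		f, openErr := os.Open(path)
		if openErr != nil {
			return openErr
		}
		defer f.Close()

		key := fmt.Sprintf("airflow_home/dags/%s/%s", repo, filepath.Base(path))
		_, upErr := uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
			Body:   f,
		})
		if upErr == nil {
			log.Printf("uploaded %s to s3://%s/%s", path, bucket, key)
		}
		return upErr
	})
	if err != nil {
		log.Fatal(err)
	}
}
```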

## Implementation details

Our Airflow clusters are orchestrated using both ECS Fargate and EKS. ECS is
used to run the Airflow web server and scheduler, while EKS is what’s powering
Airflow’s Kubernetes executor. Due to differences between the Airflow
components, we need to run the `objinsync` binary in two container orchestration
platforms with slightly different setups.

For daemon Airflow components like the web server and scheduler, we run
`objinsync` in a continuous sync mode, where it pulls incremental updates from
S3 to the local filesystem every 5 seconds. This is implemented using the sidecar
container pattern: the DAG folder is mounted as a shared volume between the
Airflow web/scheduler container and the objinsync container. The sidecar
objinsync container is set up to run the following command:

```
/bin/objinsync pull s3://<S3_DAG_BUCKET>/airflow_home/dags <YOUR_AIRFLOW_HOME>/dags
```
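
Conceptually, the continuous sync mode boils down to "list the prefix, download what changed, repeat every few seconds." The Go sketch below illustrates that loop; it is not objinsync's actual code, the size-based change check is a deliberately crude placeholder, and the environment variable names are made up for the example:

```
package main

import (
	"log"
	"os"
	"path/filepath"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// syncOnce mirrors every object under the prefix into dagDir. The size-based
// "did it change?" check is a crude stand-in for real incremental sync logic.
func syncOnce(svc *s3.S3, dl *s3manager.Downloader, bucket, prefix, dagDir string) error {
	return svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		for _, obj := range page.Contents {
			rel := strings.TrimPrefix(*obj.Key, prefix)
			if rel == "" || strings.HasSuffix(rel, "/") {
				continue // skip directory placeholder objects
			}
			local := filepath.Join(dagDir, rel)
			if st, err := os.Stat(local); err == nil && st.Size() == *obj.Size {
				continue // looks unchanged, skip the download
			}
			if err := os.MkdirAll(filepath.Dir(local), 0755); err != nil {
				log.Printf("mkdir for %s: %v", local, err)
				continue
			}
			f, err := os.Create(local)
			if err != nil {
				log.Printf("create %s: %v", local, err)
				continue
			}
			if _, err := dl.Download(f, &s3.GetObjectInput{Bucket: aws.String(bucket), Key: obj.Key}); err != nil {
				log.Printf("download %s: %v", *obj.Key, err)
			}
			f.Close()
		}
		return true // keep paging through the listing
	})
}

func main() {
	// Names below are hypothetical; objinsync takes its source and destination
	// as command line arguments instead.
	bucket := os.Getenv("DAG_BUCKET")      // e.g. <S3_DAG_BUCKET>
	dagDir := os.Getenv("AIRFLOW_DAG_DIR") // e.g. <YOUR_AIRFLOW_HOME>/dags
	prefix := "airflow_home/dags/"

	sess := session.Must(session.NewSession())
	svc := s3.New(sess)
	dl := s3manager.NewDownloader(sess)

	// Pull every 5 seconds so the shared DAG volume stays up to date for the
	// Airflow container running next to this sidecar.
	for range time.Tick(5 * time.Second) {
		if err := syncOnce(svc, dl, bucket, prefix, dagDir); err != nil {
			log.Printf("sync failed, will retry on next tick: %v", err)
		}
	}
}
```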

For other components, like task instance pods that run to completion, we run
`objinsync` in pull-once mode, where it only pulls the required DAG from S3 once
before the Airflow component starts. This is implemented using the Airflow Kubernetes
executor’s built-in git sync container feature. We are effectively replacing the git
invocation with `objinsync` in this case.

**Environment variables for Airflow scheduler:**

```
AIRFLOW__KUBERNETES__GIT_REPO=s3://<S3_DAG_BUCKET>/airflow_home/dags
AIRFLOW__KUBERNETES__GIT_SYNC_DEST=<YOUR_AIRFLOW_HOME>/dags
AIRFLOW__KUBERNETES__GIT_SYNC_ROOT=<YOUR_AIRFLOW_HOME>/dags
AIRFLOW__KUBERNETES__GIT_SYNC_CONTAINER_REPOSITORY=<DOCKER_REPO_FOR_GIT_SYNC_CONTAINER>
AIRFLOW__KUBERNETES__GIT_SYNC_CONTAINER_TAG=<DOCKER_TAG_FOR_GIT_SYNC_CONTAINER>
AIRFLOW__KUBERNETES__GIT_SYNC_RUN_AS_USER=0
AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT=<YOUR_AIRFLOW_HOME>/dags
# dummy branch value to stop Airflow from complaining
AIRFLOW__KUBERNETES__GIT_BRANCH=dummy
```

**Entry point for the git sync container image:**

```
/bin/objinsync pull --once "${GIT_SYNC_REPO}" "${GIT_SYNC_DEST}"
```

`objinsync` is implemented in [Go](https://golang.org/) to keep its memory footprint low. It also means
the synchronization code can leverage the powerful parallelism and concurrency
primitives of the Go runtime for better performance.
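
As a toy illustration of what those primitives buy you (this is not objinsync code, and `downloadKey` is a hypothetical stand-in for fetching one S3 object), fanning a batch of downloads out across goroutines takes only a few lines:

```
package main

import (
	"fmt"
	"sync"
)

// downloadKey is a hypothetical stand-in for fetching one S3 object to disk.
func downloadKey(key string) error {
	fmt.Println("downloading", key)
	return nil
}

func main() {
	keys := []string{"repo-a/ingest_dag.py", "repo-a/backfill_dag.py", "repo-b/reporting_dag.py"}

	// Fan out one goroutine per object and collect errors over a channel.
	var wg sync.WaitGroup
	errs := make(chan error, len(keys))
	for _, key := range keys {
		wg.Add(1)
		go func(k string) {
			defer wg.Done()
			errs <- downloadKey(k)
		}(key)
	}
	wg.Wait()
	close(errs)

	for err := range errs {
		if err != nil {
			fmt.Println("download failed:", err)
		}
	}
}
```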

## Takeaway

Engineering is all about making the right trade-offs. I won’t claim what we have
is the perfect solution for everyone, but I do believe it strikes a good
balance between productivity, operability, and availability. If you have any
questions regarding the setup, I am available in Airflow’s
[#airflow-creative](https://apache-airflow.slack.com/messages/airflow-creative)
Slack channel under the handle "QP." If you are not already part of the Airflow
Slack community, you can get access via
[this link](https://apache-airflow-slack.herokuapp.com/).

This is the second post in our [data pipeline
migration](/blog/2020/modernizing-an-old-data-pipeline.html) series.
The Core Platform team is building scalable data and ML tools, with open source
technology, to enable exciting new products at Scribd. If that sounds
interesting to you, [we're hiring](/careers/#open-positions)! :)


[^1]: ObjInSync is a generic S3 to local filesystem sync daemon [open sourced](https://github.com/scribd/objinsync) by Scribd.
