---
layout: post
title: "Breaking up the Airflow DAG monorepo"
author: qphou
tags:
- airflow
- featured
- airflow-series
- datapipe
team: Core Platform
---


Creating a monorepo to store
[DAGs](https://airflow.apache.org/docs/stable/concepts.html#dags) is simple
and easy to get started with, but unlikely to scale as the number of DAGs and
developers working on them grows. In the Core Platform team, we're [bringing massive pre-existing DAGs
into Airflow](/blog/2020/modernizing-an-old-data-pipeline.html) and want to
support multiple teams and their orchestration needs in a manner that is simple
and easy to operate.

In this post, I'll share the approach we have taken to bring multi-repo DAG
development to [Apache Airflow](https://airflow.apache.org) and a new open
source daemon I have written to make it possible.


## Delivering DAGs

Every Airflow component expects the DAGs to be present in a local DAG folder,
accessed through a filesystem interface. There are three common approaches to
meet this requirement:

* Bake DAGs into the Airflow Docker image
* Push DAGs to a networked filesystem like NFS and mount it as Airflow’s DAG
  folder
* Pull DAGs from the network into a local filesystem before starting Airflow
  components

Since Airflow expects the DAGs to be located in a single local folder, these
approaches are all commonly implemented with a monorepo that is baked, pushed,
or pulled.

Each of these approaches comes with its own set of trade-offs:

* Baking DAGs into Airflow images makes DAG deployment slow, because you need to
  rebuild and release a new Airflow image for every DAG change.
* Managing a networked filesystem for DAG sync can be overkill from a performance
  and operations point of view, considering that Airflow only requires read access.
* Pulling DAGs requires a deployment step or some other coordination to ensure that
  the local filesystem has been populated with the appropriate changes before
  starting Airflow.

We decided to go with the "pull" model, with AWS S3 as our "DAG source of
truth." S3 provides a highly available and easily managed location for our DAG
store, but in order to support our desired multi-repo approach to DAGs, we needed
to build our own tooling to coordinate syncing the local DAG folder with
S3 objects coming from multiple DAG repositories.

The tool we built, [objinsync](https://github.com/scribd/objinsync) [^1], is a
stateless DAG sync daemon, deployed as a sidecar container. From
Airflow’s point of view, the DAG folder is just a magical local folder that
always contains the up-to-date DAG definitions assembled from multiple Git
repos.
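
To make the "pull" model concrete, here is a minimal Go sketch of the core
idea: mirror every object under an S3 prefix into a local directory. The
bucket, prefix, and destination path are placeholders, and this is not
`objinsync`'s actual code; the real daemon adds incremental diffing, deletion
handling, retries, and metrics on top of this.

```
// Minimal sketch of the "pull" model: mirror every object under an S3 prefix
// into a local directory. Bucket, prefix, and destination are placeholders.
package main

import (
	"log"
	"os"
	"path/filepath"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	bucket, prefix, dest := "my-dag-bucket", "airflow_home/dags/", "/usr/local/airflow/dags"

	sess := session.Must(session.NewSession())
	downloader := s3manager.NewDownloader(sess)

	// Walk every key under the prefix and copy the object into the DAG folder.
	err := s3.New(sess).ListObjectsV2Pages(
		&s3.ListObjectsV2Input{Bucket: aws.String(bucket), Prefix: aws.String(prefix)},
		func(page *s3.ListObjectsV2Output, lastPage bool) bool {
			for _, obj := range page.Contents {
				if strings.HasSuffix(*obj.Key, "/") {
					continue // skip directory placeholder keys
				}
				local := filepath.Join(dest, strings.TrimPrefix(*obj.Key, prefix))
				if err := os.MkdirAll(filepath.Dir(local), 0755); err != nil {
					log.Fatal(err)
				}
				f, err := os.Create(local)
				if err != nil {
					log.Fatal(err)
				}
				if _, err := downloader.Download(f, &s3.GetObjectInput{
					Bucket: aws.String(bucket),
					Key:    obj.Key,
				}); err != nil {
					log.Fatal(err)
				}
				f.Close()
			}
			return true // continue paginating
		})
	if err != nil {
		log.Fatal(err)
	}
}
```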


## The full CI/CD pipeline

To demonstrate how the whole setup works end to end, I think it’s best to walk
through the life cycle of a DAG file.

As shown below, the S3 DAG bucket acts as the bridge between Git repos and
Airflow:



All CI/CD pipelines publish their own DAG files into the same S3 bucket,
namespaced by the repo name. On the other end, an objinsync process deployed
alongside each Airflow component pulls the latest DAG files into the local
filesystem for Airflow to consume.
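
The exact key layout inside the bucket is an implementation detail, but for
illustration, with two hypothetical repos it might look something like this:

```
s3://<S3_DAG_BUCKET>/airflow_home/dags/repo-one/some_dag.py
s3://<S3_DAG_BUCKET>/airflow_home/dags/repo-two/another_dag.py
```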

Zooming out to the full picture, all DAG updates go through a multi-stage
testing and promotion process before they eventually land in the production DAG
S3 bucket:



This gives our engineers the ability to test out changes without worrying about
impacting production data pipelines.

Lastly, to simplify CI/CD setup, we built a small command line tool in Go to
automate the DAG-to-S3 release process. For any repo’s CI/CD pipeline, all an
engineer has to do is add a single step that runs the following command:

```
airflow-dag-push [dev,staging,production]
```

The airflow-dag-push tool automatically scans for DAG files in a special
folder named `workflow` under the root of the source tree and uploads them to
the right S3 bucket with the right key prefix, based on the provided environment
name and environment variables injected by the CI/CD system.
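
For a rough picture of what a tool like this might do, here is a Go sketch,
not `airflow-dag-push`'s actual implementation: walk the `workflow` folder and
upload each DAG file to an environment-specific bucket under a repo-name
prefix. The bucket names, the `REPO_NAME` variable, and the key layout here
are all hypothetical.

```
// Illustrative sketch: upload every DAG file under ./workflow to an
// environment-specific S3 bucket, namespaced by repo name. Bucket names,
// REPO_NAME, and the key layout are hypothetical.
package main

import (
	"log"
	"os"
	"path"
	"path/filepath"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

var buckets = map[string]string{ // hypothetical per-environment buckets
	"dev":        "dev-dag-bucket",
	"staging":    "staging-dag-bucket",
	"production": "production-dag-bucket",
}

func main() {
	env := os.Args[1]              // e.g. "staging"
	repo := os.Getenv("REPO_NAME") // injected by the CI/CD system (hypothetical)
	uploader := s3manager.NewUploader(session.Must(session.NewSession()))

	err := filepath.Walk("workflow", func(p string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(p, ".py") {
			return err
		}
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()

		rel, _ := filepath.Rel("workflow", p)
		_, err = uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String(buckets[env]),
			Key:    aws.String(path.Join("airflow_home/dags", repo, rel)),
			Body:   f,
		})
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```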


## Implementation details

Our Airflow clusters are orchestrated using both ECS Fargate and EKS. ECS is
used to run the Airflow web server and scheduler, while EKS is what powers
Airflow’s Kubernetes executor. Because these Airflow components have different
runtime characteristics, we need to run the `objinsync` binary on two container
orchestration platforms with slightly different setups.

For long-running Airflow components like the web server and scheduler, we run
`objinsync` in a continuous sync mode, where it pulls incremental updates from
S3 to the local filesystem every 5 seconds. This is implemented using the
sidecar container pattern: the DAG folder is mounted as a shared volume between
the Airflow web/scheduler container and the objinsync container, and the sidecar
objinsync container is set up to run the following command:

```
/bin/objinsync pull s3://<S3_DAG_BUCKET>/airflow_home/dags <YOUR_AIRFLOW_HOME>/dags
```
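
Conceptually, the continuous mode is just a one-shot pull repeated on a short
timer. A minimal sketch of that loop, with `pullOnce` standing in for the
S3-to-local mirroring shown earlier:

```
// Sketch of a continuous sync loop: pull on start, then again every 5 seconds.
package main

import (
	"log"
	"time"
)

// pullOnce is a placeholder for the actual S3-to-local sync step.
func pullOnce() error {
	log.Println("pulling DAGs from S3...")
	return nil
}

func main() {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for ; ; <-ticker.C { // run immediately, then once per tick
		if err := pullOnce(); err != nil {
			log.Printf("sync failed, will retry on next tick: %v", err)
		}
	}
}
```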

For other components, like the task instance pods that run to completion, we run
`objinsync` in pull-once mode, where it only pulls the required DAG files from
S3 once before the Airflow component starts. This is implemented using the
Airflow Kubernetes executor’s built-in git-sync container feature; we are
effectively replacing the git invocation with `objinsync` in this case.

**Environment variables for the Airflow scheduler:**

```
AIRFLOW__KUBERNETES__GIT_REPO=s3://<S3_DAG_BUCKET>/airflow_home/dags
AIRFLOW__KUBERNETES__GIT_SYNC_DEST=<YOUR_AIRFLOW_HOME>/dags
AIRFLOW__KUBERNETES__GIT_SYNC_ROOT=<YOUR_AIRFLOW_HOME>/dags
AIRFLOW__KUBERNETES__GIT_SYNC_CONTAINER_REPOSITORY=<DOCKER_REPO_FOR_GIT_SYNC_CONTAINER>
AIRFLOW__KUBERNETES__GIT_SYNC_CONTAINER_TAG=<DOCKER_TAG_FOR_GIT_SYNC_CONTAINER>
AIRFLOW__KUBERNETES__GIT_SYNC_RUN_AS_USER=0
AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT=<YOUR_AIRFLOW_HOME>/dags
# dummy branch value to stop Airflow from complaining
AIRFLOW__KUBERNETES__GIT_BRANCH=dummy
```


**Entry point for the git-sync container image** (the Kubernetes executor
populates `GIT_SYNC_REPO` and `GIT_SYNC_DEST` for this container from the
settings above):

```
/bin/objinsync pull --once "${GIT_SYNC_REPO}" "${GIT_SYNC_DEST}"
```

`objinsync` is implemented in [Go](https://golang.org/) to keep its memory
footprint low. It also means the synchronization code can leverage the powerful
parallelism and concurrency primitives of the Go runtime for better performance.
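
As a taste of what that buys us, the per-object downloads in a sync pass can be
fanned out across a bounded pool of goroutines using nothing but the standard
library. The worker count, the key list, and the `download` stub below are all
illustrative:

```
// Sketch of a bounded worker pool: a few goroutines draining a channel of S3
// keys. download stands in for the per-object GetObject call.
package main

import (
	"log"
	"sync"
)

func download(key string) {
	log.Printf("downloading %s", key) // placeholder for the real S3 download
}

func main() {
	keys := []string{"repo-one/some_dag.py", "repo-two/another_dag.py"} // hypothetical keys

	work := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < 4; i++ { // 4 concurrent downloaders, an arbitrary bound
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range work {
				download(key)
			}
		}()
	}

	for _, k := range keys {
		work <- k
	}
	close(work)
	wg.Wait()
}
```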


## Takeaway

Engineering is all about making the right trade-offs. I won’t claim what we have
is the perfect solution for everyone, but I do believe it strikes a good
balance between productivity, operability, and availability. If you have any
questions about the setup, I am available in Airflow’s
[#airflow-creative](https://apache-airflow.slack.com/messages/airflow-creative)
Slack channel under the handle "QP." If you are not already part of the Airflow
Slack community, you can get access via
[this link](https://apache-airflow-slack.herokuapp.com/).

This is the second post in our series on [data pipeline
migration](/blog/2020/modernizing-an-old-data-pipeline.html).
The Core Platform team is building scalable data and ML tools, with open source
technology, to enable exciting new products at Scribd. If that sounds
interesting to you, [we're hiring](/careers/#open-positions)! :)



[^1]: objinsync is a generic S3-to-local-filesystem sync daemon [open sourced](https://github.com/scribd/objinsync) by Scribd.