---
layout: post
title: "Breaking the mono Airflow DAG repo"
author: qphou
tags:
- airflow
- featured
- datapipe
team: Core Platform
---

Hopefully you are using some kind of version control system to manage all of
your Airflow DAGs. If you are, then it’s very likely that all of your DAGs are
managed in one single repository.

A mono DAG repo is simple and easy to operate to start with, but it’s not hard
to see that it doesn’t scale well as the number of DAGs and engineers grows.

In this post, I would like to share what Scribd’s Core Platform team did to
make multi-repo DAG development a reality for Airflow.


## Delivering DAGs

Every Airflow component expects all of the DAGs to be present in a local DAG
folder, accessed through a filesystem interface. There are three common
approaches to meet this requirement:

* Bake DAGs into the Airflow Docker image
* Sync DAGs to a networked file system like NFS and mount it as Airflow’s DAG
folder
* Pull DAGs from the network into a local file system before starting Airflow
components

Each of the above approaches comes with its own trade-offs. Baking DAGs into
the Airflow image makes DAG deployment slow, because you need to rebuild and
release a new Airflow image for every DAG change. Managing a networked file
system for DAG sync seems like overkill from a performance and operations
point of view, given that Airflow only requires read access.

We decided to go with the pulling model. By leveraging AWS S3, we get a high
uptime guarantee for our DAG store. We also made the whole process transparent
to Airflow by running [objinsync](https://github.com/scribd/objinsync) [^1], a
stateless DAG sync daemon, as a sidecar container. From Airflow’s point of
view, the DAG folder is just a magical local folder that always contains the
up-to-date DAG definitions assembled from multiple Git repos.


## The full CI/CD pipeline

To demonstrate how the whole setup works end to end, I think it’s best to walk
through the life cycle of a DAG file.

As shown below, the S3 DAG bucket acts as the bridge between Git repos and
Airflow:

All CI/CD pipelines publish their own DAG files into the same S3 bucket,
namespaced by the repo name. On the other end, an objinsync[^1] instance
deployed alongside each Airflow component pulls the latest DAG files into the
local file system for Airflow to consume.
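
To make the namespacing concrete, here is a minimal Go sketch of how a
per-repo S3 key could be derived. The `airflow_home/dags/<repo>/<file>` layout
and the example repo name are assumptions for illustration only, not the exact
scheme our tooling uses:

```
package main

import (
	"fmt"
	"path"
)

// dagKey derives the S3 key for a DAG file published by a given repo.
// The "airflow_home/dags/<repo>/<file>" layout is a hypothetical example
// of namespacing by repo name; the real prefix scheme may differ.
func dagKey(repoName, dagFile string) string {
	return path.Join("airflow_home/dags", repoName, dagFile)
}

func main() {
	// e.g. airflow_home/dags/payments-pipelines/daily_report.py
	fmt.Println(dagKey("payments-pipelines", "daily_report.py"))
}
```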

Zooming out to the full picture, all DAG updates go through a multi-stage
testing and promotion process before they eventually land in the production
DAG S3 bucket:

This gives our engineers the ability to test out changes without worrying about
impacting production data pipelines.

Lastly, to simplify CI/CD setup, we built a small command line tool in Go to
automate the DAG-to-S3 release process. For any repo’s CI/CD pipeline, all an
engineer has to do is add a single step that runs the following command:

```
airflow-dag-push [dev,staging,production]
```

The `airflow-dag-push` tool automatically scans for DAG files in a special
folder named `workflow` under the root of the source tree and uploads them to
the right S3 bucket with the right key prefix, based on the provided
environment name and environment variables injected by the CI/CD system.
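
For a rough idea of what that push step involves, here is a hedged Go sketch
that walks the `workflow` folder and uploads each DAG file to an
environment-specific bucket. The bucket names, the `REPO_NAME` environment
variable, and the key layout are assumptions for this sketch, not the actual
implementation of `airflow-dag-push`:

```
package main

import (
	"io/fs"
	"log"
	"os"
	"path"
	"path/filepath"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// Hypothetical mapping from environment name to DAG bucket.
var buckets = map[string]string{
	"dev":        "my-dag-bucket-dev",
	"staging":    "my-dag-bucket-staging",
	"production": "my-dag-bucket-production",
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: dag-push [dev,staging,production]")
	}
	env := os.Args[1]
	repo := os.Getenv("REPO_NAME") // assumed to be injected by the CI/CD system
	bucket, ok := buckets[env]
	if !ok {
		log.Fatalf("unknown environment: %s", env)
	}

	uploader := s3manager.NewUploader(session.Must(session.NewSession()))

	// Scan the workflow/ folder and upload every DAG file it contains.
	err := filepath.WalkDir("workflow", func(p string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || filepath.Ext(p) != ".py" {
			return err
		}
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()

		key := path.Join("airflow_home/dags", repo, filepath.Base(p))
		if _, err := uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
			Body:   f,
		}); err != nil {
			return err
		}
		log.Printf("uploaded %s to s3://%s/%s", p, bucket, key)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```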


## Implementation details

Our Airflow clusters are orchestrated using both ECS Fargate and EKS. ECS is
used to run the Airflow web server and scheduler, while EKS is what’s powering
Airflow’s Kubernetes executor. Due to differences between these Airflow
components, we need to run the objinsync[^1] binary in two container
orchestration platforms with slightly different setups.

For daemon Airflow components like the web server and scheduler, we run
objinsync[^1] in continuous sync mode, where it pulls incremental updates from
S3 to the local filesystem every 5 seconds. This is implemented using the
sidecar container pattern. The DAG folder is mounted as a shared volume between
the Airflow web/scheduler container and the objinsync[^1] container. The
sidecar objinsync container is set up to run the following command:

```
/bin/objinsync pull s3://<S3_DAG_BUCKET>/airflow_home/dags <YOUR_AIRFLOW_HOME>/dags
```
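
Conceptually, the continuous sync mode boils down to a small pull loop: list
the prefix, download the objects into the DAG folder, sleep, repeat. The Go
sketch below approximates that behavior under assumed bucket and path names;
the real objinsync is smarter (it skips unchanged files and handles
deletions), so treat this as an illustration only:

```
package main

import (
	"log"
	"os"
	"path/filepath"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// pullOnce downloads every object under the given prefix into destDir.
func pullOnce(svc *s3.S3, dl *s3manager.Downloader, bucket, prefix, destDir string) error {
	return svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		for _, obj := range page.Contents {
			if strings.HasSuffix(*obj.Key, "/") {
				continue // skip "directory" placeholder keys
			}
			rel := strings.TrimPrefix(*obj.Key, prefix)
			local := filepath.Join(destDir, rel)
			if err := os.MkdirAll(filepath.Dir(local), 0o755); err != nil {
				log.Printf("mkdir for %s: %v", local, err)
				continue
			}
			f, err := os.Create(local)
			if err != nil {
				log.Printf("create %s: %v", local, err)
				continue
			}
			if _, err := dl.Download(f, &s3.GetObjectInput{
				Bucket: aws.String(bucket),
				Key:    obj.Key,
			}); err != nil {
				log.Printf("download %s: %v", *obj.Key, err)
			}
			f.Close()
		}
		return true // keep paginating
	})
}

func main() {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)
	dl := s3manager.NewDownloader(sess)

	// Re-sync the DAG folder every 5 seconds, mirroring the sidecar behavior.
	// Bucket name and paths below are placeholders for this sketch.
	for range time.Tick(5 * time.Second) {
		if err := pullOnce(svc, dl, "my-dag-bucket", "airflow_home/dags/", "/airflow/dags"); err != nil {
			log.Printf("sync failed: %v", err)
		}
	}
}
```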

For other components, like task instance pods that run to completion, we run
objinsync[^1] in pull-once mode, where it only pulls the required DAG from S3
once before the Airflow component starts. This is implemented using the Airflow
K8S executor’s built-in git sync container feature. We are effectively
replacing the git invocation with objinsync[^1] in this case.

```
Environment variables for Airflow scheduler:

AIRFLOW__KUBERNETES__GIT_REPO=s3://<S3_DAG_BUCKET>/airflow_home/dags
AIRFLOW__KUBERNETES__GIT_SYNC_DEST=<YOUR_AIRFLOW_HOME>/dags
AIRFLOW__KUBERNETES__GIT_SYNC_ROOT=<YOUR_AIRFLOW_HOME>/dags
AIRFLOW__KUBERNETES__GIT_SYNC_CONTAINER_REPOSITORY=<DOCKER_REPO_FOR_GIT_SYNC_CONTAINER>
AIRFLOW__KUBERNETES__GIT_SYNC_CONTAINER_TAG=<DOCKER_TAG_FOR_GIT_SYNC_CONTAINER>
AIRFLOW__KUBERNETES__GIT_SYNC_RUN_AS_USER=0
AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT=<YOUR_AIRFLOW_HOME>/dags
# dummy branch value to stop Airflow from complaining
AIRFLOW__KUBERNETES__GIT_BRANCH=dummy


Entry point for the git sync container image:

/bin/objinsync pull --once "${GIT_SYNC_REPO}" "${GIT_SYNC_DEST}"
```

Objinsync[^1] is implemented in Go to keep its memory footprint very low. It
also means the synchronization code can leverage the powerful parallelism and
concurrency primitives of the Go runtime for better performance.
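
As a rough illustration of what those primitives buy us, the sketch below fans
file syncs out to a bounded pool of goroutines, a common Go pattern for doing
many downloads concurrently. The `fetch` function is a placeholder rather than
objinsync’s actual download code:

```
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetch is a placeholder for downloading a single object from S3.
func fetch(key string) {
	time.Sleep(100 * time.Millisecond) // simulate network I/O
	fmt.Println("synced", key)
}

func main() {
	keys := []string{"dags/a.py", "dags/b.py", "dags/c.py", "dags/d.py"}

	var wg sync.WaitGroup
	sem := make(chan struct{}, 2) // at most 2 downloads in flight

	for _, key := range keys {
		wg.Add(1)
		go func(k string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			fetch(k)
		}(key)
	}
	wg.Wait()
}
```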


## Takeaway

Engineering is all about making the right trade-offs. I won’t claim what we
have is the perfect solution for everyone, but I do believe it strikes a good
balance between productivity, operability, and availability. If you have any
questions regarding the setup, I am available in Airflow’s
[#airflow-creative](https://apache-airflow.slack.com/messages/airflow-creative)
Slack channel under the handle QP. You can get access to the Slack workspace
through [this link](https://apache-airflow-slack.herokuapp.com/).

This is the second post in our series of [data pipeline
migration](https://tech.scribd.com/blog/2020/modernizing-an-old-data-pipeline.html)
blog posts. If this sounds interesting to you, the Core Platform team is
hiring!

Come join us if you love building scalable data/ML platforms using open source
technologies. :)


[^1]: Objinsync is a generic S3-to-local-filesystem sync daemon [open sourced](https://github.com/scribd/objinsync) by Scribd.