---
layout: post
title: "Modernizing a decade-old data pipeline"
author: qphou
tags:
- airflow
- spark
- featured
- datapipe
team: Core Platform
---

Our massive data pipeline has helped us process enormous amounts of information
over the past decade, all to help our users discover, read, and learn.
In this blog series, I will share how we're upgrading our data pipeline to
give internal customers faster and more reliable results.

The data pipeline is currently managed by a home-grown workflow
orchestration system written in Ruby called "Datapipe." The first commit to our
data pipeline repo dates all the way back to 2010. We created it
around the time when everybody else was building their own orchestration tools,
such as Pinterest's [Pinball](https://github.com/pinterest/pinball), Spotify's
[Luigi](https://github.com/spotify/luigi), and Airbnb's
[Airflow](https://airflow.apache.org/). These tools all perform the same
basic function: process and execute a directed acyclic graph (DAG) of "work,"
typically associated with ETL data pipelines.
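
For readers who haven't used one of these tools, here is a minimal sketch of
what a DAG looks like in Airflow. The DAG and task names are made up for
illustration:

```python
# A minimal Airflow DAG: three ETL steps wired into a graph.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(dag_id="example_etl",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily") as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # ">>" declares the graph's edges: extract -> transform -> load.
    extract >> transform >> load
```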

Today, we have 1,500+ tasks and 14 DAGs, with the majority of tasks globbed
together in one giant DAG containing more than 1,400 tasks.

Datapipe has served us well and brought the company
to where it is today. However, it has been in maintenance mode for some time.
As a result, it's struggling to meet the needs of Scribd's fast-growing
engineering team. Since [Scribd is moving more and more into the
cloud](/blog/2019/migrating-kafka-to-aws.html), we
decided that now is a good time to step back and redesign the system for the
future.

We need a modernized workflow orchestration system to help drastically improve
productivity and unlock the capability to build new product features that were
not previously possible.


## Opportunity for improvement

Here are some of the areas where we think improvements would have a big impact
on the organization:

**Flexibility:** The in-house system can only schedule runs at the granularity
of one day, which limits the freshness of our data. To unlock new
applications, we need to let engineers define schedules with more
flexibility and granularity.
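
For comparison, Airflow accepts cron expressions and timedeltas as schedules
out of the box, so sub-daily cadences are easy to express. A minimal sketch
with illustrative DAG names:

```python
# Airflow schedules can be cron expressions or timedeltas, so a DAG
# can run hourly or every few minutes instead of only once a day.
from datetime import datetime, timedelta

from airflow import DAG

hourly = DAG(
    dag_id="hourly_refresh",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 * * * *",  # top of every hour
)

frequent = DAG(
    dag_id="frequent_refresh",
    start_date=datetime(2020, 1, 1),
    schedule_interval=timedelta(minutes=15),  # every 15 minutes
)
```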

**Productivity:** Culturally, we would like to shift from mono-repo to multi-repo.
Needless to say, putting all the workflow definitions in a single file is not
scalable. Our workflow config today already contains 6,000 lines of code and is
still growing. By building tooling to support a multi-repo setup, we hope to
reduce coupling and speed up development cycles.
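
There are several ways to stitch DAGs from multiple repositories into a single
Airflow deployment. One common pattern, shown purely as an illustration and not
necessarily the tooling we built, is to ship each team's DAGs as an installable
Python package and surface them through a small loader placed in the
scheduler's DAGs folder. The package names below are hypothetical:

```python
# Hypothetical loader: collects DAG objects from team-owned packages
# so Airflow's DagBag discovers them all from this one file.
from airflow.models import DAG

import team_search.dags    # hypothetical package from one repo
import team_payments.dags  # hypothetical package from another repo

for module in (team_search.dags, team_payments.dags):
    for name, value in vars(module).items():
        if isinstance(value, DAG):
            # Airflow only picks up DAGs bound to this file's globals.
            unique_name = f"{module.__name__}.{name}".replace(".", "_")
            globals()[unique_name] = value
```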

**Ownership:** Today, we have dedicated engineers keeping eyes on nightly runs to
notify workflow owners if anything goes wrong. The web UI doesn't support some
of the common maintenance actions, like killing a running task. This, combined
with the lack of built-in monitoring and alerting support within the
orchestration system, means that even if workflow owners want to take full
ownership of their tasks, there is no easy way to accomplish it. We need to
flip this around and empower workflow owners to take care of their own tasks
end to end. This is the only scalable way forward.
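
Airflow gives us a building block for exactly this: per-task callbacks, which
let each team route alerts for its own tasks. A minimal sketch, using a
stand-in notification helper rather than a real PagerDuty or Slack integration:

```python
# Sketch: route failure alerts to the owning team instead of a central
# babysitter. on_failure_callback and the context dict are stock Airflow;
# notify_owning_team() is a stand-in for a real alerting integration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def notify_owning_team(summary):
    print(f"ALERT: {summary}")  # imagine PagerDuty/Slack/email here


def page_owner(context):
    """Airflow invokes this with the task context when a task fails."""
    ti = context["task_instance"]
    notify_owning_team(f"{ti.dag_id}.{ti.task_id} failed "
                       f"for {context['execution_date']}")


with DAG(dag_id="owned_by_one_team",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily") as dag:
    BashOperator(
        task_id="nightly_job",
        bash_command="echo run nightly job",
        on_failure_callback=page_owner,  # the owner gets paged directly
    )
```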

**Scalability and availability:** The orchestration system should be able to handle
the scale of our data pipeline for many years to come. It should also be highly
available and function without issue when a minority of the cluster goes down.

**Operability:** Minor failures in the pipeline should not impact the rest of the
pipeline. Recovering failed tasks should be easy and fully self-service.
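
Airflow's built-in retry support illustrates the kind of self-service recovery
we're after; the task below is a made-up example:

```python
# Sketch: a flaky task that heals itself. If it still fails after its
# retries, only its downstream tasks are held back, and the owner can
# clear and re-run it from the web UI.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(dag_id="self_healing_example",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily") as dag:
    BashOperator(
        task_id="flaky_extract",
        bash_command="curl -sf https://example.com/export",  # may fail transiently
        retries=3,                         # retry up to three times...
        retry_delay=timedelta(minutes=5),  # ...waiting five minutes in between
    )
```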

**Extensibility:** It's not surprising that after many years of development, the
in-house system comes with many unique and useful features, like cross-date
dependencies. It should be easy to develop and maintain custom features for the
new system.
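
As an illustration of the concept, stock Airflow can express a simple
cross-date dependency with `ExternalTaskSensor` and an `execution_delta`,
though replicating Datapipe's behavior in full still required custom plugins.
A sketch with hypothetical DAG and task names:

```python
# Today's report waits on *yesterday's* run of an upstream DAG.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG(dag_id="downstream_report",
         start_date=datetime(2020, 1, 2),
         schedule_interval="@daily") as dag:
    ExternalTaskSensor(
        task_id="wait_for_yesterdays_load",
        external_dag_id="upstream_load",
        external_task_id="load",
        # execution_delta shifts which execution date we wait on:
        # this run depends on the run one day earlier.
        execution_delta=timedelta(days=1),
    )
```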

**Cloud native:** As Scribd migrates its infrastructure from a datacenter to the
cloud, the new system will need to run smoothly in the cloud and integrate
nicely with various SaaS offerings like Datadog, PagerDuty, and Sentry.

We basically had two options: retrofit Datapipe or
pick a well-maintained open source project as the building block. After lots of
prototyping and careful evaluation, we decided to adopt [Apache Airflow](https://airflow.apache.org).


## Super-charging Airflow

I wish adopting Airflow were as simple as doing a `pip install` and pointing
the config at an RDS endpoint. It turns out we had to do a lot of preparation
work to make it meet all our requirements. Just to name a few:

* Implement a scalable and highly available setup leveraging both ECS and EKS
* Write tooling to support defining DAGs in multiple repositories
* Scale Airflow to handle our gigantic DAG
* Create custom Airflow plugins to replicate some of the unique features from the in-house system
* Build a DAG delivery pipeline with a focus on speed and separation of environments
* Monitor Airflow itself, as well as DAGs and tasks, with Datadog, PagerDuty, and Sentry (see the sketch below)
* Execute a multi-stage workflow migration from the in-house system

Each one of the above items warrants a blog post of its own. We will be sharing
what we have learned in more detail throughout this series of blog posts.
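
As a small taste of the monitoring item above: Airflow can ship its own
scheduler and executor metrics over StatsD, and for task-level metrics one
option is the `datadog` DogStatsD client in a task callback. The metric name
and tags below are illustrative, not our production setup:

```python
# Sketch: report task durations to Datadog from an on_success_callback.
from datadog import statsd


def record_task_duration(context):
    """Attach as on_success_callback to emit a duration metric per task."""
    ti = context["task_instance"]
    statsd.histogram(
        "datapipe.task.duration",  # illustrative metric name
        ti.duration,               # seconds, set once the task finishes
        tags=[f"dag:{ti.dag_id}", f"task:{ti.task_id}"],
    )
```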

At Scribd, we embrace open source and try to contribute back to the community
as much as we can. Since the start of this internal project, we have contributed
[more than 20 patches
upstream](https://github.com/apache/airflow/pulls?utf8=%E2%9C%93&q=is%3Apr+author%3Ahouqp)
to Airflow, including EKS support, PagerDuty hooks, and many bug fixes and
performance improvements. We hope to continue this trend and contribute more as
the project progresses.


If this sounds interesting to you, the Core Platform team is hiring!

Come join us if you love building scalable data/ML platforms using open source
technologies. :)