
Commit 2b21f54

Add @houqp's first blog post (yey)

1 parent e4dd6ac commit 2b21f54

File tree

3 files changed: +126 -0 lines changed

_data/authors.yml

Lines changed: 4 additions & 0 deletions

@@ -66,3 +66,7 @@ christianw:
 trinityx:
   name: Trinity Xia
   github: hnaoto
+
+qphou:
+  name: QP Hou
+  github: houqp
Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@

---
layout: post
title: "Modernizing a decades-old data pipeline"
author: qphou
tags:
- airflow
- spark
- featured
- datapipe
team: Core Platform
---

Our massive data pipeline has helped us process enormous amounts of information over the past decade, all to help our users discover, read, and learn. In this blog series, I will share how we're upgrading our data pipeline to give internal customers faster and more reliable results.

The data pipeline is currently managed by a home-grown workflow orchestration system written in Ruby called "Datapipe." The first commit of our data pipeline repo dates all the way back to 2010. We created it around the time when everybody else was building their own orchestration tools, such as Pinterest's [Pinball](https://github.com/pinterest/pinball), Spotify's [Luigi](https://github.com/spotify/luigi), or Airbnb's [Airflow](https://airflow.apache.org/). These tools all perform the same basic function: process and execute a directed acyclic graph (DAG) of "work", typically associated with ETL data pipelines.
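
To make that concrete, here is a minimal sketch of what a DAG of "work" looks like in Airflow. The DAG id, schedule, and shell commands are hypothetical placeholders, not anything from our pipeline, and the import paths match the Airflow 1.10 era this post was written in:

```python
# A minimal, hypothetical Airflow DAG: three tasks wired into a
# directed acyclic graph, the basic unit all of these tools execute.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="example_etl",               # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # The DAG's edges: extract must finish before transform, then load.
    extract >> transform >> load
```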

Today, we have 1500+ tasks and 14 DAGs, with the majority of tasks globbed together in one giant DAG containing more than 1400 tasks:

![It's a large DAG](/post-images/2020-02-airflow/dat-dag-tho.png)

Datapipe has served us well and brought the company to where it is today. However, it has been in maintenance mode for some time. As a result, it's struggling to meet the needs of Scribd's fast-growing engineering team. Since [Scribd is moving more and more into the cloud](/blog/2019/migrating-kafka-to-aws.html), we decided that now is a good time to step back and redesign the system for the future.

We need a modernized workflow orchestration system to help drastically improve productivity and unlock the capability to build new product features that were not previously possible.

## Opportunity for improvement

Here are some of the areas where we think improvements would have the biggest impact on the organization:

**Flexibility:** The in-house system can only schedule runs at a granularity of one day, which limits the freshness of our data. To unlock new applications, we need to let engineers define schedules with more flexibility and granularity.
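
For comparison, Airflow accepts any cron expression or timedelta as a schedule, so finer-than-daily granularity becomes a one-line change. A hypothetical sketch:

```python
# Hypothetical: schedules the old system's once-a-day granularity
# could not express.
from datetime import datetime, timedelta

from airflow import DAG

hourly_dag = DAG(
    dag_id="hourly_refresh",            # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 * * * *",      # cron: top of every hour
)

# timedelta schedules work too, e.g. every 15 minutes:
frequent_dag = DAG(
    dag_id="frequent_refresh",          # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval=timedelta(minutes=15),
)
```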

**Productivity:** Culturally, we would like to shift from mono-repo to multi-repo. Needless to say, putting all the workflow definitions in a single file is not scalable. Our workflow config today already contains 6000 lines of code and is still growing. By building tooling to support a multi-repo setup, we hope to reduce coupling and speed up development cycles.
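
To illustrate the general idea (this is not our actual implementation), a small loader module placed in the Airflow DAGs folder can import DAG definitions from per-repo checkouts and re-export them for the scheduler to discover. The `/opt/dag-repos` layout below is purely an assumption:

```python
# Hypothetical sketch: discover DAG files across multiple repo
# checkouts and surface their DAG objects to the Airflow scheduler.
import importlib.util
from pathlib import Path

from airflow import DAG

REPOS_ROOT = Path("/opt/dag-repos")  # assumed: one subdirectory per repo

for dag_file in REPOS_ROOT.glob("*/dags/*.py"):
    spec = importlib.util.spec_from_file_location(dag_file.stem, str(dag_file))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Re-export every DAG object so the scheduler, which scans this
    # file's globals during DAG discovery, can pick them up.
    for name, value in vars(module).items():
        if isinstance(value, DAG):
            globals()[f"{dag_file.stem}_{name}"] = value
```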

**Ownership:** Today, we have dedicated engineers keeping an eye on nightly runs to notify workflow owners if anything goes wrong. The web UI doesn't support some of the common maintenance actions, like killing a running task. This, combined with the lack of built-in monitoring and alerting support within the orchestration system, means that even if workflow owners want to take full ownership of their tasks, there is no easy way to accomplish it. We need to flip this around and empower workflow owners to take care of their own tasks end to end. This is the only scalable way going forward.
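
Airflow at least provides a building block for this: alerting can be attached to the tasks themselves via failure callbacks, so pages go straight to the owning team instead of a central babysitter. A hedged sketch, where `notify_owner` and its paging logic are placeholders:

```python
# Hypothetical sketch: route task failures to the owning team via an
# on_failure_callback, instead of a human watching nightly runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def notify_owner(context):
    """Called by Airflow when a task fails; `context` carries run details."""
    task_id = context["task_instance"].task_id
    # Placeholder for a real integration owned by the workflow's team,
    # e.g. a Pagerduty or Slack call.
    print(f"ALERT: task {task_id} failed, paging owning team")


with DAG(
    dag_id="owned_workflow",          # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="exit 1",        # fails on purpose to show the hook
        on_failure_callback=notify_owner,
    )
```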

**Scalability and availability:** The orchestration system should be able to handle the scale of our data pipeline for many years to come. It should also be highly available and function without issue when a minority of the cluster goes down.

**Operability:** Minor failures in the pipeline should not impact the rest of the pipeline. Recovering failed tasks should be easy and fully self-service.
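
Airflow's per-task retry settings cover one piece of that self-service recovery story; a small sketch with hypothetical values:

```python
# Hypothetical sketch: transient failures retry themselves, and the
# same settings apply to every task in the DAG via default_args.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "retries": 3,                          # retry transient failures
    "retry_delay": timedelta(minutes=5),   # back off between attempts
}

with DAG(
    dag_id="self_healing_etl",             # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    flaky_step = BashOperator(task_id="flaky_step", bash_command="echo ok")
```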

**Extensibility:** It's not surprising that after many years of development, the in-house system comes with many unique and useful features, like cross-date dependencies. It should be easy to develop and maintain custom features for the new system.
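
Cross-date dependencies are a good example: one way to approximate them in stock Airflow is an `ExternalTaskSensor` whose `execution_delta` points one schedule interval back. The DAG and task names below are hypothetical:

```python
# Hypothetical sketch of a cross-date dependency: today's run waits
# on yesterday's run of an upstream task before proceeding.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG(
    dag_id="daily_report",                     # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    # Wait for the "aggregate" task in *yesterday's* run of "daily_etl".
    wait_for_yesterday = ExternalTaskSensor(
        task_id="wait_for_yesterday",
        external_dag_id="daily_etl",           # hypothetical upstream DAG
        external_task_id="aggregate",
        execution_delta=timedelta(days=1),     # one schedule interval back
    )
    build_report = BashOperator(task_id="build_report",
                                bash_command="echo report")

    wait_for_yesterday >> build_report
```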

**Cloud native:** As we migrate our infrastructure from a datacenter to the cloud, the new system will need to run smoothly in the cloud and integrate nicely with various SaaS offerings like Datadog, Pagerduty, and Sentry.

We basically had two options: retrofit Datapipe, or pick a well-maintained open source project as the building block. After lots of prototyping and careful evaluation, we decided to adopt [Apache Airflow](https://airflow.apache.org).

## Super-charging Airflow

I wish adopting Airflow were as simple as doing a `pip install` and pointing the config at an RDS endpoint. It turns out we had to do a lot of preparation work to make it meet all our requirements. Just to name a few:

* Implement a scalable and highly available setup leveraging both ECS and EKS
* Write tooling to support defining DAGs in multiple repositories
* Scale Airflow to handle one of our gigantic DAGs
* Create custom Airflow plugins to replicate some of the unique features from the in-house system
* Build a DAG delivery pipeline with a focus on speed and separation of environments
* Monitor Airflow itself as well as DAGs and tasks with Datadog, Pagerduty, and Sentry
* Execute a multi-stage workflow migration from the in-house system

Each one of the above items warrants a blog post of its own. We will be sharing what we have learned in more detail throughout this series of blog posts.

At Scribd, we embrace open source and try to contribute back to the community as much as we can. Since the start of this internal project, we have contributed [more than 20 patches upstream](https://github.com/apache/airflow/pulls?utf8=%E2%9C%93&q=is%3Apr+author%3Ahouqp) to Airflow, including EKS support, Pagerduty hooks, and many bug fixes and performance improvements. We hope to continue this trend and contribute more as the project progresses.

If this sounds interesting to you, the Core Platform team is hiring!

Come join us if you love building scalable data/ML platforms using open source technologies. :)
post-images/2020-02-airflow/dat-dag-tho.png

810 KB (binary file added)
