Commit 3e08fc5

Merge pull request scribd#34 from scribd/data-eng
Introduce Data Engineering
2 parents 4a6cc37 + 627e873 commit 3e08fc5

File tree

2 files changed: +157 −0

_data/teams.yml

Lines changed: 5 additions & 0 deletions

```diff
@@ -15,6 +15,11 @@ Data Science:
 Core Platform:
   lever: 'Core Platform'
 
+Data Engineering:
+  # No clue why these jobs are grouped with Core Platform in Lever, but not
+  # really important to fix at the moment
+  lever: 'Core Platform'
+
 Core Infrastructure:
   lever: 'Core Infrastructure'
 
```

_posts/2019-12-23-data-eng-in-2020.md

Lines changed: 152 additions & 0 deletions

---
layout: post
title: "Growing Data Engineering into 2020"
author: rtyler
tags:
- aws
- spark
- featured
team: Data Engineering
---

As data sets grow and the needs of the business change, ingesting, transforming, and combining data becomes an area of focus unto itself. Data is cheap; understanding it is expensive. Data Engineering helps build that understanding.

The Data Engineering team at Scribd delivers text, analytics, and behavioral datasets almost "as a product" for internal customers, helping them build upon and understand the datasets that enable business and product operations. Our customers include the Product team, Business Analytics, Finance, and effectively the entire Engineering organization.

The history of growth for Scribd has been incremental and organic. For our data this has meant steady growth which can be deceptive. Organizations with rapid growth quickly see where data pipelines break down. More organic growth allows data pipelines to continue happily operating until they reach a tipping point where the amount of data exceeds the capacity or design of the pipeline.

To me the exciting things about Data Engineering are the size of the datasets and the incredible potential our work has to impact the future of Scribd.

In order to do that, we have to change and **grow**.

## Scaling Data Engineering

At the beginning of 2019 we did _not_ have a Data Engineering team; the need was not yet understood. When we started talking about the ambitions for the Data Science teams, the plans for the Business Analytics team, and what product initiatives we wanted to accomplish in 2020, the need became abundantly clear. We need Data Engineering to build the foundation for data usage across the company.

We have already hired some talented Data Engineers, but we need more talented people to help bring us into the cloud and enable new ideas we haven't even had yet! Scaling a team is one thing, but that is not the only way in which we need to scale.

## Scaling our Approach

We are currently moving out of an on-premise managed data center into AWS. There's plenty of excitement in the teams that build Scribd.com and the services behind it. For the Data Science, Machine Learning, and Data Engineering teams at Scribd, AWS represents incredible potential. We have long been limited in the questions we can ask of our data by the fixed footprint of data center-based infrastructure. As our datasets move into S3 and our compute workloads move into [Databricks](https://databricks.com), we're already starting to identify new and interesting ways to ingest and examine the data in order to make Scribd more useful for our readers.

A cloud-native data platform can simplify much of this, but during the "zero-downtime" migration we are facing a number of interesting challenges:

* Running workloads between multiple data centers, while sharing data between them, requires more sophistication in how Data Engineering manages our catalog, which itself will soon grow into multiple catalogs.
* Wasteful workloads are less noticeable in an on-premise environment. A job which inadvertently generates numerous small files, or a Spark job which performs excessive shuffle reads, becomes more problematic in a usage-based pricing model. Time starts to matter _more_.
* Technology skew between on-premise and the cloud forces a little more forethought. We can run the same versions of Spark in both places, but our on-premise and cloud-based vendors are necessarily different. Using both temporarily increases our management complexity.
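
The small-files concern above can be sketched as a toy compaction planner. This is purely illustrative Python, not our pipeline code; the file names and the 128 MB target size are assumptions:

```python
# Illustrative sketch: greedily group many small files into batches close
# to a target size, the idea behind compacting small-file output before
# it inflates costs under usage-based pricing.
TARGET_BYTES = 128 * 1024 * 1024  # hypothetical target output file size

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group (name, size) pairs into batches whose total size stays at or
    under `target`; each batch would be rewritten as one larger file."""
    batches, current, current_size = [], [], 0
    # Largest-first makes the greedy packing tighter.
    for name, size in sorted(file_sizes, key=lambda f: -f[1]):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# A thousand 1 MB files collapse into a handful of compaction batches.
tiny = [(f"part-{i:05}.parquet", 1024 * 1024) for i in range(1000)]
plan = plan_compaction(tiny)
```

In a real Spark job the usual fix is a `repartition()` or `coalesce()` before the write, but the planner above makes the cost model easy to see: fewer, larger files mean fewer tasks and fewer object-store requests.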

The size of our datasets makes this project much more fun to work on. Migrating a data warehouse of a few hundred gigabytes without any downtime to its customers wouldn't be that hard. Doing the same migration with multiple petabytes of data across thousands of discrete tables, processed by hundreds of automated jobs, is a _very_ different ballgame.

Moving into a cloud-native environment also enables a number of new approaches and opportunities which by themselves will help us scale our data engineering practices. A unified data platform for business analytics, data science, and machine learning in the cloud can take advantage of completely different instance types for more optimized workloads. Easily spinning up and tearing down environments for exploration of new and different tools enables teams to use the tool that fits, as opposed to the tool that everybody else uses. This shift, coupled with the broader changes to the cloud and our architecture, also opens the door to more stream processing of data rather than massive periodic batches.
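
The batch-versus-stream distinction can be sketched in a few lines of Python. This is a hedged illustration only; our real workloads would use Spark rather than plain Python, and the `pages_read` events are hypothetical:

```python
# Illustrative only: the same aggregation computed as one periodic batch
# versus incrementally as events arrive (the `pages_read` field is made up).
def batch_total(events):
    # Batch: wait until the whole dataset exists, then compute once.
    return sum(e["pages_read"] for e in events)

def stream_totals(events):
    # Stream: fold each event in as it arrives, yielding a running
    # total instead of one answer per batch window.
    total = 0
    for e in events:
        total += e["pages_read"]
        yield total

events = [{"pages_read": n} for n in (3, 5, 2)]
running = list(stream_totals(events))  # [3, 8, 10]
final = batch_total(events)            # 10
```

The streaming version always has an answer available, at the cost of carrying state between events; that trade-off is the heart of moving from massive periodic batches to stream processing.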

"The cloud" is not without its shortfalls, but across Data Engineering we're already seeing tremendous improvements from our cloud-centric approach.

## Scaling our Quality

Supporting numerous parts of the business means that Data Engineering has to increase the quality of the datasets consumed. Some datasets have grown in their utility over time; where they were once "nice to have," they are now critical to certain business functions. This mandates that the pipelines which produce those datasets be treated just like the production software deployed to scribd.com.

I often think of the quote by [Charles Babbage](https://en.wikipedia.org/wiki/Charles_Babbage):

> On two occasions I have been asked,
>
> 'Pray, Mr. Babbage, if you put into the
> machine wrong figures, will the right answers come out?'
>
> I am not able rightly to apprehend the kind of confusion of ideas that could
> provoke such a question.

Data quality is a concern that anybody in the Data Engineering space is familiar with. For Scribd I think of "quality" along two axes:

* Integrity: is each record within this set formed the way the customer expects it, or in adherence with a predefined schema?
* Lineage: is the pipeline of this dataset clear, monitored, and functioning properly to ensure my job continues to receive good inputs? Additionally, understanding when a pipeline contains personally-identifiable information, or other sensitive information which must be handled with extra care in order to safeguard our readers' privacy.
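
The integrity axis above can be sketched as a record-level schema check. This is a minimal illustration with a hypothetical three-field schema; a real pipeline would lean on Spark schema enforcement or a dedicated validation library rather than hand-rolled checks:

```python
# Hypothetical schema for illustration; not an actual Scribd dataset.
SCHEMA = {"user_id": int, "document_id": int, "pages_read": int}

def check_integrity(record, schema=SCHEMA):
    """Return a list of integrity problems; an empty list means the
    record adheres to the predefined schema."""
    problems = [f"missing field: {f}" for f in schema if f not in record]
    problems += [
        f"bad type for {f}: expected {t.__name__}"
        for f, t in schema.items()
        if f in record and not isinstance(record[f], t)
    ]
    return problems

good = {"user_id": 1, "document_id": 42, "pages_read": 7}
bad = {"user_id": "1", "document_id": 42}  # wrong type, missing field
```

Checks like this pay off most at the pipeline boundary: rejecting or quarantining malformed records on ingest keeps downstream jobs receiving good inputs, which is exactly what the lineage axis monitors for.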

Unfortunately, data quality is an area where I think we need to make substantial improvements. Data was at one time treated as a by-product of production systems. Now it is rightfully recognized as business-critical, and our practices must rise to meet the challenge.

AWS does not offer us any silver bullet to help scale our quality. Fortunately, Scribd leadership recognizes the importance of both data and Data Engineering, so I'm confident we will be able to finish 2020 in a much better position.

---

When Scribd first started, "big data" was just coming into vogue. As the tools and practices available for working with data have changed, so too has Scribd. Our datasets are larger than they may appear from the outside: analytics from billions of requests each year combined with hundreds of millions of text documents are challenging to manage. These hefty datasets are also a challenge to make available, insightful, and of high quality. Data by itself tells us nothing, but well-managed data pipelines that allow us to identify characteristics of text documents, or content which is interesting to read, are incredibly valuable to Scribd. Data Engineering helps us understand our data, which helps Scribd build products that deliver great reads to the world.
