Commit 3e08fc5

Merge pull request scribd#34 from scribd/data-eng
Introduce Data Engineering
2 parents 4a6cc37 + 627e873 commit 3e08fc5

File tree

2 files changed: +157 −0

_data/teams.yml

Lines changed: 5 additions & 0 deletions

```diff
@@ -15,6 +15,11 @@ Data Science:
 Core Platform:
   lever: 'Core Platform'
 
+Data Engineering:
+  # No clue why these jobs are grouped with Core Platform in Lever, but not
+  # really important to fix at the moment
+  lever: 'Core Platform'
+
 Core Infrastructure:
   lever: 'Core Infrastructure'
 
```

_posts/2019-12-23-data-eng-in-2020.md

Lines changed: 152 additions & 0 deletions

---
layout: post
title: "Growing Data Engineering into 2020"
author: rtyler
tags:
- aws
- spark
- featured
team: Data Engineering
---

As data sets grow and the needs of the business change, ingesting, transforming, and combining data becomes an area of focus unto itself. Data is cheap; understanding it is expensive. Data Engineering helps build that understanding.

The Data Engineering team at Scribd delivers text, analytics, and behavioral datasets almost "as a product" for internal customers, helping them build upon and understand the datasets that enable business and product operations. Our customers include the Product team, Business Analytics, Finance, and effectively the entire Engineering organization.

The history of growth for Scribd has been incremental and organic. For our data this has meant steady growth which can be deceptive. Organizations with rapid growth quickly see where data pipelines break down. More organic growth allows data pipelines to continue happily operating until they reach a tipping point where the amount of data exceeds the capacity or design of the pipeline.

To me the exciting things about Data Engineering are the size of the datasets and the incredible potential our work has to impact the future of Scribd.

In order to do that, we have to change and **grow**.

## Scaling Data Engineering

At the beginning of 2019 we did _not_ have a Data Engineering team; the need was not yet understood. When we started talking about the ambitions for the Data Science teams, the plans for the Business Analytics team, and what product initiatives we wanted to accomplish in 2020, the need became abundantly clear. We need Data Engineering to build the foundation for data usage across the company.

We have already hired some talented Data Engineers, but we need more talented people to help bring us into the cloud and enable new ideas we haven't even had yet! Scaling a team is one thing, but that is not the only way in which we need to scale.

## Scaling our Approach

We are currently moving out of an on-premise managed data center into AWS. There's plenty of excitement in the teams that build Scribd.com and the services behind it. For the Data Science, Machine Learning, and Data Engineering teams at Scribd, AWS represents incredible potential. We have long been limited in the questions we can ask of our data by the fixed footprint of data center-based infrastructure. As our datasets move into S3 and our compute workloads move into [Databricks](https://databricks.com), we're already starting to identify new and interesting ways to ingest and examine the data in order to make Scribd more useful for our readers.

A cloud-native data platform can simplify much of this, but during the "zero-downtime" migration we are facing a number of interesting challenges:

* Running workloads between multiple data centers, while sharing data between them, requires more sophistication in how Data Engineering manages our catalog, which itself will soon grow into multiple catalogs.
* Wasteful workloads are less noticeable in an on-premise environment. A job which inadvertently generates numerous small files, or a Spark job which performs excessive shuffle reads, becomes more problematic in a usage-based pricing model. Time starts to matter _more_.
* Technology skew between on-premise and the cloud forces a little more forethought. We can run the same versions of Spark in both places, but our on-premise and cloud-based vendors are necessarily different. Using both temporarily increases our management complexity.
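
The small-files concern above can be sketched as a toy compaction planner. This is purely illustrative Python, not our pipeline code; the file names and the 128 MB target size are assumptions:

```python
# Illustrative sketch: greedily group many small files into batches close
# to a target size, the idea behind compacting small-file output before
# it inflates costs under usage-based pricing.
TARGET_BYTES = 128 * 1024 * 1024  # hypothetical target output file size

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group (name, size) pairs into batches whose total size stays at or
    under `target`; each batch would be rewritten as one larger file."""
    batches, current, current_size = [], [], 0
    # Largest-first makes the greedy packing tighter.
    for name, size in sorted(file_sizes, key=lambda f: -f[1]):
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# A thousand 1 MB files collapse into a handful of compaction batches.
tiny = [(f"part-{i:05}.parquet", 1024 * 1024) for i in range(1000)]
plan = plan_compaction(tiny)
```

In a real Spark job the usual fix is a `repartition()` or `coalesce()` before the write, but the planner above makes the cost model easy to see: fewer, larger files mean fewer tasks and fewer object-store requests.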

The size of our datasets makes this project much more fun to work on. Migrating a data warehouse of a few hundred gigabytes without any downtime to its customers wouldn't be that hard. Doing the same migration with multiple petabytes of data across thousands of discrete tables, processed by hundreds of automated jobs, is a _very_ different ballgame.

Moving into a cloud-native environment also enables a number of new approaches and opportunities which by themselves will help us scale our data engineering practices. A unified data platform for business analytics, data science, and machine learning in the cloud can take advantage of completely different instance types for more optimized workloads. Easily spinning up and tearing down environments for exploration of new and different tools enables teams to use the tool that fits, as opposed to the tool that everybody else uses. This shift, coupled with the broader changes to the cloud and our architecture, also opens the door to more stream processing of data rather than massive periodic batches.
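
The batch-versus-stream distinction can be sketched in a few lines of Python. This is a hedged illustration only; our real workloads would use Spark rather than plain Python, and the `pages_read` events are hypothetical:

```python
# Illustrative only: the same aggregation computed as one periodic batch
# versus incrementally as events arrive (the `pages_read` field is made up).
def batch_total(events):
    # Batch: wait until the whole dataset exists, then compute once.
    return sum(e["pages_read"] for e in events)

def stream_totals(events):
    # Stream: fold each event in as it arrives, yielding a running
    # total instead of one answer per batch window.
    total = 0
    for e in events:
        total += e["pages_read"]
        yield total

events = [{"pages_read": n} for n in (3, 5, 2)]
running = list(stream_totals(events))  # [3, 8, 10]
final = batch_total(events)            # 10
```

The streaming version always has an answer available, at the cost of carrying state between events; that trade-off is the heart of moving from massive periodic batches to stream processing.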

"The cloud" is not without its shortfalls, but across Data Engineering we're already seeing tremendous improvements from our cloud-centric approach.

## Scaling our Quality

Supporting numerous parts of the business means that Data Engineering has to increase the quality of the datasets consumed. Some datasets have grown in their utility over time; where they were once "nice to have," they are now critical to certain business functions. This mandates that the pipelines which produce those datasets be treated just like the production software deployed to scribd.com.

I often think of the quote by [Charles Babbage](https://en.wikipedia.org/wiki/Charles_Babbage):

> On two occasions I have been asked,
>
> 'Pray, Mr. Babbage, if you put into the
> machine wrong figures, will the right answers come out?'
>
> I am not able rightly to apprehend the kind of confusion of ideas that could
> provoke such a question.

Data quality is a concern that anybody in the Data Engineering space is familiar with. For Scribd I think of "quality" along two axes:

* Integrity: is each record within this set formed the way the customer expects it, or in adherence with a predefined schema?
* Lineage: is the pipeline of this dataset clear, monitored, and functioning properly to ensure my job continues to receive good inputs? Additionally, understanding when a pipeline contains personally-identifiable information, or other sensitive information which must be handled with extra care in order to safeguard our readers' privacy.
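
The integrity axis above can be sketched as a record-level schema check. This is a minimal illustration with a hypothetical three-field schema; a real pipeline would lean on Spark schema enforcement or a dedicated validation library rather than hand-rolled checks:

```python
# Hypothetical schema for illustration; not an actual Scribd dataset.
SCHEMA = {"user_id": int, "document_id": int, "pages_read": int}

def check_integrity(record, schema=SCHEMA):
    """Return a list of integrity problems; an empty list means the
    record adheres to the predefined schema."""
    problems = [f"missing field: {f}" for f in schema if f not in record]
    problems += [
        f"bad type for {f}: expected {t.__name__}"
        for f, t in schema.items()
        if f in record and not isinstance(record[f], t)
    ]
    return problems

good = {"user_id": 1, "document_id": 42, "pages_read": 7}
bad = {"user_id": "1", "document_id": 42}  # wrong type, missing field
```

Checks like this pay off most at the pipeline boundary: rejecting or quarantining malformed records on ingest keeps downstream jobs receiving good inputs, which is exactly what the lineage axis monitors for.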

Unfortunately, data quality is an area where I think we need to make substantial improvements. Data was at one time treated as a by-product of production systems. Now it is rightfully recognized as business-critical, and our practices must rise to meet the challenge.

AWS does not offer us any silver bullet to help scale our quality. Fortunately, Scribd leadership recognizes the importance of both data and Data Engineering, so I'm confident we will be able to finish 2020 in a much better position.

---

When Scribd first started, "big data" was just coming into vogue. As the tools and practices available for working with data have changed, so too has Scribd. Our datasets are larger than they may appear from the outside: analytics from billions of requests each year combined with hundreds of millions of text documents are challenging to manage. These hefty datasets are also a challenge to make available, insightful, and of high quality. Data by itself tells us nothing, but well-managed data pipelines that allow us to identify characteristics of text documents, or content which is interesting to read, are incredibly valuable to Scribd. Data Engineering helps us understand our data, which helps Scribd build products that deliver great reads to the world.
