|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "Growing Data Engineering into 2020" |
| 4 | +author: rtyler |
| 5 | +tags: |
| 6 | +- aws |
| 7 | +- spark |
| 8 | +- featured |
| 9 | +team: Data Engineering |
| 10 | +--- |
| 11 | + |
| 12 | + |
| 13 | +As data sets grow and the needs of the business change, ingesting, |
| 14 | +transforming, and combining data becomes an area of focus unto itself. |
| 15 | +Data is cheap, understanding it is expensive: Data Engineering helps build that |
| 16 | +understanding. |
| 17 | + |
| 18 | +The Data Engineering team at Scribd delivers text, analytics, and behavioral |
| 19 | +datasets almost "as a product" for internal customers, helping them build upon |
| 20 | +and understand the datasets that enable business and product operations. Our |
| 21 | +customers include the Product team, Business Analytics, Finance, and |
| 22 | +effectively the entire Engineering organization. |
| 23 | + |
| 24 | +The history of growth for Scribd has been incremental and organic. For our |
| 25 | +data this has meant steady growth which can be deceptive. Organizations with |
| 26 | +rapid growth quickly see where data pipelines break down. More organic growth |
| 27 | +allows data pipelines to continue happily operating until they reach a tipping |
| 28 | +point where the amount of data exceeds the capacity or design of the pipeline. |
| 29 | + |
| 30 | +To me the exciting things about Data Engineering are the size of the datasets |
| 31 | +and the incredible potential our work has to impact the future of Scribd. |
| 32 | + |
| 33 | +In order to do that, we have to change and **grow**. |
| 34 | + |
| 35 | +## Scaling Data Engineering |
| 36 | + |
| 37 | +At the beginning of 2019 we did _not_ have a Data Engineering team; the need |
| 38 | +was not yet understood. When we started talking about the ambitions for the |
| 39 | +Data Science teams, the plans for the Business Analytics team, and what product |
| 40 | +initiatives we wanted to accomplish in 2020, the need became abundantly clear. We need |
| 41 | +Data Engineering to build the foundation for data usage across the company. |
| 42 | + |
| 43 | +We have already hired some talented Data Engineers, but we need more talented |
| 44 | +people to help bring us into the cloud and enable new ideas we haven't even had |
| 45 | +yet! Scaling a team is one thing, but that is not the only way in which we need to scale. |
| 46 | + |
| 47 | + |
| 48 | +## Scaling our Approach |
| 49 | + |
| 50 | +We are currently moving out of an on-premise managed data center into AWS. |
| 51 | +There's plenty of excitement in the teams that build Scribd.com and the |
| 52 | +services behind it. For the Data Science, Machine Learning, and Data |
| 53 | +Engineering teams at Scribd, AWS represents incredible potential. We have long |
| 54 | +been limited in the questions we can ask of our data by the fixed footprint of |
| 55 | +data center-based infrastructure. As our datasets move into S3 and our compute |
| 56 | +workloads move into [Databricks](https://databricks.com), we're already starting to identify new and |
| 57 | +interesting ways to ingest and examine the data in order to make Scribd more |
| 58 | +useful for our readers. |
| 59 | + |
| 60 | + |
| 61 | +A cloud-native data platform can simplify a great deal, but during the "zero-downtime" |
| 62 | +migration we are facing a number of interesting challenges. |
| 63 | + |
| 64 | +* Running workloads between multiple data centers, while sharing data between |
| 65 | + them, requires more sophistication in how Data Engineering manages our |
| 66 | + catalog, which itself will soon grow into multiple catalogues. |
| 67 | +* Wasteful workloads are less noticeable in an on-premise environment. A job |
| 68 | + which inadvertently generates numerous small files, or a Spark job which |
| 69 | + performs excessive shuffle reads, becomes more problematic in a usage-based |
| 70 | + pricing model. Time starts to matter _more_. |
| 71 | +* Technology skew between on-premise and the cloud forces a little more |
| 72 | + forethought. We can run the same versions of Spark in both places but our |
| 73 | + on-premise and cloud-based vendors are necessarily different. Using both |
| 74 | + temporarily increases our management complexity. |
| 75 | + |
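The small-files concern above can be made concrete with a minimal sketch. It assumes we already have a listing of a job's output files and their sizes; the function name `report_small_files`, the `SmallFileReport` type, and the 128 MiB threshold are all hypothetical and chosen only for illustration, not taken from Scribd's tooling.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: 128 MiB is only an illustrative cut-off, not a recommended setting.
SMALL_FILE_BYTES = 128 * 1024 * 1024


@dataclass
class SmallFileReport:
    """Summary of how many output files fall below the size threshold."""
    total_files: int
    small_files: List[str]

    @property
    def small_ratio(self) -> float:
        # Guard against an empty listing so we never divide by zero.
        if self.total_files == 0:
            return 0.0
        return len(self.small_files) / self.total_files


def report_small_files(file_sizes: Dict[str, int],
                       threshold: int = SMALL_FILE_BYTES) -> SmallFileReport:
    """Flag files strictly smaller than `threshold` bytes.

    `file_sizes` maps a file path to its size in bytes (hypothetical input,
    for example gathered from an object-store listing).
    """
    small = sorted(path for path, size in file_sizes.items() if size < threshold)
    return SmallFileReport(total_files=len(file_sizes), small_files=small)
```

A job whose report shows a high `small_ratio` might be a candidate for compaction before it starts to cost real money under usage-based pricing.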
| 76 | + |
| 77 | +The size of our datasets makes this project much more fun to work on. Migrating |
| 78 | +a data warehouse of a few hundred gigabytes without any downtime to its |
| 79 | +customers wouldn't be that hard. Doing the same migration with multiple |
| 80 | +petabytes of data across thousands of discrete tables, processed by hundreds of |
| 81 | +automated jobs is a _very_ different ballgame. |
| 82 | + |
| 83 | + |
| 84 | +Moving into a cloud-native environment also enables a number of new approaches |
| 85 | +and opportunities which by themselves will help us scale our data engineering |
| 86 | +practices. A unified data platform for business analytics, data science, and |
| 87 | +machine learning in the cloud can take advantage of completely different |
| 88 | +instance types for more optimized workloads. Easily spinning up and tearing |
| 89 | +down environments for exploration of new and different tools enables teams to |
| 90 | +use the tool that fits, as opposed to the tool that everybody else uses. This |
| 91 | +shift, coupled with a broader shift to the cloud and in our architecture, will |
| 92 | +also open the door to more stream processing of data rather than massive |
| 93 | +periodic batches. |
| 94 | + |
| 95 | + |
| 96 | +"The cloud" is not without its shortfalls, but across Data Engineering we're |
| 97 | +already seeing tremendous improvements from our cloud-centric approach. |
| 98 | + |
| 99 | + |
| 100 | +## Scaling our Quality |
| 101 | + |
| 102 | +Supporting numerous parts of the business means that Data Engineering has to |
| 103 | +increase the quality of the datasets consumed. Some datasets have grown over |
| 104 | +time in their utility: where they were once "nice to have", they are now |
| 105 | +critical to certain business functions. This mandates that the pipelines which |
| 106 | +produce those datasets be treated just like the production software deployed to |
| 107 | +scribd.com. |
| 108 | + |
| 109 | +I often think of the quote by [Charles Babbage](https://en.wikipedia.org/wiki/Charles_Babbage): |
| 110 | + |
| 111 | +> On two occasions I have been asked, |
| 112 | +> |
| 113 | +> 'Pray, Mr. Babbage, if you put into the |
| 114 | +> machine wrong figures, will the right answers come out?' |
| 115 | +> |
| 116 | +> I am not able rightly to apprehend the kind of confusion of ideas that could |
| 117 | +> provoke such a question. |
| 118 | + |
| 119 | +Data quality is a concern with which anybody in the Data Engineering space is |
| 120 | +familiar. For Scribd I think of "quality" on two axes: |
| 121 | + |
| 122 | +* Integrity: is each record within this set formed the way the customer |
| 123 | + expects it, or in adherence with a predefined schema? |
| 124 | +* Lineage: is the pipeline of this dataset clear, monitored, and functioning |
| 125 | + properly to ensure my job continues to receive good inputs? Additionally, |
| 126 | + do we understand when a pipeline contains personally-identifiable information, or |
| 127 | + other sensitive information which needs extra care in order to |
| 128 | + safeguard our readers' privacy? |
| 129 | + |
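As a rough illustration of the integrity axis, the sketch below checks a single record against a predefined schema. The schema, its field names, and the `integrity_errors` helper are hypothetical examples rather than Scribd's actual definitions, and a real pipeline would more likely lean on a schema-aware format or validation library.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> required Python type.
SCHEMA: Dict[str, type] = {"doc_id": int, "title": str, "language": str}


def integrity_errors(record: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record conforms."""
    errors: List[str] = []
    # Every field in the schema must be present with the expected type.
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields the schema does not know about are reported as well.
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors
```

Returning a list of problems rather than raising on the first one is a deliberate choice in this sketch: it lets a job count and report every kind of violation in a batch before deciding whether to fail.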
| 130 | +Unfortunately data quality is an area where I think we need to make substantial |
| 131 | +improvements. Data was at one time treated as a by-product of production |
| 132 | +systems. Now it is rightfully recognized as business-critical, and our |
| 133 | +practices must rise to meet the challenge. |
| 134 | + |
| 135 | +AWS does not offer us any silver bullet to help scale our quality. Fortunately, |
| 136 | +however, Scribd leadership recognizes the importance of both data and Data |
| 137 | +Engineering, so I'm confident we will be able to finish 2020 in a much better |
| 138 | +position. |
| 139 | + |
| 140 | + |
| 141 | +--- |
| 142 | + |
| 143 | +When Scribd first started, "big data" was just coming into vogue. As |
| 144 | +the tools and practices available for working with data have changed, so too |
| 145 | +has Scribd. Our datasets are larger than they may appear from the outside: |
| 146 | +analytics from billions of requests each year combined with hundreds of millions of |
| 147 | +text documents are challenging to manage. These hefty datasets are also a challenge to make |
| 148 | +available, insightful, and of high quality. Data by itself tells us |
| 149 | +nothing, but well-managed data pipelines that allow us to identify characteristics |
| 150 | +of text documents, or content which is interesting to read, are incredibly |
| 151 | +valuable to Scribd. Data Engineering helps us understand our data, which helps |
| 152 | +Scribd build products which deliver great reads to the world. |