---
layout: post
title: "Migrating Kafka to the Cloud"
author: christianw
tags:
- kafka
- msk
- featured
- msk-series
team: Core Platform
---

[Apache Kafka](https://kafka.apache.org) has been part of Scribd's backend
architecture for a long time, but only recently has it become a cornerstone in
the exciting future we imagine for ourselves. Persisting messages to topics and
brokering them between producers and consumers may seem like a dull job.
However, when envisioning an event-driven, service-oriented architecture and
numerous stream processing applications, the role Kafka plays becomes exciting,
empowering, and mission-critical. To realize this future, we had to
move from our deprecated Kafka deployment to a newer, well-managed Apache
Kafka environment in the cloud.

When I joined Scribd earlier this year, Kafka existed, and that's about it. We
weren't using it to its potential. Part of the "Kafka avoidance" syndrome
stemmed from the operational difficulties of _just_ running the thing.
It was almost like we were afraid to touch Kafka for fear it might fall over.
Another part of that avoidance grew out of the functionality not matching
developers' expectations. When we first adopted Kafka,
ours was an on-premise deployment of version **0.10**. Developers used it for a
few projects, unexpected things occasionally happened that were difficult to
"fix," and we started avoiding it for new projects.

As we considered what we needed to build the [Real-time Data
Platform](/blog/2019/real-time-data-platform.html), Apache Kafka was an obvious
choice. Across the industry, companies are building and deploying fantastic
streaming applications with Kafka, and we were confident that it was the right
path forward; we only needed to dispel some of the emotional baggage around our
legacy cluster.

This is the first in a series of posts where we will describe the steps we took
at Scribd to migrate our existing on-premise workloads to [Managed
Streaming for Apache Kafka](https://aws.amazon.com/msk/) (MSK) in AWS. In this introductory
article, I will focus on the initial work we did to quantify our existing
workloads and set up our evaluation.

## Kicking the Tires

Reducing operational complexity and costs meant we needed to evaluate managed
Kafka cloud and SaaS providers, contrasting the pros and cons of each offering.
Price certainly leaps to mind as one of the comparisons to make, but since we
were focused on features and architecture, we deferred all price comparisons.
I believe that this is actually the only way to do a valid comparison between
cloud providers, since managed service providers do not all offer the same
features.

We focused on a few questions to anchor the evaluation:

* **How will the platform handle our existing workloads?** Naturally we wanted to
  migrate our existing producers and consumers to the new service, so they
  needed to integrate smoothly.
* **How will we grow new workloads in the platform?** Our pre-existing
  workloads are only the start; we have high ambitions for what our Kafka-based
  future looks like, and the provider would need to grow with our needs.
* **How well can we secure and manage it?** Our
  legacy on-premise deployment was wide-open internally 😱. Client applications
  did not need to authenticate to write messages to a specific topic. They only
  needed network access. In our new cluster, we wanted to make sure each client
  was constrained by ACLs (Access Control Lists).

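To make that last requirement concrete, the constraint we wanted can be modeled in a few lines of plain Python. This is a sketch of the ACL concept, not an actual Kafka API, and the principal and topic names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    principal: str  # e.g. a TLS client identity like "User:CN=events-producer"
    operation: str  # "READ" or "WRITE"
    topic: str


class AclAuthorizer:
    """A principal may act on a topic only when a matching allow rule exists."""

    def __init__(self, rules):
        self.rules = set(rules)

    def allowed(self, principal: str, operation: str, topic: str) -> bool:
        return AclRule(principal, operation, topic) in self.rules


# Hypothetical rules: the producer may only write, the consumer may only read.
acls = AclAuthorizer([
    AclRule("User:CN=events-producer", "WRITE", "events"),
    AclRule("User:CN=analytics-consumer", "READ", "events"),
])
```

With our legacy cluster, every check like this effectively returned `True` for anyone on the network; the new cluster had to say `False` by default.
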
### Prototype Approach

To get some hands-on experience with each platform, we implemented a prototype
to represent one of our larger analytics workloads. Not only was it a big data
stream to work with, but it would also benefit from extensions we could imagine
for the future, such as message debatching, routing, and validation streams. The
prototype included "event players" to send fake messages at a rate similar to
what we actually receive in production, as well as mock downstream
applications which applied additional stream processing duties.

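A minimal sketch of such an "event player" in plain Python. The real prototype used Kafka producers; here `send` is any callable, and the message shape is made up for illustration:

```python
import itertools
import json
import random
import time


def event_player(rate_per_sec: float, duration_sec: float, send) -> int:
    """Emit fake JSON messages at roughly rate_per_sec until duration_sec elapses."""
    interval = 1.0 / rate_per_sec
    deadline = time.monotonic() + duration_sec
    sent = 0
    for seq in itertools.count():
        if time.monotonic() >= deadline:
            break
        # A made-up message shape standing in for a production event.
        send(json.dumps({"seq": seq, "user_id": random.randint(1, 10_000)}))
        sent += 1
        time.sleep(interval)
    return sent
```

In the real prototype, `send` wrapped a Kafka producer pointed at the ingress topic, and the rate was tuned to match production traffic.
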
Each "add-on" extension to the existing stream was implemented as a
[KStreams](https://kafka.apache.org/10/documentation/streams/developer-guide/)
application, reading from the ingress stream and writing to a new downstream
topic. The end-to-end pipeline was realized by this series of KStreams
applications, enriching or directing the original message in some way before
writing it to a topic. From a consumer perspective, this flexibility in the
pipeline was very important to us. Consumers who only want validated data
could subscribe to the "validated" topic to get messages in real-time, whereas
consumers who wanted the firehose could hook up to the ingress topic.

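The chain of transforms can be sketched as ordinary functions to show the shape of the pipeline. This is plain Python rather than KStreams, and the field and topic names are hypothetical:

```python
import json


def debatch(record: str) -> list:
    """Split one batched ingress record into individual events."""
    return json.loads(record)["events"]


def validate(event: dict) -> bool:
    """Keep only events carrying the fields downstream consumers rely on."""
    return {"type", "payload"} <= event.keys()


def route(event: dict) -> str:
    """Choose a downstream topic from the event type (topic names hypothetical)."""
    return f"validated.{event['type']}"


# One batched record flowing through the chain, as a KStreams app would
# read from the ingress topic and write to a downstream topic.
record = json.dumps({"events": [
    {"type": "page_view", "payload": {"doc": 42}},
    {"payload": {"doc": 7}},  # missing "type": dropped by validation
]})
routed = [(route(e), e) for e in debatch(record) if validate(e)]
```

Each function here stands in for one small KStreams application; chaining them through intermediate topics is what gives consumers the choice between the firehose and the cleaned-up streams.
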
The final version of our prototype included enough transforms to magnify the
original workload traffic by about 4X. This provided us with useful information
for capacity planning and vendor evaluation, since it gave us a way to see how
the enrichments we desired in our pipeline might impact the overall load and
storage demands on the cluster. It also gave us some perspective on the
complexity of operating multiple streaming applications to deliver higher
quality data in real-time.

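The capacity math behind that magnification is simple but worth making explicit: each derived topic re-writes roughly the full ingress volume. With hypothetical numbers of 10 MB/s ingress and three derived topics, broker write load lands at about 4X:

```python
def amplified_write_load(ingress_mb_per_s: float, derived_topics: int) -> float:
    """Total broker write load: the ingress topic plus one full copy per derived topic."""
    return ingress_mb_per_s * (1 + derived_topics)


print(amplified_write_load(10, 3))  # 40, i.e. a 4X magnification of ingress
```

Replication factor and retention multiply this further, which is why the prototype's amplified numbers mattered for sizing the candidate clusters.
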
### Vendor Evaluation

Determining the vendor for the future of your streaming data platform is no
light-hearted decision! For each vendor, we created the topics and ACLs necessary
to support our prototype pipeline, and then we ran the prototype
workload against a cluster created on each vendor's platform.

While configuring ACLs and Kafka client properties, "authentication mode support"
stood out as one of the big differences between AWS MSK and other providers.
AWS MSK
**only** supports TLS authentication using client certificates and [AWS PCAs](https://docs.aws.amazon.com/acm-pca/latest/userguide/PcaWelcome.html)
(Private Certificate Authorities).
TLS authentication is a bit painful at the outset, but as we got more
comfortable it became less problematic. The cluster-side setup for TLS authentication
is _much_ easier with AWS MSK than attempting to implement client certificates
in a traditional on-premise Kafka deployment. A separate post in this series
will go into more detail related to our implementation of Kafka TLS
authentication.

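For a sense of what the client side looks like, a Kafka client authenticating over mutual TLS needs properties along these lines (the paths and passwords here are placeholders):

```properties
security.protocol=SSL
ssl.truststore.location=/path/to/kafka.client.truststore.jks
ssl.truststore.password=changeit
ssl.keystore.location=/path/to/kafka.client.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
```

The keystore holds the client certificate issued by the private CA, and the broker maps the certificate's distinguished name to a principal for ACL checks.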
Another major consideration for us has been _monitoring_. AWS MSK turned out
to be a much better fit for us than others, since we were able to pull metrics
directly from MSK clusters into [Datadog](https://datadoghq.com). This allowed
us to view MSK metrics together with our other operational metrics. Datadog's
own [MSK integration](https://docs.datadoghq.com/integrations/amazon_msk/) made
the integration little more than a couple of button clicks.

---

Feeling confident that we knew how to secure, manage, and grow our Kafka
environments across a number of available offerings, we ultimately settled on
AWS MSK to power the streaming platform of Scribd's future.

I look forward to sharing more about our Kafka migration into the cloud in
subsequent posts in this series. We'll dive deeper into TLS authentication with
MSK, discuss how we prepared our existing producers and consumers for migration
from our old 0.10 Kafka to a _much_ newer version (2.2.1), and how we ended up
implementing the gradual rollover to the new cluster with no data loss and no
downtime!