
Commit 2b09f1d

Start the Kafka migration series by Christian
This has already been copy-edited and proofed from the internal wiki version
1 parent 855152a commit 2b09f1d

File tree

2 files changed: +138 −0 lines


_data/authors.yml (4 additions, 0 deletions)

@@ -58,3 +58,7 @@ hamiltonh:
 harinii:
   name: Harini Iyer
+
+christianw:
+  name: Christian Williams
+  github: xianwill

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
layout: post
title: "Migrating Kafka to the Cloud"
author: christianw
tags:
- kafka
- msk
- featured
- msk-series
team: Core Platform
---

[Apache Kafka](https://kafka.apache.org) has been part of Scribd's backend
architecture for a long time, but only recently has it become a cornerstone in
the exciting future we imagine for ourselves. Persisting messages to topics and
brokering them between producers and consumers may seem like a dull job.
However, when envisioning an event-driven, service-oriented architecture and
numerous stream processing applications, the role Kafka plays becomes exciting,
empowering, and mission-critical. To realize this future, we had to move from
our deprecated Kafka deployment to a newer, well-managed Apache Kafka
environment in the cloud.

When I joined Scribd earlier this year, Kafka existed, and that's about it. We
weren't using it to its potential. Part of the "Kafka avoidance" syndrome
stemmed from the operational difficulties of _just_ running the thing. It was
almost like we were afraid to touch Kafka for fear it might fall over. Another
part of that avoidance grew out of the functionality not matching developers'
expectations. When we first adopted Kafka, ours was an on-premise deployment of
version **0.10**. Developers used it for a few projects, unexpected things
occasionally happened that were difficult to "fix", and we started avoiding it
for new projects.

As we considered what we needed to build the [Real-time Data
Platform](/blog/2019/real-time-data-platform.html), Apache Kafka was an obvious
choice. Across the industry, companies are building and deploying fantastic
streaming applications with Kafka, and we were confident that it was the right
path forward; we only needed to dispel some of the emotional baggage around our
legacy cluster.

This is the first in a series of posts where we will describe the steps we took
at Scribd to migrate our existing on-premise workloads to [Managed
Streaming for Apache Kafka](https://aws.amazon.com/msk/) (MSK) in AWS. In this
introductory article, I will focus on the initial work we did to quantify our
existing workloads and set up our evaluation.

## Kicking the Tires

Reducing operational complexity and costs meant we needed to evaluate Kafka
cloud and SaaS providers, contrasting the pros and cons of their offerings.
Price certainly leaps to mind as one of the first comparisons to make, but
since we were focused on features and architecture, we deferred all price
comparisons. I believe that this is actually the only way to do a valid
comparison between cloud providers, since managed service providers do not all
offer the same features.

We focused on a few questions to anchor the evaluation:

* **How will the platform handle our existing workloads?** Naturally, we wanted
  to migrate our existing producers and consumers to the new service, so they
  needed to integrate smoothly.
* **How will we grow new workloads in the platform?** Our pre-existing
  workloads are only the start; we have high ambitions for what our Kafka-based
  future looks like, and the provider would need to grow with our needs.
* **How well can we secure and manage it?** Our legacy on-premise deployment
  was wide-open internally 😱. Client applications did not need to authenticate
  to write messages to a specific topic; they only needed network access. In
  our new cluster, we wanted to make sure each client was constrained by ACLs
  (Access Control Lists).

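On Kafka 2.x clusters, per-client topic ACLs like these can be managed with the stock `kafka-acls.sh` tool. A hedged sketch, where the broker address, principal DN, client properties file, and topic name are all placeholders rather than our actual values:

```shell
# Illustrative only: grant one client principal write access to one topic.
# With TLS authentication, the principal is the client certificate's
# Distinguished Name.
kafka-acls.sh --bootstrap-server broker:9094 \
  --command-config client-tls.properties \
  --add --allow-principal "User:CN=event-player.example.com" \
  --operation Write \
  --topic events-ingress
```

With ACLs in place, a client presenting any other certificate is denied writes to the topic, which is precisely the constraint our wide-open legacy cluster lacked.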
### Prototype Approach

To get some hands-on experience with each platform, we implemented a prototype
to represent one of our larger analytics workloads. Not only was it a big data
stream to work with, but it would also benefit from some extensions we could
imagine for the future, such as message debatching, routing, and validation
streams. The prototype included "event players" to send fake messages at a
rate similar to what we actually receive in production, as well as mock
downstream applications which applied additional stream processing duties.

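Our event players were custom applications, but the stock `kafka-producer-perf-test.sh` tool sketches the same idea: pushing synthetic records at a controlled rate. The numbers and names below are placeholders, not our production figures:

```shell
# Replay one million synthetic 512-byte records, throttled to 5,000
# messages/sec, against a (placeholder) ingress topic.
kafka-producer-perf-test.sh \
  --topic events-ingress \
  --num-records 1000000 \
  --record-size 512 \
  --throughput 5000 \
  --producer-props bootstrap.servers=broker:9092
```

The `--throughput` flag is what makes this useful as an event player: it caps the send rate so the cluster sees production-like load rather than a burst.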
Each "add-on" extension to the existing stream was implemented as a
[KStreams](https://kafka.apache.org/10/documentation/streams/developer-guide/)
application, reading from the ingress stream and writing to a new downstream
topic. The end-to-end pipeline was realized by this series of KStreams
applications, each enriching or directing the original message in some way
before writing it to a topic. From a consumer perspective, this flexibility in
the pipeline was very important to us. Consumers who only wanted validated data
could subscribe to the "validated" topic to get messages in real time, whereas
consumers who wanted the firehose could hook up to the ingress topic.

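As a rough sketch of what one such stage might look like, here is a minimal KStreams application that filters an ingress topic into a "validated" topic. The topic names, application id, and the validation predicate are illustrative, not our actual pipeline:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ValidationStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-validator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> ingress = builder.stream("events-ingress");

        // Keep only records that pass a (placeholder) validation check,
        // publishing them to the topic that "validated-only" consumers read.
        ingress.filter((key, value) -> value != null && value.startsWith("{"))
               .to("events-validated");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Chaining several small applications like this one, each reading an upstream topic and writing an enriched downstream topic, is what produced the end-to-end pipeline described above.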
88+
The final version of our prototype included enough transforms to magnify the
89+
original workload traffic by about 4X. This provided us with useful information
90+
for capacity planning and vendor evaluation, since it gave us a way to see how
91+
the enrichments we desired in our pipeline might impact the overall load and
92+
storage demands on the cluster. It also gave us some perspective on the
93+
complexity of operating multiple streaming applications to deliver higher
94+
quality data in real-time.
95+
96+
### Vendor Evaluation

Determining the vendor for the future of your streaming data platform is no
light-hearted decision! For each vendor, we created the topics and ACLs
necessary to support our prototype pipeline, and then we ran the prototype
workload against a cluster created on each vendor's platform.

While configuring ACLs and Kafka client properties, "authentication mode
support" stood out as one of the big differences between AWS MSK and other
providers. AWS MSK **only** supports TLS authentication using client
certificates and [AWS
PCAs](https://docs.aws.amazon.com/acm-pca/latest/userguide/PcaWelcome.html)
(Private Certificate Authorities). TLS authentication is a bit painful at the
outset, but as we got more comfortable it became less problematic. The
cluster-side setup for TLS authentication is _much_ easier with AWS MSK than
attempting to implement client certificates in a traditional on-premise Kafka
deployment. A separate post in this series will go into more detail related to
our implementation of Kafka TLS authentication.

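Client-side, enabling TLS authentication comes down to a handful of standard Kafka client properties. A sketch with placeholder paths and passwords, where the keystore holds the client certificate issued by the private CA and the truststore holds the CAs used to verify the brokers:

```properties
# Illustrative client TLS settings; all paths and passwords are placeholders.
security.protocol=SSL
ssl.keystore.location=/etc/kafka/client.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/etc/kafka/client.truststore.jks
ssl.truststore.password=changeit
```

The same properties file works for producers, consumers, and the command-line tools alike, which keeps the per-client configuration burden small once the certificates exist.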
Another major consideration for us has been _monitoring_. AWS MSK turned out
to be a much better fit for us than others, since we were able to pull metrics
directly from MSK clusters into [Datadog](https://datadoghq.com). This allowed
us to view MSK metrics together with our other operational metrics. Datadog's
own [MSK integration](https://docs.datadoghq.com/integrations/amazon_msk/) made
the integration not much harder than a couple of button clicks.

---

Feeling confident that we knew how to secure, manage, and grow our Kafka
environments across a number of available offerings, we ultimately settled on
AWS MSK to power the streaming platform of Scribd's future.

I look forward to sharing more about our Kafka migration into the cloud in
subsequent posts in this series. We'll dive deeper into TLS authentication with
MSK, discuss how we prepared our existing producers and consumers for migration
from our old 0.10 Kafka to a _much_ newer version (2.2.1), and explain how we
implemented the gradual rollover to the new cluster with no data loss and no
downtime!
