
Commit e1c969f

Add the next blog post in the Kafka series, capturing the rollover approach
1 parent e5c95d3 commit e1c969f

6 files changed, +197 -0 lines changed

_data/authors.yml

Lines changed: 4 additions & 0 deletions
@@ -62,3 +62,7 @@ harinii:
 christianw:
   name: Christian Williams
   github: xianwill
+
+trinityx:
+  name: Trinity Xie
+  github: hnaoto
Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
---
layout: post
title: "Changing Kafka brokers with zero data loss"
author: trinityx
tags:
- kafka
- msk
- featured
- msk-series
team: Core Platform
---

Migrating streams from one Kafka cluster to another can be challenging, even
more so when the migration is between two major versions, two different
platforms, and from one datacenter to another. We managed to [migrate Kafka to
the Cloud](/blog/2019/migrating-kafka-to-aws.html) on **hard mode**, without
any downtime and with zero data loss. When we were first planning the
migration, we joked that we wanted to avoid a single moment where we would try
to swap Kafka all at once, like Indiana Jones tried in [Raiders of the Lost
Ark](https://www.youtube.com/watch?v=0gU35Tgtlmg). In this post, I will share
more about our Kafka migration to AWS MSK, and how we tried to avoid "Indiana
Jones moments."

Our goal was zero downtime and zero data loss. The approach we adopted was a
gradual rollover, one which worked so well that we will be re-using it to move
Kafka traffic during future migrations.

Since we were talking about two very different Kafka environments, the
changeover was a bit more involved than simply diverting traffic. Our legacy
Kafka environment was version 0.10, while the new environment ran 2.10. Our
legacy cluster did not run with authentication; the new cluster required TLS
client certificate-based authentication for producers and consumers. And while
our legacy cluster was colocated in the same on-premises datacenter as the
producers and consumers, the new Kafka cluster was to be hosted in AWS, in a
region connected to the datacenter by a Direct Connect.

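To make that difference concrete, here is a minimal sketch of the two producer
configurations in Python with the `kafka-python` client. Our producers were
not written this way; the client library, broker addresses, and certificate
paths below are stand-ins for illustration only.

```python
from kafka import KafkaProducer

# Legacy on-premises cluster (Kafka 0.10): no authentication.
legacy_producer = KafkaProducer(
    bootstrap_servers=["legacy-kafka:9092"],  # placeholder address
)

# New MSK cluster: TLS client certificate-based authentication.
msk_producer = KafkaProducer(
    bootstrap_servers=["msk-broker:9094"],    # placeholder address
    security_protocol="SSL",
    ssl_cafile="/etc/kafka/certs/ca.pem",     # placeholder certificate paths
    ssl_certfile="/etc/kafka/certs/client.pem",
    ssl_keyfile="/etc/kafka/certs/client.key",
)
```
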
Our approach had four phases, which were executed in sequence:

1. **Send test writes to the new cluster**: for each production message written
to the old cluster, a test message would be written to the new cluster. This
helped us verify our capacity plan and configuration.
1. **Double writes to both clusters**: for each production message, writes
would be issued to both clusters simultaneously, allowing downstream
verification that both data streams were operating identically.
1. **Split writes to the new cluster at 10%**: one of every ten production
messages would be written solely to the new cluster.
1. **Split writes to the new cluster at 100%**: every production message would
be written to the new environment.

![Indiana Jones](/post-images/2020-01-kafka-series/indiana-jones.gif)

The data passing through our clusters is essentially analytics data.
Conceptually it flows in "one direction": from the producers into Kafka, from
which consumers retrieve the data and store it in our data warehouse. To
support our gradual rollover approach, we needed to first update all our
producers and consumers. Our producers needed to support sending duplicate
messages to both clusters. The consumers needed to support consuming from
multiple clusters, joining the streams, and persisting the messages in the
same (usual) location in HDFS.

![Simple diagram of an example Spark-based workflow](/post-images/2020-01-kafka-series/kafka-spark.png)
*An example data flow.*

Once our producers and consumers were ready for the migration, we ran our test
writes to verify that our new MSK-based cluster could handle the production
traffic loads. As we expected, it handled everything smoothly! What we did not
expect was the additional latency the second write call introduced. Since API
endpoints needed to complete a synchronous "write" across the datacenter
boundary into AWS before returning a 200 response, we saw the average response
time for our endpoints go up 50-80ms, with the 99th percentile taking almost
150-200ms longer! Fortunately, the added latency did not adversely impact the
key performance indicators (KPIs) of the producer application.

With the test write phase completed, we could finally start the _actual_
migration of data. Although we had done end-to-end testing in our development
environment, the gradual rollover of production was still essential. Therefore
we continued with the next three phases:

1. Double-writes
1. Split-writes
1. Single-writes

## Double-writes

For this phase, the producers needed to write each message twice, into both
Kafka clusters. The production consumer remained the same, and continued to
read messages from the old production cluster.

Writing data from all the hosts to both Kafka clusters might seem redundant;
however, it was critical to verify that all the hosts could write data into
the new cluster, with its different authentication scheme and different
version.

![Double-writes diagram](/post-images/2020-01-kafka-series/kafka-double-writes.png)

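Reusing the two producer objects from the configuration sketch earlier, the
double-write itself boils down to issuing the same send against both clients.
The error handling below is an assumption about how one might keep a failed
write to the new cluster from failing the request; it is not a copy of our
production code.

```python
import logging

log = logging.getLogger(__name__)

def publish(topic: str, payload: bytes):
    """Double-write phase: every production message goes to both clusters."""
    # The legacy cluster is still the source of truth at this point, so its
    # write is allowed to raise; a failed write to the new cluster is only
    # logged while we build confidence in it.
    legacy_producer.send(topic, payload).get(timeout=10)
    try:
        msk_producer.send(topic, payload).get(timeout=10)
    except Exception as exc:
        log.warning("double-write to the new cluster failed: %s", exc)
```

In this sketch, the blocking `.get()` on the second send is the analogue of
the synchronous cross-datacenter write that added the extra response time
described above.
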
## Split-write

After the double-write phase, we were fairly confident about using the new
Kafka cluster. Nonetheless, switching clusters entirely in one shot would have
been unnecessarily risky. We started with a gradual rollover of 10%, which we
decided was an acceptable amount of messages to lose for a couple minutes if
something were to go wrong in that "worst-case scenario."

Our approach was to make it such that 10% of our **hosts** would write data to
the new Kafka cluster. The rest of the hosts still wrote data into the old
cluster. We structured the split at the host level, rather than on each
individual message, to reduce the runtime complexity and potential debugging
we might need to do in production. If something went wrong, we could comb
through the logs of only a couple hosts, or quickly remove them from the load
balancer.

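Our actual 10% selection was a deployment-level decision, so the hashing
scheme below is only one illustrative way to express a stable host-level
split; the function name and percentage knob are made up for the example.

```python
import socket
import zlib

def writes_to_new_cluster(rollout_percent: int = 10) -> bool:
    """Decide once per host whether this host produces to the new cluster.

    Hashing the hostname gives a stable, roughly uniform split with no
    per-message bookkeeping; raising rollout_percent to 100 completes the
    cutover.
    """
    bucket = zlib.crc32(socket.gethostname().encode("utf-8")) % 100
    return bucket < rollout_percent
```

Because the decision is made per host, the traffic that moves is always
traceable to a known, small set of machines.
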
To support this phase, the consumers had been deployed with their union logic
enabled, allowing them to combine messages consumed from both clusters before
writing them into long-term storage (HDFS).

![Split-writes diagram](/post-images/2020-01-kafka-series/kafka-split-writes.png)

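Our real consumers were Spark jobs, but the shape of the union logic is easier
to see with two plain clients. The sketch below again uses `kafka-python` with
made-up topic, group, and broker names; it only illustrates the "consume from
both clusters, emit one stream" idea.

```python
from kafka import KafkaConsumer

SSL_OPTS = dict(
    security_protocol="SSL",
    ssl_cafile="/etc/kafka/certs/ca.pem",      # placeholder certificate paths
    ssl_certfile="/etc/kafka/certs/client.pem",
    ssl_keyfile="/etc/kafka/certs/client.key",
)

old_consumer = KafkaConsumer("events", group_id="warehouse-loader",
                             bootstrap_servers=["legacy-kafka:9092"])
new_consumer = KafkaConsumer("events", group_id="warehouse-loader",
                             bootstrap_servers=["msk-broker:9094"], **SSL_OPTS)

def consume_union():
    """Poll both clusters and yield a single merged stream of records."""
    while True:
        for consumer in (old_consumer, new_consumer):
            for records in consumer.poll(timeout_ms=500).values():
                for record in records:
                    yield record  # downstream code persists these to HDFS
```
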
## Single-write

To complete the migration, all hosts were configured to send their traffic to
the new Kafka cluster, although we left the consumer reading data from both
clusters "just in case."

Only after manually verifying that the legacy Kafka cluster was receiving
absolutely zero new messages did we toggle the consumers to read from only one
cluster: the new MSK-based Kafka cluster.

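That "zero new messages" check can be as simple as confirming that the end
offsets on the legacy cluster have stopped advancing. A minimal sketch of the
idea (the topic name, broker address, and wait time are arbitrary):

```python
import time
from kafka import KafkaConsumer, TopicPartition

def legacy_topic_is_quiet(topic="events", wait_seconds=300):
    """Return True if the legacy topic's end offsets did not advance."""
    consumer = KafkaConsumer(bootstrap_servers=["legacy-kafka:9092"])
    partitions = [TopicPartition(topic, p)
                  for p in consumer.partitions_for_topic(topic)]
    before = consumer.end_offsets(partitions)
    time.sleep(wait_seconds)  # give any stragglers time to show up
    after = consumer.end_offsets(partitions)
    consumer.close()
    return before == after
```
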
---

With 100% of Kafka producers and consumers interacting with the new cluster in
AWS, we were able to tackle the most fun part of the project: decommissioning
the old Kafka 0.10 cluster. We were not able to give it a Viking funeral, but
I would have liked to.

---

## Unexpected Challenges

Zero data loss doesn't mean that there were zero problems along the way. We
did run into some unexpected issues. Fortunately, no major issues manifested
in production; we only saw a little additional latency.

While we were testing the consumers with the new Kafka cluster, I noticed that
some older jobs, which were running under [Apache Spark](https://spark.apache.org)
version 1, simply would not work with an MSK cluster. That version of
`spark-streaming-kafka_2.10` uses the Kafka `SimpleConsumer` internally, which
does not accept Kafka properties or authentication parameters. Since we were
switching to TLS client certificate-based authentication, we had to overcome
this roadblock. As luck would have it, we were already planning to upgrade
those older jobs to Spark 2, which includes Spark streaming changes among
other improvements, and the issue was resolved.

We also ran into some issues with the
[ruby-kafka](https://rubygems.org/gems/ruby-kafka) gem. While upgrading older
libraries across producers and consumers to support the newer MSK cluster, we
discovered that newer versions of the gem could not write messages to the
_old_ Kafka cluster. This didn't prevent our rollover of Ruby producers: since
our older gem version could write to both clusters, we simply delayed the
upgrade of the `ruby-kafka` gem until the cluster rollover was complete.

## Final Thoughts

Migrating Kafka clusters can be risky and stressful, and planning is critical.
Multiple producers and consumers can complicate major upgrades like this one.
We created a central document for everybody to coordinate around, which
captured the details of the different phases, along with the Jira tickets
associated with each task. This proved to be incredibly valuable, and turned
into our overall project dashboard for everybody involved.

In an ideal case, producers and consumers have not been changed in a while.
They should carry on their jobs without constant supervision. While this is a
_good_ thing, it also means that there will be some code rot, and additional
upgrades that should be considered along the way.

Finally, having strong monitoring in place _before_ the migration begins is
critical. We had set up [Datadog](https://datadoghq.com) dashboards for the
legacy and future clusters before we even began the migration. This helped us
understand what "normal" looked like for both clusters before we started
making significant changes.

Rolling over to new Kafka clusters can be time-consuming, but I think the
extra time and patience to do it _right_ is worth the effort.