---
layout: post
title: "Changing Kafka brokers with zero data loss"
author: trinityx
tags:
- kafka
- msk
- featured
- msk-series
team: Core Platform
---

Migrating streams from one Kafka cluster to another can be challenging, even
more so when the migration is between two major versions, two different
platforms, and from one datacenter to another. We managed to [migrate Kafka to
the Cloud](/blog/2019/migrating-kafka-to-aws.html) on **hard mode**, without
any downtime and with zero data loss. When we were first planning the
migration, we joked that we wanted to avoid a single moment where we would try
to swap Kafka all at once, like Indiana Jones tried in [Raiders of the Lost
Ark](https://www.youtube.com/watch?v=0gU35Tgtlmg). In this post, I will share
more about our Kafka migration to AWS MSK, and how we tried to avoid "Indiana
Jones moments."

Our goal was zero downtime and zero data loss. The approach we adopted was a
gradual rollover, one which worked so well that we will be re-using it to move
Kafka traffic during future migrations.

Since we were talking about two very different Kafka environments, the
changeover was a bit more involved than simply diverting traffic. Our legacy
Kafka environment was version 0.10, while the new environment ran 2.10. Our
legacy cluster did not run with authentication; the new cluster required TLS
client certificate-based authentication for producers and consumers. And while
our legacy cluster was colocated in the same on-premises datacenter as the
producers and consumers, the new Kafka cluster was to be hosted in AWS, in a
region connected to the datacenter by a Direct Connect.

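To make that difference concrete, the sketch below shows roughly what
connecting to each cluster looks like. It uses the `kafka-python` client for
illustration only; our producers and consumers are written in several
languages, and the broker addresses and certificate paths here are
placeholders rather than our real configuration.

```python
from kafka import KafkaProducer

# Legacy on-premises cluster: plaintext, no client authentication.
legacy_producer = KafkaProducer(
    bootstrap_servers=["legacy-kafka-01:9092"],  # placeholder hostname
)

# New MSK cluster: TLS with client certificate-based authentication.
msk_producer = KafkaProducer(
    bootstrap_servers=["b-1.msk.example.amazonaws.com:9094"],  # placeholder hostname
    security_protocol="SSL",
    ssl_cafile="/etc/kafka/ca.pem",        # trust chain for the MSK brokers
    ssl_certfile="/etc/kafka/client.pem",  # client certificate
    ssl_keyfile="/etc/kafka/client.key",   # client private key
)
```
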
Our approach had four phases, which were executed in sequence:

1. **Send test writes to the new cluster**: for each production message written
   to the old cluster, a test message would be written to the new cluster. This
   helped us verify our capacity plan and configuration.
1. **Double writes to both clusters**: for each production message, writes
   would be issued to both clusters simultaneously, allowing downstream
   verification that both data streams were operating identically.
1. **Split writes to the new cluster at 10%**: one of every ten production
   messages would be written solely to the new cluster.
1. **Split writes to the new cluster at 100%**: every production message would
   be written to the new environment.

The data passing through our clusters is basically analytics data.
Conceptually it flows in "one direction": producers write to Kafka, and from
there consumers retrieve the data and store it in our data warehouse. To
support our gradual rollover approach, we needed to first update all our
producers and consumers. Our producers needed to support sending duplicate
messages to both clusters. The consumers needed to support consuming from
multiple clusters, joining the streams, and persisting the messages in the
same (usual) location in HDFS.

*An example data flow.*

Once our producers and consumers were ready for the migration, we ran our test
writes to verify that our new MSK-based cluster could handle the production
traffic loads. As we expected, it handled everything smoothly! What we did not
expect was the additional latency the second write call introduced. Since API
endpoints needed to complete a synchronous "write" across the datacenter
boundary into AWS before returning a 200 response, we saw the average response
time for our endpoints go up 50-80ms, with the 99th percentile taking almost
150-200ms longer! Fortunately, the added latency did not adversely impact the
key performance indicators (KPIs) of the producer application.

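As a rough illustration of where that latency shows up: the test write has to
be acknowledged by brokers on the far side of the Direct Connect before the
endpoint can respond. The sketch below reuses the hypothetical producers from
the earlier snippet, with an invented test topic name; it is not our actual
instrumentation, just a way to picture the extra round trip.

```python
import time

def handle_event(payload: bytes) -> None:
    # Production write to the colocated legacy cluster: a short, local round trip.
    legacy_producer.send("events", payload).get(timeout=5)

    # Test write to the MSK cluster: synchronous, so the cross-datacenter
    # round trip is added to the endpoint's response time.
    start = time.monotonic()
    msk_producer.send("events-test", payload).get(timeout=5)
    print(f"test write took {(time.monotonic() - start) * 1000:.1f}ms")
```
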
With the test write phase completed, we could finally start the _actual_
migration of data. Although we had done end-to-end testing in our development
environment, the gradual rollover of production was still essential. Therefore
we continued with the next three phases:

1. Double-writes
1. Split-writes
1. Single-writes

## Double-writes

For this phase, the producers needed to write each message twice, into both
Kafka clusters. The production consumer remained the same, and continued to
read messages from the old production cluster.

Writing data from all the hosts to both Kafka clusters might seem redundant;
however, it was critical to verify that all the hosts could write data into the
new cluster, with its different authentication scheme and different version.

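From a producer's point of view, the double-write is conceptually just a pair
of sends that both have to succeed. A minimal sketch, again reusing the
hypothetical producers from the earlier snippet (error handling and retries
omitted):

```python
def publish(topic: str, payload: bytes) -> None:
    """Phase 2: write every production message to both clusters."""
    futures = [
        legacy_producer.send(topic, payload),  # consumers still read from here
        msk_producer.send(topic, payload),     # verified downstream, not yet consumed
    ]
    # Block until both clusters have acknowledged the write.
    for future in futures:
        future.get(timeout=10)
```
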
## Split-write

After the double-write phase, we were fairly confident about using the new
Kafka cluster. Nonetheless, switching clusters entirely in one shot would have
been unnecessarily risky. We started with a gradual rollover of 10%, which we
decided was an acceptable amount of messages to lose for a couple of minutes if
something were to go wrong in that "worst-case scenario."

Our approach was to make it such that 10% of our **hosts** would write data to
the new Kafka cluster. The rest of the hosts still wrote data into the old
cluster. We structured the split at the host level, rather than on each
individual message, to reduce the runtime complexity and the potential
debugging we might need to do in production. If something went wrong, we could
comb through the logs of only a couple of hosts, or quickly remove them from
the load balancer.

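A host-level split can be expressed as a deterministic bucket derived from the
hostname, as in the sketch below; treat the hashing detail as illustrative
rather than the exact mechanism we used to select the 10% of hosts.

```python
import socket
import zlib

ROLLOVER_PERCENT = 10  # phase 3: 10% of hosts write only to the new cluster

def use_new_cluster() -> bool:
    # Deterministic per host: a given host always lands in the same bucket,
    # which keeps its traffic easy to find in the logs if something goes wrong.
    bucket = zlib.crc32(socket.gethostname().encode()) % 100
    return bucket < ROLLOVER_PERCENT

producer = msk_producer if use_new_cluster() else legacy_producer
```
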
To support this phase, the consumers had been deployed with their union logic
enabled, allowing them to combine messages consumed from both clusters before
writing them into long-term storage (HDFS).

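Conceptually, the union logic amounts to subscribing to the same topic on both
clusters and treating the merged stream as one. Our real consumers were Spark
jobs writing to HDFS; the `kafka-python` loop below, with its placeholder
brokers and hypothetical `persist_to_hdfs` sink, is only meant to illustrate
the merge.

```python
from kafka import KafkaConsumer

legacy_consumer = KafkaConsumer(
    "events",
    bootstrap_servers=["legacy-kafka-01:9092"],  # placeholder hostname
    group_id="warehouse-loader",
)
msk_consumer = KafkaConsumer(
    "events",
    bootstrap_servers=["b-1.msk.example.amazonaws.com:9094"],  # placeholder hostname
    security_protocol="SSL",
    ssl_cafile="/etc/kafka/ca.pem",
    ssl_certfile="/etc/kafka/client.pem",
    ssl_keyfile="/etc/kafka/client.key",
    group_id="warehouse-loader",
)

while True:
    # Poll both clusters and union the batches before persisting them.
    records = []
    for consumer in (legacy_consumer, msk_consumer):
        for batch in consumer.poll(timeout_ms=1000).values():
            records.extend(batch)
    if records:
        persist_to_hdfs(records)  # hypothetical sink standing in for the HDFS writer
```
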
## Single-write

To complete the migration, all hosts were configured to send their traffic to
the new Kafka cluster, although we left the consumer reading data from both
clusters "just in case."

Only after manually verifying that the legacy Kafka cluster was receiving
absolutely zero new messages did we toggle the consumers to read from only one
cluster: the new MSK-based Kafka cluster.

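One way to perform that verification is to watch the legacy topics' end
offsets: if no producer is still writing, the end offsets stop advancing. A
sketch of the idea with `kafka-python`, using a placeholder broker and topic
name:

```python
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=["legacy-kafka-01:9092"])  # placeholder
partitions = [TopicPartition("events", p)
              for p in consumer.partitions_for_topic("events")]

before = consumer.end_offsets(partitions)
time.sleep(300)  # wait a few minutes
after = consumer.end_offsets(partitions)

if all(after[tp] == before[tp] for tp in partitions):
    print("legacy cluster is quiet")
else:
    print("legacy cluster is still receiving new messages!")
```
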
---

With 100% of Kafka producers and consumers interacting with the new cluster in
AWS, we were able to tackle the most fun part of the project: decommissioning
the old Kafka 0.10 cluster. We were not able to give it a Viking funeral, but I
would have liked to.

---

## Unexpected Challenges

Zero data loss doesn't mean that there were zero problems. We did run into some
unexpected issues along the way. Fortunately, no major issues manifested in
production; we only saw a little more latency.

While we were testing the consumers with the new Kafka cluster, I noticed that
some older jobs, which were running under [Apache
Spark](https://spark.apache.org) version 1, simply would not work with an MSK
cluster. That version of `spark-streaming-kafka_2.10` uses the Kafka
`SimpleConsumer` internally, which does not accept Kafka properties or
authentication parameters. Since we were switching to TLS client
certificate-based authentication, we had to overcome this roadblock. As luck
would have it, we were already planning to upgrade those older jobs to Spark 2,
which includes Spark streaming changes among other improvements, and the issue
was resolved.

We also discovered some issues with the
[ruby-kafka](https://rubygems.org/gems/ruby-kafka) gem. While upgrading older
libraries across producers and consumers to support the newer MSK cluster, we
discovered that newer versions of the gem could not write messages to the _old_
Kafka cluster. This didn't prevent our rollover of Ruby producers: since our
older gem version could write to both clusters, we simply delayed the upgrade
of the `ruby-kafka` gem until the cluster rollover was complete.

## Final Thoughts

Migrating Kafka clusters can be risky and stressful, and planning is critical.
Multiple producers and consumers can complicate major upgrades like this one.
We created a central document for everybody to coordinate around, which
captured the details of the different phases, along with the Jira tickets
associated with each task. This proved to be incredibly valuable, and turned
into our overall project dashboard for everybody involved.

In the ideal case, producers and consumers have not been changed in a while;
they carry on their jobs without constant supervision. While this is a _good_
thing, it also means that there will be some code rot, and additional upgrades
which should be considered along the way.

Finally, strong monitoring _before_ the migration begins is critical. We had
set up [Datadog](https://datadoghq.com) dashboards for the legacy and future
clusters before we even began the migration. This helped us understand what
"normal" looked like for both clusters before we started making significant
changes.

Rolling over to new Kafka clusters can be time consuming, but I think the extra
time and patience to do it _right_ is worth the effort.