Recently, I came across a scenario that, at first glance, called for an active-active Kafka cluster with bi-directional replication using MirrorMaker2. However, as we dug deeper into what this particular project actually demanded, several factors showed this wasn’t really the case. For example, working out how the producers and consumers would handle temporary failures and/or guarantee exactly-once processing downstream added a ton of complexity to the problem.
Going back to the drawing board, we changed the approach to an active-standby scenario. In this scenario, the active cluster replicates all its configs, consumer groups, and topics to the standby cluster, which becomes active in case of failure. However, this posed a problem: how can we keep producer downtime to a minimum without having to restart the producers with the new cluster’s configuration?