Exactly-once delivery is an unexpectedly difficult feat to achieve in distributed event streaming systems.
Background
Diagram
Definitions
- Producer: The service that sends events into the event stream.
- Event Stream: The service (often Kafka) responsible for handling the stream of events.
- Consumer: The service that reads events out of the event stream.
- Events: Standalone pieces of information. For example:
{action: "User Logged In", user_id: 123, timestamp: "2020-02-10"}
Why would a message be delivered twice?
Two points of context to help explain why exactly-once is hard to guarantee:
- Event streams are distributed systems where at least two services are talking to each other over a network. Networks can fail.
- Event streams regularly handle billions of messages a day. At that scale, a “one in a million” edge case happens thousands of times per day.
With that in mind, here are two examples of failures that can result in a message being delivered twice:
Duplicate inbound messages
If a producer, when sending messages into an event stream, gets an “Error” response, should it try again?
- If the error occurred at any point before the event stream saved the message, then yes, the producer should re-send it.
- If the error occurred after the event stream saved the message and the producer re-sends it, the same event ends up stored twice, as the sketch below illustrates.
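
Here is a self-contained sketch of that second case, assuming a hypothetical in-memory stream and a producer that retries on any error. The lost acknowledgment is simulated, but the resulting duplicate is exactly the problem described above.

```python
# Hypothetical sketch of the "lost acknowledgment" failure: the stream saves
# the event, but the success response never reaches the producer, so the
# producer retries and the event is stored twice. All names are illustrative.
class AckLostError(Exception):
    """Simulates a network failure after the stream has persisted the event."""

stream: list[dict] = []
ack_lost = True  # pretend the first acknowledgment is dropped by the network

def send_to_stream(event: dict) -> None:
    global ack_lost
    stream.append(event)          # the stream durably saves the event...
    if ack_lost:
        ack_lost = False
        raise AckLostError()      # ...but the "OK" never makes it back

def produce_with_retry(event: dict, attempts: int = 3) -> None:
    for _ in range(attempts):
        try:
            send_to_stream(event)
            return                # success acknowledged, stop retrying
        except AckLostError:
            continue              # producer assumes the send failed and retries

produce_with_retry({"action": "User Logged In", "user_id": 123})
print(len(stream))  # 2 -- the same event was delivered twice
```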
Duplicate outbound messages
Equally important is the guarantee that the same event is not sent to a consumer twice. This can happen under conditions similar to the previous case. If a consumer asks for event #100 and then crashes, which event should the stream send when the consumer restarts: #100 or #101?
Again, it depends on whether the consumer crashed before or after processing the message.
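
The sketch below, using hypothetical names and an in-memory stream, shows one side of that trade-off: the consumer processes first and commits its offset second (at-least-once), so a crash in between means event #100 is redelivered and processed twice on restart. Committing before processing flips the problem: a crash after the commit means the event is skipped entirely (at-most-once).

```python
# Sketch of why commit timing matters for the consumer (illustrative names only).
# Committing the offset *after* processing gives at-least-once delivery: a crash
# between processing and commit means the same event is redelivered on restart.
stream = [{"id": i, "action": "User Logged In"} for i in range(100, 103)]
committed_offset = 0   # last position the stream believes the consumer finished

def process(event: dict) -> None:
    print("processing", event["id"])

def consume_at_least_once(crash_after_processing: bool = False) -> None:
    global committed_offset
    event = stream[committed_offset]
    process(event)                     # side effects happen first...
    if crash_after_processing:
        raise RuntimeError("consumer crashed before committing offset")
    committed_offset += 1              # ...then the offset is committed

try:
    consume_at_least_once(crash_after_processing=True)
except RuntimeError:
    pass

# On restart the stream re-sends the same event, so event #100 is processed twice.
consume_at_least_once()
```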
Why exactly once guarantees are important
How exactly once is solved
State of exactly-once in event streaming systems
Kafka
Kinesis