Flink Kafka->S3 exactly once guarantees


amran dean

Suppose I am using a nondeterministic, time-based partitioning scheme (e.g. Flink processing time) to bucket S3 objects via a BucketAssigner, configured through the BulkFormatBuilder of StreamingFileSink.

Suppose that after an S3 multipart upload (MPU) has completed, but before Flink commits the newest offsets (whether via ZooKeeper or directly to Kafka), the job crashes, so those offsets are never committed.

If I bucket S3 objects by processing time, will this result in duplicate objects being written?
For example:
An S3 object with timestamp 10-25 is written, but the job crashes before the offsets are committed.
When the job resumes, the system time is 10-26, so instead of overwriting the existing S3 object, a new one is created, duplicating the data.
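
To make the scenario concrete, here is a minimal sketch in plain Java of the failure I am worried about. It uses no Flink APIs and does not model Flink's actual checkpointing or recovery. The class name, the "MM-dd" bucket format, the in-memory map standing in for S3, and the year in the timestamps are all assumptions made for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ProcessingTimeBucketDemo {
    // Hypothetical bucket naming: month-day of the wall clock at write time.
    private static final DateTimeFormatter BUCKET_FORMAT =
            DateTimeFormatter.ofPattern("MM-dd").withZone(ZoneOffset.UTC);

    // The bucket depends only on "now", not on the record, so it is nondeterministic on replay.
    static String bucketFor(Instant processingTime) {
        return BUCKET_FORMAT.format(processingTime);
    }

    // Stand-in for an S3 write: append the records to the object for the current bucket.
    static void write(Map<String, List<String>> store, List<String> records, Instant now) {
        store.computeIfAbsent(bucketFor(now), k -> new ArrayList<>()).addAll(records);
    }

    static Map<String, List<String>> simulate() {
        Map<String, List<String>> store = new TreeMap<>();
        List<String> batch = List.of("a", "b", "c");

        // First attempt: the object is written on 10-25 (the year is arbitrary).
        write(store, batch, Instant.parse("2019-10-25T23:50:00Z"));

        // Crash before the offsets are committed: the same records are read again
        // after restart, when the wall clock has moved on to 10-26.
        write(store, batch, Instant.parse("2019-10-26T00:05:00Z"));
        return store;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

In this naive model the same three records end up under both the 10-25 and the 10-26 bucket, which is exactly the duplication described above.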

How does Flink prevent this?