Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.4k views
in Technique[技术] by (71.8m points)

kafka consumer api - Apache Beam, KafkaIO at least once semantics

We are implementing a pilot that reads from Kafka and writes to BigQuery.

Simple pipeline:

  • KafkaIO.read
  • BigQueryIO.write

We switched off the auto-commit. And we are using?commitOffsetsInFinalize()

Can this setup guarantee that message will appear at least once in BigQuery and will not be lost, provided that everything is ok on the BigQueryIO side?

In the documentation for commitOffsetsInFinalize() I've met the following: ?

It helps with minimizing gaps or duplicate processing of records while restarting a pipeline from scratch

I'm curious?what "gaps" here are being?referred to?

If you consider the edge cases, is there a possibility of messages being skipped and not delivered to BQ?

question from:https://stackoverflow.com/questions/65951040/apache-beam-kafkaio-at-least-once-semantics

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Committing the offset for Apache Kafka means that if you were to restart your pipeline, it would start at the position within the stream before you restarted. Dataflow does have a guarantee that data won't be dropped when writing to BigQuery. But, using a distributed system, there is always a potential for something to go wrong (for example, a GCP outage).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...