Hi, I'm trying to run a pipeline that calculates diffs between messages published to Pub/Sub with 30-second heartbeats* (10K streams, each heartbeating every 30 seconds). I don't need 100% data completeness, but I'd like to understand what watermark heuristic PubsubIO uses (and whether I can tweak it), so I can determine whether I can ignore late data with sufficiently low loss.
*Note: the Pub/Sub topic provides persistence (potentially days' worth) in case we have to take down the pipeline, so it's important that the heuristic works well with a backlogged subscription.
Can someone explain how the watermark is calculated (assuming timestampLabel() is used) and how it can be adjusted, if at all?