I am ingesting data via pub/sub to a dataflow pipeline which is running in unbounded mode. The data are basically coordinates with timestamps captured from tracking devices. Those messages arrive in batches, where each batch might be 1..n messages. For a certain period there might be no messages arriving, which might be resent later on (or not). We use the time-stamp (in UTC) of each coordinate as an attribute for the pub-sub message. And read the pipeline via a Timestamp label:
pipeline.apply(PubsubIO.Read.topic("new").timestampLabel("timestamp")
An example of coordinates and delay looks like:
36 points wait 0:02:24
36 points wait 0:02:55
18 points wait 0:00:45
05 points wait 0:00:01
36 points wait 0:00:33
36 points wait 0:00:43
36 points wait 0:00:34
A message might look like:
2013-07-07 09:34:11;47.798766;13.050133
After the first batch the Watermark is empty, after the second batch I can see a Watermark in the Pipeline diagnostics, just it doesn't get updated, although new messages arrive. Also according to stackdriver logging PubSub has no undelivered or unacknowledged messages.
Shouldn't the watermark move forward as messages with new event time arrive?
According to What is the watermark heuristic for PubsubIO running on GCD? the WaterMark should also move forward every 2minutes which it doesn't?
[..] In the case that we have not seen data on the subscription in more
than two minutes (and there's no backlog), we advance the watermark to
near real time. [..]
Update to address Bens questions:
Is there a job ID that we could look at?
Yes I just restarted the whole setup at 09:52 CET which is 07:52 UTC, with job ID 2017-05-05_00_49_11-11176509843641901704.
What version of the SDK are you using?
1.9.0
How are you publishing the messages with the timestamp labels?
We use a python script to publish the data which is using the pub sub sdk.
A message from there might look like:
{'data': {timestamp;lat;long;ele}, 'timestamp': '2017-05-05T07:45:51Z'}
We use the timestamp attribute for the timestamplabel in dataflow.
What is the watermark stuck at?
For this job the watermark is now stuck at 09:57:35 (I am posting this around 10:10), although new data is sent e.g. at
10:05:14
10:05:43
10:06:30
I can also see that it may happen that we publish data to pub sub with delay of more than 10 seconds e.g. at 10:07:47 we publish data with a highest timestamp of 10:07:26.
After a few hours the watermark catches up but I cannot see why it is delayed /not moving in the beginning.
See Question&Answers more detail:
os