Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

How to "join" a frequently updating stream with an irregularly updating stream in Apache Beam?

I have a stream of measurements keyed by an ID, PCollection<KV<ID,Measurement>>, and something like a changelog stream of additional information for that ID, PCollection<KV<ID,SomeIDInfo>>. New data is added to the measurement stream quite regularly, say once per second for every ID. The stream with additional information, on the other hand, is only updated when a user performs a manual re-configuration. We can't tell how often this happens and, in particular, the update frequency may vary among IDs.

My goal is to enrich each entry in the measurements stream with the additional information for its ID. That is, the output should be something like PCollection<KV<ID,Pair<Measurement,SomeIDInfo>>>. In other words, I would like to do a left join of the measurements stream with the additional information stream.
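To make the intended semantics concrete, here is a minimal, Beam-independent sketch of such a left join over a single time-ordered sequence of events. The class and method names (StreamTableJoinSketch, Event, leftJoin) and the String/double stand-ins for ID, Measurement and SomeIDInfo are assumptions made for illustration only; this is not Beam code and it ignores distribution, windowing and out-of-order data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the desired join semantics, not a Beam pipeline.
final class StreamTableJoinSketch {

    /** One input element: a measurement (info == null) or an info update (measurement == null). */
    static final class Event {
        final String id;
        final Double measurement;
        final String info;

        private Event(String id, Double measurement, String info) {
            this.id = id;
            this.measurement = measurement;
            this.info = info;
        }

        static Event measurement(String id, double value) {
            return new Event(id, value, null);
        }

        static Event infoUpdate(String id, String info) {
            return new Event(id, null, info);
        }
    }

    /** Pairs every measurement with the latest info seen so far for its ID (null if none yet). */
    static List<String> leftJoin(List<Event> eventsInTimeOrder) {
        Map<String, String> latestInfo = new HashMap<>(); // the "table" side of the join
        List<String> enriched = new ArrayList<>();
        for (Event e : eventsInTimeOrder) {
            if (e.info != null) {
                latestInfo.put(e.id, e.info); // changelog update: overwrite the previous info
            } else {
                // Left join: the measurement is always emitted, even without info.
                enriched.add(e.id + " " + e.measurement + " " + latestInfo.get(e.id));
            }
        }
        return enriched;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(Event.measurement("a", 1.0));
        events.add(Event.infoUpdate("a", "config-v1"));
        events.add(Event.measurement("a", 2.0));
        events.add(Event.infoUpdate("a", "config-v2"));
        events.add(Event.measurement("a", 3.0));
        leftJoin(events).forEach(System.out::println);
    }
}
```

In this sketch, a measurement that arrives before any info for its ID is still emitted, with null as its info, which is what makes it a left join rather than an inner join.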

I would expect this to be quite a common use case. Coming from Kafka Streams, I know this can be implemented quite easily with a KStream-KTable join. With Beam, however, none of my approaches so far seems to work. I have already considered the following ideas.

Idea 1: CoGroupByKey with fixed time windows

Applying a window to the measurements stream would not be an issue. However, as the additional information stream is updated irregularly and also significantly less frequently than the measurements stream, there is no reasonable common window size such that every window contains at least one information update for each ID.

Idea 2: CoGroupByKey with global window and a non-default trigger

Refining the previous idea, I thought about using a processing-time trigger that fires, for example, every 5 seconds. The issue with this idea is that I need to use accumulatingFiredPanes() for the additional information, as there might be no new data for a key between two firings, but I have to use discardingFiredPanes() for the measurements stream, as otherwise my panes would quickly become too large. This simply does not work: when I configure my pipeline that way, the additional information stream also discards changes. Setting both triggers to accumulating works but, as I said, this is not scalable.
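The dilemma can be illustrated with a toy model of the pane contents for a single key. This is not Beam code: PaneModeSketch and panes are hypothetical names, and the model assumes that both inputs end up in one per-key buffer with a single accumulation mode, which matches the behaviour described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of accumulating vs. discarding panes for one key (an assumption, not Beam's implementation).
final class PaneModeSketch {

    /** Returns the pane emitted at each firing, given the elements that arrived before that firing. */
    static List<List<String>> panes(List<List<String>> arrivalsPerFiring, boolean accumulating) {
        List<List<String>> emitted = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (List<String> arrivals : arrivalsPerFiring) {
            buffer.addAll(arrivals);
            emitted.add(new ArrayList<>(buffer)); // the pane is a snapshot of the buffer
            if (!accumulating) {
                buffer.clear(); // discarding mode forgets everything that was already emitted
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<List<String>> arrivals = Arrays.asList(
                Arrays.asList("info:v1", "m1", "m2"),
                Arrays.asList("m3", "m4"),
                Arrays.asList("m5"));
        System.out.println("discarding:   " + panes(arrivals, false));
        System.out.println("accumulating: " + panes(arrivals, true));
    }
}
```

In discarding mode, only the first pane contains the info, so later measurements cannot be enriched. In accumulating mode, every pane contains the info, but it also repeats all earlier measurements and grows with every firing.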

Idea 3: Side inputs

Another idea would be to use side inputs, but this solution is not really scalable either, unless I am missing something. With side inputs, I would create a PCollectionView from the additional information stream, which is a map of IDs to the (latest) additional information. The "join" can then be done in a DoFn with a side input of that view. However, the view seems to be shared by all instances that perform the side input. (It's a bit hard to find any information regarding this.) We would like to avoid making any assumptions about the number of IDs and the size of the additional info. Thus, using a side input does not seem to work here either.


question from:https://stackoverflow.com/questions/66067263/how-to-join-a-frequently-updating-stream-with-an-irregularly-updating-stream-i


1 Answer

The side input option you discuss is currently the best option, although you are correct about the scalability concern due to the side input being broadcast to all workers.

Alternatively, you can store the infrequently-updated side in an external key-value store and just do lookups from a DoFn. If you go this route, it's generally useful to do a GroupByKey first on the main input with ID as a key, which lets you cache the lookups with a good cache-hit ratio.
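The effect of grouping by ID before the lookup can be sketched with a small simulation. Everything here is hypothetical: Worker stands in for one DoFn instance with its own cache, the store function stands in for the external key-value store client, and the two routing modes only approximate what a runner does with and without a preceding GroupByKey.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy simulation of per-worker caching in front of an external store (not Beam code).
final class CachedLookupSketch {

    /** Small LRU cache: evicts the least recently used entry once the capacity is exceeded. */
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // true = iterate in access order, which gives LRU eviction
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

    /** One simulated worker with its own cache in front of the (hypothetical) external store. */
    static final class Worker {
        private final Map<Integer, String> cache;
        private final Function<Integer, String> store;
        int storeCalls = 0;

        Worker(int cacheCapacity, Function<Integer, String> store) {
            this.cache = new LruCache<>(cacheCapacity);
            this.store = store;
        }

        String enrich(int id, double measurement) {
            String info = cache.get(id);
            if (info == null) {
                storeCalls++; // cache miss: go to the external store
                info = store.apply(id);
                if (info != null) {
                    cache.put(id, info);
                }
            }
            return id + ":" + measurement + ":" + info;
        }
    }

    /** Counts store calls when element i (with ID i % ids) is routed by key or round-robin. */
    static int countStoreCalls(int ids, int workers, int elements, int capacity, boolean routeByKey) {
        List<Worker> pool = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            pool.add(new Worker(capacity, id -> "info-" + id));
        }
        for (int i = 0; i < elements; i++) {
            int id = i % ids;
            int target = routeByKey ? id % workers : i % workers;
            pool.get(target).enrich(id, i);
        }
        int total = 0;
        for (Worker w : pool) {
            total += w.storeCalls;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("routed by key: " + countStoreCalls(3, 2, 12, 2, true));
        System.out.println("round-robin:   " + countStoreCalls(3, 2, 12, 2, false));
    }
}
```

With three IDs, two workers and a cache of two entries per worker, routing by key needs one store call per ID, while round-robin routing misses the cache on every element. A real cache would also need some form of expiry or invalidation, since the stored information changes whenever a user re-configures an ID.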

