Your benchmark does not actually measure anyMatch performance, but rather stream overhead. This overhead can appear significant when compared to a very simple operation like a five-element array lookup.
The slowdown will not look so terrible if we switch from relative to absolute numbers. Let's measure latency instead of throughput for a clearer picture. I've omitted the lambdaIntStream benchmark since it works exactly the same way as lambdaArrayStream.
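For reference, the two code paths being compared can be sketched like this (a minimal reconstruction; the array contents and method bodies are assumptions based on the benchmark names, and the JMH annotations are omitted):

```java
import java.util.stream.IntStream;

// A sketch of the two benchmarked code paths (array contents and
// method bodies are assumptions inferred from the benchmark names).
public class Contains {
    static final int[] values = {10, 20, 30, 40, 50};

    // "naive": a plain counted loop over the array.
    static boolean naive(int val) {
        for (int v : values) {
            if (v == val) return true;
        }
        return false;
    }

    // "lambdaArrayStream": wraps the array in an IntStream and uses anyMatch.
    static boolean lambdaArrayStream(int val) {
        return IntStream.of(values).anyMatch(i -> i == val);
    }

    public static void main(String[] args) {
        // Both paths must agree on every input.
        System.out.println(naive(30) == lambdaArrayStream(30));  // true
        System.out.println(naive(99) == lambdaArrayStream(99));  // true
    }
}
```

Both methods compute the same result; the entire measured difference comes from how they get there.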
Benchmark Mode Cnt Score Error Units
Contains.lambdaArrayStream avgt 5 53,242 ± 2,034 ns/op
Contains.naive avgt 5 5,876 ± 0,404 ns/op
5.8 ns is roughly 14 cycles on a 2.4 GHz CPU. The workload is so small that every extra cycle is noticeable. So what is the overhead of the stream operations?
Object allocation
Now rerun the benchmark with the -prof gc profiler. It will show the amount of heap allocations:
Benchmark Mode Cnt Score Error Units
Contains.lambdaArrayStream:·gc.alloc.rate.norm avgt 5 152,000 ± 0,001 B/op
Contains.naive:·gc.alloc.rate.norm avgt 5 ≈ 10⁻⁴ B/op
lambdaArrayStream allocates 152 bytes per iteration, while the naive benchmark allocates nothing. Of course, allocation is not free: there are at least 5 objects constructed to support anyMatch, and each takes several nanoseconds:
- the lambda i -> i == val
- IntPipeline.Head
- Spliterators.IntArraySpliterator
- MatchOps.MatchOp
- MatchOps.MatchSink
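Note that these objects must be allocated on every invocation: a stream pipeline is single-use, so nothing can be cached between calls. A small demonstration (not part of the benchmark itself):

```java
import java.util.stream.IntStream;

// Demonstrates that a stream pipeline cannot be reused, which is why
// the pipeline objects above are allocated on every benchmark iteration.
public class StreamSingleUse {
    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5};
        IntStream stream = IntStream.of(values);           // allocates the pipeline
        System.out.println(stream.anyMatch(i -> i == 3));  // true; consumes the stream
        try {
            stream.anyMatch(i -> i == 4);                  // second terminal operation
        } catch (IllegalStateException e) {
            // "stream has already been operated upon or closed"
            System.out.println("reuse rejected");
        }
    }
}
```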
Call stack
The java.util.stream implementation is a bit complicated since it must support all combinations of stream sources, intermediate and terminal operations. If you look at the call stack of anyMatch in your benchmark, you'll see something like this:
at bench.Contains.lambda$lambdaArrayStream$0(Contains.java:24)
at java.util.stream.MatchOps$2MatchSink.accept(MatchOps.java:119)
at java.util.Spliterators$IntArraySpliterator.tryAdvance(Spliterators.java:1041)
at java.util.stream.IntPipeline.forEachWithCancel(IntPipeline.java:162)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:230)
at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:196)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.IntPipeline.anyMatch(IntPipeline.java:477)
at bench.Contains.lambdaArrayStream(Contains.java:23)
Not all of these method calls can be inlined. Furthermore, the JVM limits inlining to 9 levels by default, but here we see a deeper call stack. If we raise the limit with -XX:MaxInlineLevel=20, the score gets a bit better:
Benchmark Mode Cnt Score Error Units
Contains.lambdaArrayStream avgt 5 33,294 ± 0,367 ns/op (was 53,242)
Contains.naive avgt 5 5,822 ± 0,207 ns/op
Loop optimizations
A for loop over an array is a trivial counted loop, so the JVM can apply a wide range of loop optimizations here: loop peeling, loop unrolling, etc. This does not work for the while-style loop in the forEachWithCancel method, which is used to traverse an IntStream. The effect of loop optimizations can be measured by disabling them with -XX:LoopUnrollLimit=0 -XX:-UseLoopPredicate:
Benchmark Mode Cnt Score Error Units
Contains.lambdaArrayStream avgt 5 33,153 ± 0,559 ns/op
Contains.naive avgt 5 9,853 ± 0,150 ns/op (was 5,876)
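To illustrate the difference in loop shape, here is a simplified sketch (not the actual JDK code): the naive benchmark is a counted loop with a trip count known up front, while the stream traversal is driven by tryAdvance calls whose return value decides whether to continue.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.IntPredicate;

// Simplified sketch of the two traversal shapes (not the actual JDK code).
public class LoopShapes {
    // Counted loop: the trip count is a simple induction variable,
    // so the JIT can peel and unroll it.
    static boolean counted(int[] a, int val) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == val) return true;
        }
        return false;
    }

    // In the spirit of forEachWithCancel: each step is a tryAdvance
    // call, and the loop continues only while it returns true and no
    // match was found yet. This is not a counted loop, so unrolling
    // does not apply.
    static boolean cancellable(int[] a, IntPredicate p) {
        Spliterator.OfInt s = Spliterators.spliterator(a, 0);
        boolean[] found = {false};
        while (!found[0] && s.tryAdvance((int i) -> {
            if (p.test(i)) found[0] = true;
        })) { }
        return found[0];
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5};
        System.out.println(counted(values, 3));               // true
        System.out.println(cancellable(values, i -> i == 3)); // true
        System.out.println(cancellable(values, i -> i == 9)); // false
    }
}
```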
Conclusions
There is some overhead to constructing and traversing a stream, but it is completely understood and cannot be considered a bug. I would not call the overhead big (even 50 ns/op is not that much); however, in this particular example the overhead dominates because the workload is extremely tiny.