Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
369 views
in Technique[技术] by (71.8m points)

garbage collection - How to get Java stacks when JVM can't reach a safepoint

We recently had a situation where one of our production JVMs would randomly freeze. The Java process was burning CPU, but all visible activity would cease: no log output, nothing written to the GC log, no response to any network request, etc. The process would persist in this state until restarted.

It turned out that the org.mozilla.javascript.DToA class, when invoked on certain inputs, will get confused and call BigInteger.pow with enormous values (e.g. 5^2147483647), which triggers the JVM freeze. My guess is that some large loop, perhaps in java.math.BigInteger.multiplyToLen, was JIT'ed without a safepoint check inside the loop. The next time the JVM needed to pause for garbage collection, it would freeze, because the thread running the BigInteger code wouldn't be reaching a safepoint for a very long time.

My question: in the future, how can I diagnose a safepoint problem like this? kill -3 didn't produce any output; I presume it relies on safepoints to generate accurate stacks. Is there any production-safe tool which can extract stacks from a running JVM without waiting for a safepoint? (In this case, I got lucky and managed to grab a set of stack traces just after BigInteger.pow was invoked, but before it worked its way up to a sufficiently large input to completely wedge the JVM. Without that stroke of luck, I'm not sure how we would ever have diagnosed the problem.)

Edit: the following code illustrates the problem.

// Spawn a background thread to compute an enormous number.
new Thread(){ @Override public void run() {
  try {
    Thread.sleep(5000);
  } catch (InterruptedException ex) {
  }
  BigInteger.valueOf(5).pow(100000000);
}}.start();

// Loop, allocating memory and periodically logging progress, so illustrate GC pause times.
byte[] b;
for (int outer = 0; ; outer++) {
  long startMs = System.currentTimeMillis();
  for (int inner = 0; inner < 100000; inner++) {
    b = new byte[1000];
  }

  System.out.println("Iteration " + outer + " took " + (System.currentTimeMillis() - startMs) + " ms");
}

This launches a background thread which waits 5 seconds and then starts an enormous BigInteger computation. In the foreground, it then repeatedly allocates a series of 100,000 1K blocks, logging the elapsed time for each 100MB series. During the 5 second period, each 100MB series runs in about 20 milliseconds on my MacBook Pro. Once the BigInteger computation begins, we begin to see long pauses interleaved. In one test, the pauses were successively 175ms, 997ms, 2927ms, 4222ms, and 22617ms (at which point I aborted the test). This is consistent with BigInteger.pow() invoking a series of ever-larger multiply operations, each taking successively longer to reach a safepoint.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Your problem interested me very much. You were right about JIT. First I tried to play with GC types, but this did not have any effect. Then I tried to disable JIT and everything worked fine:

java -Djava.compiler=NONE Tests

Then printed out JIT compilations:

java -XX:+PrintCompilation Tests

And noticed that problem starts after some compilations in BigInteger class, I tried to exclude methods one by one from compilation and finally found the cause:

java -XX:CompileCommand=exclude,java/math/BigInteger,multiplyToLen -XX:+PrintCompilation Tests

For large arrays this method could work long, and problem might really be in safepoints. For some reason they are not inserted, but should be even in compiled code. Looks like a bug. The next step should be to analyze assembly code, I did not do it yet.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...