There are a lot of things going on here. Let's take them one at a time.
It seems you are using a single container per pod (though you could have many containers per pod). requests.memory and limits.memory are specific to the container; Kubernetes computes the limits and requests of a pod as the sum of those of all its containers. So think about it: you are saying that a pod shows 904.38Mi, but you show requests.memory and limits.memory, which are per container. That is why I assume you have one container per pod. This is a general intro that does not answer your question, but we will get there.
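As a concrete sketch of that summation (the names, images and numbers below are made up for illustration): a pod spec with two containers like this has an effective request of 128Mi + 256Mi = 384Mi and an effective limit of 256Mi + 512Mi = 768Mi:
apiVersion: v1
kind: Pod
metadata:
  name: two-containers          # hypothetical pod
spec:
  containers:
    - name: app                 # hypothetical container
      image: my-app:latest      # hypothetical image
      resources:
        requests:
          memory: "128Mi"
        limits:
          memory: "256Mi"
    - name: sidecar             # hypothetical container
      image: my-sidecar:latest  # hypothetical image
      resources:
        requests:
          memory: "256Mi"
        limits:
          memory: "512Mi"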
Then comes the fact that a pod is started by docker (driven, in turn, by kubectl), and docker reads requests.memory and limits.memory. To make this a bit simpler: what you set in limits.memory will be passed as docker -m. So, in your case, the total memory allowed for the docker process is 1.5GB. Remember that is the limit of the entire process, not just the heap. A java process is a lot more than the heap, which you specify with -Xms512m -Xmx1024m. So to answer your question:
How did Kubernetes arrive at 904.38Mi usage?
This is what the entire process is currently taking, not just the heap. From the very short log files that you posted, your application is just fine.
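If you want to see for yourself how the heap relates to the rest, here is a minimal sketch using the standard java.lang.management API (the class name MemoryBreakdown is mine); note that even the non-heap number shown here does not cover everything, since thread stacks, GC internals and other native allocations sit outside both:
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryBreakdown {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        // heap: the part that -Xms/-Xmx control
        MemoryUsage heap = memory.getHeapMemoryUsage();
        // non-heap: metaspace, code cache, etc.
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
        System.out.printf("heap used/committed:     %d / %d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20);
        System.out.printf("non-heap used/committed: %d / %d MB%n",
                nonHeap.getUsed() >> 20, nonHeap.getCommitted() >> 20);
    }
}
For the full native picture, you can also start the JVM with -XX:NativeMemoryTracking=summary and inspect it with jcmd <pid> VM.native_memory.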
EDIT
I actually did not have the kubernetes dashboard in my environment to test this specifically, so I had to install it to really understand what is going on. I had a hunch about most of these things, but to make sure, I ran some tests.
First things first: what does that number in the dashboard mean? It took a while to find/understand, but it is the actual resident memory of the process, which actually is a very good thing.
Any sane OS knows that when someone requests memory from it, they rarely need/use it all; as such, it hands the memory out in a lazy fashion. This is easy to prove in k8s. Suppose I have a jdk-13 JVM and start it with:
kubectl run jdk-13
--image=jdk-13
--image-pull-policy=Never
--limits "memory=100Mi"
--requests "memory=10Mi"
--command -- /bin/sh -c "while true; do sleep 5; done"
Notice requests.memory=10Mi and limits.memory=100Mi. Reading the answer from the beginning, you already know that this specific pod will be started with docker -m 100m... because limits.memory=100Mi. That is easy to prove; just sh into the pod:
kubectl exec -it jdk-13-b8d656977-rpzrg -- /bin/sh
and find out what the cgroup says:
# cat /sys/fs/cgroup/memory/memory.limit_in_bytes
104857600 // 100MB
perfect! So the pod's memory limit is 100 MB max, but what is the current memory utilization, that is, how much resident memory is taken?
kubectl top pod
NAME CPU(cores) MEMORY(bytes)
jdk-13-b8d656977-rpzrg 1m 4Mi
OK, so the current memory utilization is only 4MB.
You can "make sure" that this is indeed accurate, if you do:
kubectl exec -it jdk-13-b8d656977-rpzrg -- /bin/sh
and inside that pod issue:
top -o %MEM
and notice that the RES memory is on par with the one reported via the dashboard or kubectl top pod.
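If you prefer to read that number from inside the JVM instead of from top, here is a minimal Linux-only sketch (the class name Rss is mine; it simply parses /proc/self/status):
import java.nio.file.Files;
import java.nio.file.Path;

public class Rss {
    public static void main(String[] args) throws Exception {
        // On Linux, /proc/self/status contains a "VmRSS:" line with the
        // resident set size of the current process, in kB.
        for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
            if (line.startsWith("VmRSS:")) {
                System.out.println(line.trim());
            }
        }
    }
}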
And now let's do a test. Suppose I have this very simple code in that pod:
// run this with: java "-Xlog:gc*=debug" -Xmx100m -Xms20m HeapTest
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

public class HeapTest {
    public static void main(String[] args) throws Exception {
        // allocate 1 MB every 3 seconds, for around 2 minutes
        for (int i = 0; i < 40; ++i) {
            byte[] b = new byte[1024 * 1024 * 1];
            b[i] = 1; // touch the array so the allocation is not optimized away
            System.out.println(Arrays.hashCode(b));
            LockSupport.parkNanos(TimeUnit.of(ChronoUnit.SECONDS).toNanos(3));
        }
    }
}
I allocate 1MB every 3 seconds for around 2 minutes. When I look at this process in the dashboard, I do see that at some point in time the memory grows. After the program ends, the dashboard reports the drop in memory. Good! That means the memory is given back and the RSS drops. This is how it looks in the dashboard:
Now let's change this code just a bit. Let's add some GC in there and never finish this process (you know, like typical spring-boot applications do):
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

public class HeapTest {
    public static void main(String[] args) throws Exception {
        // allocate 1 MB every 3 seconds
        for (int i = 0; i < 40; ++i) {
            byte[] b = new byte[1024 * 1024 * 1];
            b[i] = 1; // touch the array so the allocation is not optimized away
            System.out.println(Arrays.hashCode(b));
            LockSupport.parkNanos(TimeUnit.of(ChronoUnit.SECONDS).toNanos(3));
        }
        // nudge the GC a few times so the heap is actually collected
        for (int i = 0; i < 10; i++) {
            Thread.sleep(500);
            System.gc();
        }
        // never exit, like a typical long-running service
        while (true) {
            try {
                Thread.sleep(TimeUnit.of(ChronoUnit.SECONDS).toMillis(5));
                Thread.onSpinWait();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
I run this with:
java "-Xlog:heap*=debug"
"-Xlog:gc*=debug"
"-Xlog:ergo*=debug"
-Xmx100m
-Xms20m
HeapTest
On inspecting the logs (just like in your example), I do see that the heap is collected just fine. But when I look at the dashboard, the memory does not drop (unlike the previous example).
Once G1GC takes memory, it is not very eager to give it back to the OS. It can do that in rare cases (after a full GC, for example), or you can instruct it to do so.
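For the record: since JDK 12 and JEP 346 (which is also where the quote at the end of this answer comes from), G1 can be instructed to trigger periodic GC cycles when the application is idle and return unused committed memory to the OS. A hedged sketch of such a run, reusing the HeapTest above (the 30-second interval is an arbitrary value I picked; verify the flag on your JDK version):
java "-Xlog:gc*=debug"
-XX:G1PeriodicGCInterval=30000
-Xmx100m
-Xms20m
HeapTest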
Both ways are rather painful; instead, there are GC algorithms that are smarter (and in general a lot better). My personal love goes to Shenandoah; let's see what it does. If I slightly change the code (so that I can prove my point better):
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

public class HeapTest {
    public static void main(String[] args) throws Exception {
        // allocate 1/4 MB every 100 ms
        for (int i = 0; i < 6000; ++i) {
            byte[] b = new byte[1024 * 256];
            b[i] = 1; // touch the array so the allocation is not optimized away
            System.out.println(Arrays.hashCode(b));
            LockSupport.parkNanos(TimeUnit.of(ChronoUnit.MILLIS).toNanos(100));
        }
        // never exit, so we can watch the RSS after the allocations stop
        while (true) {
            try {
                Thread.sleep(TimeUnit.of(ChronoUnit.SECONDS).toMillis(5));
                Thread.onSpinWait();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
And run it with:
java "-Xlog:gc*=debug"
"-Xlog:ergo*=debug"
"-Xlog:heap*=debug"
-XX:+UnlockExperimentalVMOptions
-XX:+UseShenandoahGC
-XX:+ShenandoahUncommit
-XX:ShenandoahGCHeuristics=compact
-Xmx1g
-Xms1m
HeapTest
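A quick note on those flags: -XX:+ShenandoahUncommit is what allows Shenandoah to return unused committed memory to the OS, and -XX:ShenandoahGCHeuristics=compact tells it to optimize for a small footprint, collecting and uncommitting aggressively. (In recent JDKs Shenandoah is no longer experimental, so -XX:+UnlockExperimentalVMOptions should not be needed there; treat the exact heuristics behavior as something to verify on your JDK version.)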
Here is what you are going to see:
And you should, for a fact, care about this:
This behavior is particularly disadvantageous in container environments where resources are paid by use. Even during phases where the VM only uses a fraction of its assigned memory resources due to inactivity, G1 will retain all of the Java heap. This results in customers paying for all resources all the time, and cloud providers not being able to fully utilize their hardware.
P.S. I would also add the fact that other pods suffer too, because one pod decided to take as much memory as it could at a particular spike and never give it back.