Memory Management in Java: Garbage Collection Tuning and Optimization
Do not miss this exclusive book on Binary Tree Problems. Get it now for free.
Tuning garbage collection is an important task in designing scalable applications in Java. This is similar to general performance-tuning activity.
Working with the 200 GC-related JVM parameters to optimize the application code will make little difference. Instead, following a simple process will guarantee progress. General steps in the process are:
- State your performance goals
- Run tests
- Measure the results
- Compare the results with the goals
- If goals are not met, make a change and go back to running tests
So, as the first step we need to set clear performance goals in regards of Garbage Collection. The goals fall into three categories, which are common to all performance monitoring and management topics:
- Latency
- Throughput
- Capacity
After explaining the concepts in general, we will demonstrate how to apply these goals in the context of Garbage Collection. If you are already familiar with the concepts of latency, throughput and capacity, you may decide to skip the next section.
General Concepts
Let us begin with an example from manufacturing by observing an assembly line in a factory. The line is assembling Tesla Cars from ready-made components. Tesla Model X are built on this line in a sequential manner. Monitoring the line in action, we measure that it takes four hours to complete a car from the moment the frame enters the line until the assembled car leaves the line at the other end.
Continuing our observations we can also see that one car is assembled after each minute, 24 hours a day, every day. Simplifying the example and ignoring maintenance windows, we can calculate that in any given hour such an assembly line assembles 60 Tesla Cars.
Equipped with these two measurements, we now possess crucial information about the current performance of the assembly line in regards of latency and throughput:
- Latency of the assembly line: 4 hours
- Throughput of the assembly line: 60 cars/hour
Notice that latency is measured in time units. Throughput of a system is measured in completed operations per time unit.
The demand for the Tesla Cars has been steady for a while and the assembly line has kept producing cars with the latency of four hours and the throughput of 60 cars/hour for months.
Now the demand for the bikes suddenly doubles. Instead of the usual 60 * 24 = 1,440 cars/day the customers demand twice as many. The performance of the factory is no longer satisfactory and something needs to be done.
Elon Musk correctly concludes that the latency of the system is not a concern – instead he should focus on the total number of cars produced per day. Coming to this conclusion and assuming he is well funded through the sales of PayPal, Elon would immediately take the necessary steps to improve throughput by adding capacity.
As a result we would now be observing not one but two identical assembly lines in the same factory. Both of these assembly lines would be assembling the very same car every minute of every day. By doing this, Tesla has doubled the number of bikes produced per day. Instead of the 1,440 cars the factory is now capable of shipping 2,880 cars each day. It is important to note that we have not reduced the time to complete an individual car as it still takes four hours to complete a car from start to finish.
In the example above a performance optimization task was carried out, coincidentally impacting both throughput and capacity. As in any good example we started by measuring the system’s current performance, then set a new target and optimized the system only in the aspects required to meet the target.
In this example an important decision was made – the focus was on increasing throughput, not on reducing latency. While increasing the throughput, we also needed to increase the capacity of the system. Instead of a single assembly line we now needed two assembly lines to produce the required quantity. So in this case the added throughput was not free, the solution needed to be scaled out in order to meet the increased throughput requirement.
An important alternative should also be considered for the performance problem at hand. The seemingly non-related latency of the system actually hides a different solution to the problem. If the latency of the assembly line could have been reduced from 1 minute to 30 seconds, the very same increase of throughput would suddenly be possible without any additional capacity.
Whether or not reducing latency was possible or economical in this case is not relevant. What is important is a concept very similar to software engineering – you can almost always choose between two solutions to a performance problem. You can either throw more hardware towards the problem or spend time improving the poorly performing code.
Latency
Latency goals for the GC have to be derived from generic latency requirements. Generic latency requirements are typically expressed in a form similar to the following:
- All user transactions must respond in less than 10 seconds
- 90% of the invoice payments must be carried out in under 3 seconds
- Recommended products must be rendered to a purchase screen in less than 100 ms
When facing performance goals similar to the above, we would need to make sure that the duration of GC pauses during the transaction does not contribute too much to violating the requirements. “Too much” is application-specific and needs to take into account other factors contributing to latency including round-trips to external data sources, lock contention issues and other safe points among these.
Let us assume our performance requirements state that 90% of the transactions to the application need to complete under 1,000 ms and no transaction can exceed 10,000 ms. Out of those generic latency requirements let us again assume that GC pauses cannot contribute more than 10%. From this, we can conclude that 90% of GC pauses have to complete under 100 ms, and no GC pause can exceed 1,000 ms. For simplicity’s sake let us ignore in this example multiple pauses that can occur during the same transaction.
Having formalized the requirement, the next step is to measure pause durations. There are many tools for the job. Let us use GC logs, namely for the duration of GC pauses. The information required is present in different log snippets so let us take a look which parts of date/time data are actually relevant, using the following example:
2015-06-04T13:34:16.974-0200: 2.578: [Full GC (Ergonomics) [PSYoungGen: 93677K->70109K(254976K)] [ParOldGen: 499597K->511230K(761856K)] 593275K->581339K(1016832K), [Metaspace: 2936K->2936K(1056768K)], 0.0713174 secs] [Times: user=0.21 sys=0.02, real=0.07 secs
The event stopped the application threads for 0.0713174 seconds. Even though it took 210 ms of CPU times on multiple cores, the important number for us to measure is the total stop time for application threads, which in this case, where parallel GC was used on a multi-core machine, is equal to a bit more than 70 ms. This specific GC pause is thus well under the required 100 ms threshold and fulfils both requirements.
Extracting information similar to the example above from all GC pauses, we can aggregate the numbers and see whether or not we are violating the set requirements for any of the pause events triggered.
Throughput
Throughput requirements are different from latency requirements. The only similarity that the throughput requirements share with latency is the fact that again, these requirements need to be derived from generic throughput requirements. Generic requirements for throughput can be similar to the following:
- The solution must be able to process 1,000,000 invoices/day
- The solution must support 1,000 authenticated users each invoking one of the functions A, B or C every five to ten seconds
- Weekly statistics for all customers have to be composed in no more than six hours each Sunday night between 12 PM and 6 AM
So, instead of setting requirements for a single operation, the requirements for throughput specify how many operations the system must process in a given time unit.
The GC tuning part now requires determining the total time that can be spent on GC during the time measured. How much is tolerable for the particular system is application-specific, but as a rule of thumb, anything over 10% would look suspicious.
Let us now assume that the requirement at hand foresees that the system processes 1,000 transactions per minute. Let us also assume that the total duration of GC pauses during any minute cannot exceed six seconds (or 10%) of this time.
Having formalized the requirements, the next step would be to harvest the information we need. The source to be used in the example is GC logs, from which we would get information similar to the following:
2015-06-04T13:34:16.974-0200: 2.578: [Full GC (Ergonomics) [PSYoungGen: 93677K 70109K(254976K)] [ParOldGen: 499597K->511230K(761856K)] 593275K->581339K(1016832K), [Metaspace: 2936K->2936K(1056768K)], 0.0713174 secs] [Times: user=0.21 sys=0.02, real=0.07 secs
This time we are interested in user and system times instead of real time. In this case we should focus on 23 milliseconds (21 + 2 ms in user and system times) during which the particular GC pause kept CPUs busy. Even more important is the fact that the system was running on a multi-core machine, translating to the actual stop-the-world pause of 0.0713174 seconds, which is the number to be used in the following calculations.
Extracting the information similar to the above from the GC logs across the test period, all that is left to be done is to verify the total duration of the stop-the-world pauses during each minute. In case the total duration of the pauses does not exceed 6,000ms or six seconds in any of these one-minute periods, we have fulfilled our requirement.
Capacity
Capacity requirements put additional constraints on the environment where the throughput and latency goals can be met. These requirements might be expressed either in terms of computing resources or in cold hard cash. The ways in which such requirements can be described can take the following form:
- The system must be deployed on Android devices with less than 512 MB of memory
- The system must be deployed on Amazon EC2 The maximum required instance size must not exceed the configuration c3.xlarge (8 G, 4 cores)
- The monthly invoice from Amazon EC2 for running the system must not exceed $12,000
Thus, capacity has to be taken into account when fulfilling the latency and throughput requirements. With unlimited computing power, any kind of latency and throughput targets could be met, yet in the real world the budget and other constraints have a tendency to set limits on the resources one can use.
Example
Consider the following code snippet in Java:
import java.util.*;
public class Producer implements Runnable {
private static ScheduledExecutorService executorService = Executors.newScheduledThreadPool(2);
private Deque<byte[]> deque;
private int objectSize;
private int queueSize;
public Producer(int objectSize, int ttl) {
this.deque = new ArrayDeque<byte[]>();
this.objectSize = objectSize;
this.queueSize = ttl * 1000;
}
@Override
public void run() {
for (int i = 0; i < 100; i++) {
deque.add(new byte[objectSize]);
if (deque.size() > queueSize)
{
deque.poll();
}
}
}
public static void main(String[] args) throws InterruptedException {
executorService.scheduleAtFixedRate(new Producer(200 * 1024 * 1024 / 1000, 5), 0, 100, TimeUnit.MILLISECONDS);
executorService.scheduleAtFixedRate(new Producer(50 * 1024 * 1024 / 1000, 120), 0, 100, TimeUnit.MILLISECONDS);
TimeUnit.MINUTES.sleep(10);
executorService.shutdownNow();
}
}
The code submits two jobs to run every 100 ms. Each job emulates objects with a specific lifespan: it creates objects, lets them leave for a predetermined amount of time and then forgets about them, allowing GC to reclaim the memory.
When running the example with GC logging turned on with the following parameters:
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
we immediately see the impact of GC in the log files, similarly to the following:
2015-06-04T13:34:16.119-0200:
1 723: GC (Allocation Failure) PSYoungGen:
2 114016K 73191K(234496K) 421540K-421269K(745984K)
3 0.0858176 secs Times: user=0.04 sys=0.06, real=0.09 secs
2015-06-04T13:34:16.738-0200:
1 2.342: GC (Allocation Failure)
2 PSYoungGen: 234462K-93677K(254976K) 582540K-593275K(766464K)
3 0.2357086 secs Times: user=0.11 sys=0.14, real=0.24 secs
2015-06-04T13:34:16.974-0200:
1 2.578: Full GC (Ergonomics)
2 PSYoungGen: 93677K-70109K(254976K)
3 ParOldGen: 499597K-511230K(761856K) 593275K-581339K(1016832K)
4 Metaspace: 2936K-2936K(1056768K)
5 0.0713174 secs Times: user=0.21 sys=0.02, real=0.07 secs
Based on the information in the log we can start improving the situation with three different goals in mind:
- Making sure the worst-case GC pause does not exceed a predetermined threshold
- Making sure the total time during which application threads are stopped does not exceed a predetermined threshold
- Reducing infrastructure costs while making sure we can still achieve reasonable latency and/or throughput targets
For this, the code above was run for 10 minutes on three different configurations providing three very different results summarized in the following table:
Heap | GC Algorithm | Useful work | Longest pause |
---|---|---|---|
-Xmx12g | -XX:+UseConcMarkSweepGC | 89.8% | 560 ms |
-Xmx12g | -XX:+UseParallelGC | 91.5% | 1,104 ms |
-Xmx8g | -XX:+UseConcMarkSweepGC | 66.3% | 1,610 ms |
The experiment ran the same code with different GC algorithms and different heap sizes to measure the duration of Garbage Collection pauses with regards to latency and throughput. Details of the experiments and an interpretation of the results are given in the following chapters.
Tuning for Latency
Let us assume we have a requirement stating that all jobs must be processed in under 1,000 ms. Knowing that the actual job processing takes just 100 ms we can simplify and deduct the latency requirement for individual GC pauses. Our requirement now states that no GC pause can stop the application threads for longer than 900 ms. Answering this question is easy, one would just need to parse the GC log files and find the maximum pause time for an individual GC pause.
We can see that there is one configuration that already matches this requirement. Running the code with:
java -Xmx12g -XX:+UseConcMarkSweepGC Producer
results in a maximum GC pause of 560 ms, which nicely passes the 900 ms threshold set for satisfying the latency requirement. If neither the throughput nor the capacity requirements are violated, we can conclude that we have fulfilled our GC tuning task and can finish the tuning exercise.
Tuning for Throughput
Let us assume that we have a throughput goal to process 13,000,000 jobs/hour.
Running the configuration as:
java -Xmx12g -XX:+UseParallelGC Producer
We can see that the CPUs are blocked by GC for 8.5% of the time, leaving 91.5% of the computing power for useful work. For simplicity’s sake we will ignore other safe points in the example. Now we have to take into account that:
- One job is processed in 100 ms by a single core
- Thus, in one minute, 60,000 jobs could be processed by one core
- In one hour, a single core could thus process 3.6 M jobs
- We have four cores available, which could thus process 4 x 3.6 M = 14.4 M jobs in an hour
With this amount of theoretical processing power we can make a simple calculation and conclude that during one hour we can in reality process 91.5% of the 14.4 M theoretical maximum resulting in 13,176,000 processed jobs/hour, fulfilling our requirement.
It is important to note that if we simultaneously needed to fulfill the latency requirements set in the previous section, we would be in trouble, as the worst-case latency for this case is close to two times of the previous configuration. This time the longest GC pause on record was blocking the application threads for 1,104 ms.
Tuning for Capacity
Let us assume we have to deploy our solution to the commodity-class hardware with up to four cores and 10 G RAM available. From this we can derive our capacity requirement that the maximum heap space for the application cannot exceed 8 GB.
The application is able to run on this configuration as
java -Xmx8g -XX:+UseConcMarkSweepGC Producer
but both the latency and especially throughput numbers fall drastically:
- GC now blocks CPUs from doing useful work a lot more, as this configuration only leaves 66.3% of the CPUs for useful work. As a result, this configuration would drop the throughput from the best-case-scenario of 13,176,000 jobs/hour to a meager 9,547,200 jobs/hour
- Instead of 560 ms we are now facing 1,610 ms of added latency in the worst case
Walking through the three dimensions it is indeed obvious that you cannot just optimize for “performance” but instead need to think in three different dimensions, measuring and tuning both latency and throughput, and taking the capacity constraints into account.
Sign up for FREE 3 months of Amazon Music. YOU MUST NOT MISS.