Tech Tip — Fast Java (2012)

Feb 2018. The bar for Java code can be set very high.
The programmer has control over the aspects that impact performance more than any factor he may consider out of his control (compiled versus interpreted language, etc.). Garbage collector settings (also addressed) and the hardware configuration, while also critical, are secondary to improvements in the code itself. Code considerations are addressed in Chapter 1, followed by optimal garbage collector settings and a strategy for eliminating garbage collector pauses altogether.


Chapter 1 – The Code


This chapter covers improvements that are effective regardless of what garbage collector configuration is used, regardless of hardware, and regardless of whether the application is a client or server. The goal is to identify and remove problems rooted in the code itself and to eliminate garbage collector abuse.

1.1 Avoid overriding finalize()

Work very hard to avoid overriding finalize(). If finalize() is required, it should be for a class with a low instantiation count, the lower the better. Expired objects with an overridden finalize() cannot be garbage collected in the first pass; the collector must first queue them for finalization, so they survive at least one additional collection cycle. Objects persisting in memory risk being promoted to long term memory, where cleaning them is more expensive.

1.2 Engineer calls to System.gc()

The general rule is never to call System.gc(), a request for a full garbage collection. The exception to that rule is quite useful: make calls to System.gc() when a pause can be tolerated and a cleaning would be beneficial. For example, if your application can tolerate a small pause after an initialization phase and before becoming “available” to the outside world, consider calling System.gc() at that time to give your application a clean start. Additionally, if your application is not restarted daily, consider the next best thing: scheduling a call to System.gc() during a known slow time of the day or during a low impact moment that can be determined programmatically.
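A minimal sketch of such an engineered call (the class name and once-daily schedule are illustrative; the caller supplies the delay to its own quiet window):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class ScheduledFullGc
{
   // Schedules one engineered full collection per day, starting at the
   // next known quiet window supplied by the caller.
   public static void schedule(long millisUntilQuietTime)
   {
      ScheduledExecutorService scheduler =
         Executors.newSingleThreadScheduledExecutor();
      scheduler.scheduleAtFixedRate(new Runnable()
      {
         public void run()
         {
            System.gc();   // a planned pause at a tolerable moment
         }
      }, millisUntilQuietTime, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
   }
}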

Calls to System.gc() should be very carefully planned and be few and far between. If a third party library you are using is making interruptive calls to System.gc() over which you have no control and you cannot work around, consider finding another third party library.

1.3 Cache and pool

The popular recommendation is not to use caching or pooling—let the garbage collector take care of the objects—yet once again the exception to the rule is extremely useful.

For the purposes of this discussion, cached objects are not returned after use, while pooled objects are. I prefer that simple distinction to one I have seen based on immutability. While immutability can be part of the discussion, it can play a role in both caches and pools, leading to confusion… not that it matters tremendously. Again, caching is for objects that do not have to be returned; pooled objects do need to be returned to the pool.

Following are several scenarios where caching and pooling are beneficial.

Cache common immutables. Even though the JVM is very good at cleaning short-lived objects out of memory, a caching strategy for often-created objects can help lessen the garbage collector load and push a finite set of such objects to long term memory just once.

The Java layers themselves use caching. Consider the code for Integer.valueOf(), condensed here (in the JDK source the cache array lives in a private IntegerCache helper class):
   			
static final Integer cache[] = new Integer[-(-128) + 127 + 1];

static
{
   // populate the cache once, at class load time
   for (int i = 0; i < cache.length; i++)
      cache[i] = new Integer(i - 128);
}

public static Integer valueOf(int i)
{
   final int offset = 128;
   if (i >= -128 && i <= 127)     // the cached range
      return cache[i + offset];   // no allocation for common values
   return new Integer(i);
}
			


A similar caching scheme can also be found in several of the other numeric classes. In fact, the entire range of possible Byte values is cached. Analyze your code and identify areas where caching your own objects might be useful. Also consider expanding Integer caching to a different range appropriate for commonly used values in your application.
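A minimal sketch of such an expanded cache follows; the class name and the 0 to 9999 range are hypothetical, so choose a range matching your own hot values. Note that HotSpot also accepts -XX:AutoBoxCacheMax=<size> to raise the upper bound of the built-in Integer cache.

public final class WideIntegerCache
{
   private static final int LOW = 0, HIGH = 9999;   // hypothetical hot range
   private static final Integer[] CACHE = new Integer[HIGH - LOW + 1];

   static
   {
      // populate once; these instances are promoted to long term memory just once
      for (int i = 0; i < CACHE.length; i++)
         CACHE[i] = Integer.valueOf(LOW + i);
   }

   public static Integer valueOf(int i)
   {
      if (i >= LOW && i <= HIGH)
         return CACHE[i - LOW];    // cached: no new allocation
      return Integer.valueOf(i);   // outside the range: normal path
   }
}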

Pool instantiation-expensive objects. Often-created objects with a high instantiation cost (in time or memory) are also good candidates for pooling. They can be instantiated just once, retrieved from the pool when needed, and returned to the pool. Database connections, java.util.zip.Inflater, and java.util.zip.Deflater are just a few good examples.
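A minimal pool sketch using Deflater as the expensive resource (the class name and the blocking borrow policy are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.zip.Deflater;

public final class DeflaterPool
{
   private final ArrayBlockingQueue<Deflater> pool;

   public DeflaterPool(int capacity)
   {
      // pay the instantiation cost once, up front
      pool = new ArrayBlockingQueue<Deflater>(capacity);
      for (int i = 0; i < capacity; i++)
         pool.add(new Deflater());
   }

   public Deflater borrow() throws InterruptedException
   {
      return pool.take();   // blocks if every instance is in use
   }

   public void giveBack(Deflater d)
   {
      d.reset();            // make it safe for the next borrower
      pool.offer(d);
   }
}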

Pool direct ByteBuffer. There may be other external factors which a pool can address. For example, pooling can remedy problems specific to direct ByteBuffer usage.

ByteBuffer.allocateDirect() will allocate memory from native space outside the JVM heap memory. The native memory limit can be managed by setting the -XX:MaxDirectMemorySize parameter; otherwise the limit equals the heap maximum, the -Xmx parameter. Severe latency can occur if an attempt is made to allocate more than the limit: allocateDirect() will issue a call to System.gc() in an attempt to free up resources! Additionally, even if the programmer has been careful to set aside enough native memory and set the parameters appropriately, the availability of that native memory is still subject to determination by the JVM garbage collector. In other words, the native memory may be freed eventually, but it may not be freed in a timely (enough) fashion, causing unacceptable runtime latencies.

The answer is to use a ByteBuffer pool which manages the native memory, guaranteeing that the limit is never breached and taking an unscheduled System.gc() out of the equation. Additionally, the pooling can result in a very efficient use of memory resources.
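A minimal sketch of such a pool (names and sizing are illustrative); all native memory is allocated up front, so the configured limit can never be breached and allocateDirect() is never triggered at runtime:

import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

public final class DirectBufferPool
{
   private final ArrayBlockingQueue<ByteBuffer> pool;

   public DirectBufferPool(int buffers, int bufferSize)
   {
      pool = new ArrayBlockingQueue<ByteBuffer>(buffers);
      for (int i = 0; i < buffers; i++)
         pool.add(ByteBuffer.allocateDirect(bufferSize));   // native memory, allocated once
   }

   public ByteBuffer borrow() throws InterruptedException
   {
      return pool.take();   // blocks rather than allocating beyond the limit
   }

   public void giveBack(ByteBuffer b)
   {
      b.clear();            // reset position and limit for the next borrower
      pool.offer(b);
   }
}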

With a properly functioning direct ByteBuffer pool, the programmer has a rare opportunity to utilize native memory directly within Java and without resorting to JNI. The known downside, of course, is that the objects must be returned after use since their lifespan is now determinate. (For this reason it is wise to avoid letting the pooled objects escape very far down the processing stream and coding logic; try to keep the borrow/return usage somewhat self-contained.) But the upside can be tremendous. Consider a TCP communication layer which immediately transfers the data directly to native memory using ByteBuffer-based reads. The raw data no longer resides in the garbage collection area, and that memory is now also free for other objects. It reduces the load on the garbage collector and eliminates the possibility that the first-order data (at least) is ever promoted to long term memory. Talk about coding for high throughput and low latency. I have implemented a TCP communication layer that does just that, and it performs extremely well.

Use ThreadLocal for thread-aligned objects. Consider using ThreadLocal if the usage of a singular object also happens to align with one thread at a time and there are a lot of these objects being created otherwise. Of course ThreadLocal can be used just for its thread alignment benefits as well, regardless of the object count. ThreadLocal internally caches one instance of the object per calling thread. SimpleDateFormat, a class which is not inherently thread-safe, is one good candidate. Using ThreadLocal for SimpleDateFormat makes sense because it simultaneously reduces the number of instances and also encourages thread-safe usage (provided the programmer does not let the instance escape to another thread). Multi-threaded database routines can also make good use of ThreadLocal for the connection to the database, for example.
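A minimal sketch (the class name and date pattern are illustrative):

import java.text.SimpleDateFormat;
import java.util.Date;

public final class DateFormats
{
   // one SimpleDateFormat per calling thread, created lazily
   private static final ThreadLocal<SimpleDateFormat> FORMAT =
      new ThreadLocal<SimpleDateFormat>()
      {
         @Override
         protected SimpleDateFormat initialValue()
         {
            return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
         }
      };

   public static String format(Date d)
   {
      // thread-safe: each thread formats with its own instance
      return FORMAT.get().format(d);
   }
}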

Cache objects outliving short term memory. Objects outliving the short term cleaning phase may get promoted to long term memory. If there are a lot of them and many of them are identical, it is better to utilize caching to have just one instance (of a given identical object) promoted to long term memory instead of many. Even though these are free for destruction later, if they outlive their stay in short term and are promoted to long term memory, the cost for freeing them is greater—they are no longer within reach of the efficient multi-threaded young generation garbage collector.

To illustrate, consider the scenario where a user may subscribe to a particular stock symbol for market data, for example. There may be hundreds of thousands of symbols and thousands of users. The symbol object’s longevity is subject to the length of time the user maintains the subscription—these objects will most likely get promoted to long term memory. On the server side (because more memory is available) it is often feasible to make such objects immutable and cache them. The cache can be populated either up front during an initialization phase (preferable, if possible) or dynamically as the need for new objects arises. Symbol subscriptions, in this example, would reference the cached symbol object if present, instead of creating a new one each time.
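A minimal sketch of the dynamic variant (the Symbol class is a hypothetical stand-in for your own immutable type):

import java.util.concurrent.ConcurrentHashMap;

public final class SymbolCache
{
   private static final ConcurrentHashMap<String, Symbol> CACHE =
      new ConcurrentHashMap<String, Symbol>();

   // every subscription to a given ticker shares one instance
   // instead of creating thousands
   public static Symbol get(String ticker)
   {
      Symbol s = CACHE.get(ticker);
      if (s == null)
      {
         Symbol fresh = new Symbol(ticker);
         s = CACHE.putIfAbsent(ticker, fresh);
         if (s == null)
            s = fresh;   // we won the race; our instance is now the cached one
      }
      return s;
   }

   public static final class Symbol   // hypothetical immutable
   {
      public final String ticker;
      Symbol(String ticker) { this.ticker = ticker; }
   }
}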

1.4 Choose natives over objects

Of course objects have to be used, and used with near reckless abandon, but be aware of the cost associated with creating an Integer instead of an int in a high throughput path. Also understand the concept of auto-boxing and program so that it doesn’t occur in the path of high-volume data. Processing large volumes of data may not go well if an int passed into a method gets unnecessarily promoted to an Integer, or a double to a Double… 100,000 times per second.
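A minimal sketch of the trap (the boxed accumulator is illustrative):

public final class BoxingDemo
{
   static Long total = 0L;   // boxed accumulator: a mistake in a hot path

   static void sumBoxed(int[] data)
   {
      for (int v : data)
         total += v;          // unbox, add, re-box: one new Long per element
   }

   static long sumPrimitive(int[] data)
   {
      long t = 0;
      for (int v : data)
         t += v;              // pure primitive arithmetic, no allocation
      return t;
   }
}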

1.5 Stop collection abuse

Selecting the right collection is important. High throughput programs can have collections handling hundreds of thousands of objects. Most collections carry internal overhead per stored object which facilitates the collection mechanism but also impacts memory. (ArrayList is one that does not have per-stored-object overhead, but has a resizing impact you should be aware of.) Additionally, the costs associated with additions, deletions, iteration, and resizing vary from one implementation to the next. The costs may well go unnoticed in an undemanding application but be show stoppers in a high throughput one. On-the-order-of costs are generally documented for the standard collections and implementations.

For example, understand (verifying through profiling, if necessary) the impact on your program between using a LinkedList, an ArrayList, or an ArrayDeque. There could be big differences in performance in a high throughput situation, depending on what you are doing to the collection. It may be desirable to recode so that one collection may be used over another. It is possible the way that a collection is being used in your application introduces latency or an abundance of avoidable internal container objects.

One useful strategy for critical areas of code is to eliminate the internal overhead of collections altogether and use a native array instead. This back-to-basics approach requires indexed access, which is not always possible. However, the effort spent figuring out how to represent a very large collection of objects with a fixed-size array can have a great payout. Creativity may be required to find a suitable indexing scheme if one isn’t readily apparent. A native array gives another benefit in iteration: an iterator against a typical collection creates per-iterated-object overhead, on top of the per object overhead that may already exist in the collection itself, while an array can be iterated by accessing its elements directly, incurring no per object overhead at all.

For thread safe access into a fixed size array, consider AtomicReferenceArray. This class can help not only with the per-object overhead associated with collections, but can also take care of concurrency if the usage contract fits.
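A minimal sketch (the Session type and the slot-per-id indexing scheme are illustrative):

import java.util.concurrent.atomic.AtomicReferenceArray;

public final class SessionTable
{
   private final AtomicReferenceArray<Session> slots;

   public SessionTable(int capacity)
   {
      slots = new AtomicReferenceArray<Session>(capacity);
   }

   public boolean register(int id, Session s)
   {
      return slots.compareAndSet(id, null, s);   // succeeds only if the slot was empty
   }

   public Session get(int id)
   {
      return slots.get(id);   // volatile-strength read; always sees the latest value
   }

   public static final class Session { /* illustrative placeholder */ }
}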

1.6 Use threads—use the CPUs

I met with a manager whose team was tasked with writing Java server programs supporting thousands of simultaneous users, processing large volumes of data. He told me his team liked to avoid using threads. Needless to say, their software was having problems scaling to meet demand. I also found out they were using mere dual core server machines. I then realized that 2 cores was probably just about right for them—their programs were taking advantage of all of about one CPU because of their programming technique.

Java threads make amazingly good use of the CPU resources. Threads will take your program to the next level in throughput performance. It really should go without saying, but I say on.

The more CPUs available, the more horsepower you have at your disposal. And don’t worry about limiting your program to a single thread per CPU, or anywhere even close to that. What happens if you have hundreds of threads on a dual quad core machine? Surprisingly fantastic throughput is what can happen as long as the programmer has eliminated the numerous roadblocks and minimized the lock contention (next section) that he has most assuredly introduced himself.

Use hardware with multiple cores (as of this writing, even handhelds will be sporting quad cores within the year) and then write programs to take advantage of those cores!

1.7 Find and fix lock contention

Lock contention will be one of the greatest enemies to throughput you will encounter. No matter how many years of experience you have, you will introduce unnecessary lock contention in your multi-threaded high throughput program. To find these areas for a program with any degree of complexity, you will have to break out the profiler. You will also need to become relatively expert at understanding and improving synchronization. There are good books on concurrency. Read and go forth.

Along the way, do not forget the synchronization-facilitating Java constructs that exist in addition to the synchronized keyword. Where appropriate, use volatile (of limited use since it guarantees visibility but not atomicity of compound operations; I like to use it primarily for the “volatile boolean die” indicator within a running thread) and, more commonly, the atomics found in java.util.concurrent.atomic. They will often allow you to eliminate an explicit synchronized block which can eventually get in the way of increasingly high throughput data.
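A minimal sketch combining both constructs (the worker loop is illustrative):

import java.util.concurrent.atomic.AtomicLong;

public final class Worker implements Runnable
{
   private volatile boolean die = false;   // the "volatile boolean die" indicator
   private final AtomicLong processed = new AtomicLong();

   public void run()
   {
      while (!die)
      {
         // ... process one unit of work ...
         processed.incrementAndGet();      // lock-free counter, no synchronized block
      }
   }

   public void shutdown() { die = true; }  // immediately visible to the running thread
   public long count()    { return processed.get(); }
}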

Additionally, the Concurrents (ConcurrentHashMap, ConcurrentLinkedQueue, ConcurrentSkipListMap/Set) are your friends when the additional behavior specific to these implementations is also appropriate for your design. The synchronization internal to these versions of the collections has been designed to be minimal and, in some cases, eliminated.

This is where things get interesting, maybe even controversial. But the controversy lasts only until the mechanisms described are tried and proven. With increasingly high throughput, as data is passed from one thread to another, the semaphore mechanism synchronizing the inter-thread transfer itself can become the bottleneck. Even with no lock contention, with one thread delivering the data and one receiving it, wait()-ing on the semaphore itself can slow things down. Opening and closing the gate hundreds of thousands of times each second is not very efficient. What to do?

I created two inter-thread non-locking classes with different characteristics to address the different usage contracts that arise. These help overcome the bottleneck of opening and closing the synchronization gate hundreds of thousands of times each second. The potential controversy is that they consist of a running thread with a configurable self-contained wait period. If the thread is going to be very busy processing data, it might as well be very busy… period. But it is actually much better than that. With carefully designed waiting and pausing, even as low as 1 ms between data processing moments (when no data is available), the CPU load barely registers, even with no data. And at most there will be a 1 ms delay when data finally does arrive during a pause. It is the best of both worlds.
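Those classes are not reproduced here, but a rough approximation of the pattern, assuming a lock-free queue and a configurable idle pause, looks like this:

import java.util.concurrent.ConcurrentLinkedQueue;

public final class PollingConsumer<T> implements Runnable
{
   private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<T>();
   private volatile boolean die = false;
   private final long pauseMillis;   // e.g. 1 ms

   public PollingConsumer(long pauseMillis) { this.pauseMillis = pauseMillis; }

   public void submit(T item) { queue.offer(item); }   // producers never block

   public void run()
   {
      while (!die)
      {
         T item = queue.poll();
         if (item != null)
            process(item);   // busy: drain as fast as possible, no gate to open or close
         else
         {
            try { Thread.sleep(pauseMillis); }   // idle: a tiny self-contained pause
            catch (InterruptedException e)
            {
               Thread.currentThread().interrupt();
               return;
            }
         }
      }
   }

   private void process(T item) { /* application-specific */ }
   public void shutdown() { die = true; }
}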

1.8 Check the third party code

There will be third party code, or code in the Java libraries themselves, which might affect your code in an adverse way, and you are going to have to identify these problems and work around them. Again, many can be non-issues in an undemanding application but prove critical in the applications we are talking about here.

I mentioned the full garbage collection which can take place with direct ByteBuffers (and how to work around it). Another thing I discovered, for example, was the effect of the JMX beans. Using JConsole to monitor my applications, I observed about 20 MB per minute being added to the heap and, worse, being promoted to long term memory. JConsole itself, in its activation of the beans in the application, is associated with the memory abuse. I still like to fire up JConsole for its simplicity and the information I can glean about a program at a glance, but I am also aware of the effects it has on the very program I am trying to monitor.

You are going to have to find these and eliminate or work around them if they are hindering throughput or creating excessive garbage and abusing memory.

1.9 Profile

At the end of the day, you are going to have to profile your application with the profiling tool of your choice (JProfiler, etc.) to identify and root out the dozens of bottlenecks and/or memory abuse areas you introduced in your code. Even if you are the most skilled, conscientious, detail-oriented programmer ever, the problems will be there. Walk down the CPU usage details and either find a reasonable explanation or go fixing. Walk down the memory details and either find a reasonable explanation or go fixing. Spend a lot of time here; it’s that important.


Chapter 2 – Garbage Collector Settings


(The following garbage collector settings were formulated and implemented before G1GC was mainstream. It is unknown to what extent G1GC would now succeed in replacing or improving upon the following strategy, given the requirement of eliminating GC pauses altogether.)

2.1 Strategy

Time spent improving an application using the techniques covered in the first chapter is a prerequisite to success with the next step. With the programming inefficiencies, wasted CPU cycles, unnecessary object creation, and garbage collector abuse out of the way (and the program maintained to stay that way), it becomes possible to nearly eliminate the effect of the garbage collector altogether. This may or may not be possible for a given application because of its data profile or memory limitations. Increasingly, however, with memory so affordable, it is not only possible but very achievable.

The strategy is to outlive every garbage collector process (except the young generation collection, of course) using ample memory and CPU and by configuring the JVM for efficient young memory collections. The rest of the potential garbage collector interference points are configured out of the picture to the extent possible. On top of that, there is an additional backstop for those applications still promoting too much memory to long term: schedule full garbage collections at known times when a pause can be tolerated, as previously mentioned.

Using this strategy has allowed me to design extremely demanding server programs that require a full garbage collection or restart just once each week. It depends on and is relative to the application and its own data requirements, of course, but dare go down this path. Give a program that you have refactored/fixed/rewritten based on the first chapter’s recommendations some extra gigabytes to outlive the garbage collection and watch the unscheduled pauses disappear forever.

2.2 Details

The young memory should be as large as possible, tailored to the number of available CPU cores on the machine.

Cores    Max ParallelGCThreads    Young Memory
2        1                        100m
4        3                        700m
8        6                        1600m


Additionally, the young memory and the total memory should be fixed sizes (the min and max are the same), and all automatic garbage collecting adjustments and fluctuations should be turned off.

The total memory then given to the application is increased to reflect what your program requires to outlive anything other than the young generation garbage collection.

This yields the following recommendations for, as an example, a program running on a quad core machine where 5 gigabytes has been determined sufficient to prevent the program from running anything but a young generation garbage collection between restarts or between carefully scheduled calls to System.gc():

 
// An example for a quad core machine and 5G having been determined
// sufficient to outlive the full gc (daily or weekly) after Chapter 1
// improvements have been fully implemented.
			
-Xms5000m                   // change this as needed
-Xmx5000m                   // same as Xms
-XX:ParallelGCThreads=3     // use the table above
-XX:NewSize=700m            // use the table above
-XX:MaxNewSize=700m         // same as NewSize

// Remember to add the parameter that gives you an expanded amount 
// of native memory (if any and as needed) for your managed native 
// ByteBuffer pool, discussed in the first chapter:

-XX:MaxDirectMemorySize=700m // change this as needed

// Add additional parameters which are either recommended or useful 
// in configuring the garbage collector out of the way:
 			
-server                     // server mode
-XX:+UseTLAB                // probably the default, but important
-XX:+UseBiasedLocking       // optional improvement for uncontended locking
-XX:+PrintGCDetails         // to see how good you are
-XX:+UseParallelGC          // of course
-XX:SurvivorRatio=8         // young promotion tuning
-XX:TargetSurvivorRatio=50  // young promotion tuning
-XX:MaxTenuringThreshold=5  // young promotion tuning
-XX:-UseAdaptiveSizePolicy  // turn off adaptive resizing

© 2017 PixelTwenty LLC
info@pixeltwenty.com