August 2010
Technical white paper
Table of Contents
Introduction
Essential Tools on HP-UX for Java Applications
  HPjconfig
  Java Out-of-Box (JavaOOB)
  HPjmeter
    Collecting Profile Data for HPjmeter
    Collecting Garbage Collection Data for HPjmeter
  GlancePlus (glance or gpm)
    gpm (GUI mode)
    Glance Screen Mode
    Glance Adviser Mode
  Other Tools
Configuring Patches and Kernel Parameters for Java on HP-UX
  Running HPjconfig in GUI mode (default)
  Running HPjconfig in non-GUI mode
Key Factors Affecting Performance
  Java Heap Size and Garbage Collection Behavior
    Garbage Collection in HP's Hotspot JVM
    JVM Heap and GC Parameters
    Default GC Policies and Heap Settings on HP-UX
    Migrating from Solaris
    Migrating from IBM/AIX
    Confirm Garbage Collection Behavior using HPjmeter
  Thread Behavior and Lock Contention
    Detecting Lock Contention in Your Application
    Reducing Lock Contention in Your Application
  Deployment of Java Instances and Processor Usage
  Other Factors
    OS Scheduler
    Hyper-threading
    Other Java Options
    System Components
Memory Footprint of Migrated Java Application
  Java Process Memory Footprint
  Tools to Examine Java Process Memory Footprint
  Threads in the Java Process
  Reducing Starting Footprint
For More Information
Call to Action
Introduction
HP offers a full range of Java technology products on HP-UX 11i systems. We provide solutions to develop or deploy Java applications with the best performance on HP PA-RISC 11i v1 (11.11), 11i v2 (11.23), and 11i v3 (11.31), and HP Itanium 11i v2 (11.23) and 11i v3 (11.31). This document provides guidance on how to easily migrate your existing Java applications from other platforms to HP-UX. The topics covered are:
Essential Tools on HP-UX for Java Applications
Configuring Patches and Kernel Parameters for Java on HP-UX
Key Factors Affecting Performance
Memory Footprint of Migrated Java Applications
HPjconfig
HPjconfig is a tool for configuring and tuning HP-UX 11i systems for Java workloads. It provides recommended kernel parameter settings and patch information needed for running Java on HP-UX. HPjconfig is a pure Java application. To run, it requires the following:
Java 1.4.2.x or later
HP-UX 11i v1 or later
HP Integrity Itanium or HP 9000 PA-RISC system
To download HPjconfig, go to:
https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=HPJCONFIG
HPjmeter
HPjmeter is a Java performance analysis tool that helps identify and diagnose performance problems in your deployed Java application. It can be used in your production environment as well as during development. HPjmeter has two modes of operation:
Real-time monitoring of your running Java application
Static (off-line) analysis of data collected from your Java application using the profiling or garbage collection options
Automatic problem detection and alerts:
o Memory leak detection alerts with leak rate
o Thread deadlock detection
o Abnormal thread termination detection
o Expected out of memory error
o Excessive method compilation
o System and process CPU utilization thresholds
o Heap usage thresholds
o Garbage collection duration
o Finalizer queue length
Dynamic real-time display of application behavior:
o Java heap size
o Garbage collection events and percentage time spent in garbage collection
o CPU usage per method for hottest methods
o Object allocation percentage by method and by object type
o Method compilation count in the JVM dynamic compiler
o Number of classes loaded by the JVM and activity in class loaders
o Thrown exception statistics
o Multi-application, multi-node monitoring from a single console
o Applications are ready to monitor. At application start, no HPjmeter options are required to monitor the application (with Java 6.0.03 or later)
o Graphic display of profiling data
o Threads histogram and lock contention metrics
o CPU/clock time metrics for Java methods
o Call graphs with call count, CPU, or clock time
o Per-thread display of time spent in nine different states
o Per-thread or per-process display of metrics
o Reference trees for heap analysis
In-depth garbage collection analysis:
o Graphical display of Java heap usage and garbage collector behavior over time
o Details of GC type, GC frequency, GC duration, object creation rate, cumulative memory allocation
o User-configurable graphs for presenting all heap and GC data
o Detailed summaries of GC activity and system resource allocation, along with other system and JVM runtime and version data
This document describes using some of HPjmeter's static data analysis features to analyze the performance of your migrated Java application. For more details, as well as information on HPjmeter's Real-Time capabilities, refer to the Java Troubleshooting Guide for HP-UX Systems, and the HPjmeter User's Guide. The steps for collecting the profiling and garbage collection data for static data analysis are described below. See the section Key Factors Affecting Performance for steps on viewing the data in HPjmeter to identify common Java performance problems.
You can send the Xeprof output to a specified file using the file= keyword as follows:
$ java -Xeprof:file=<yourApp>.eprof <yourApp>
The <pid> will be inserted automatically in the file name, for example,
<yourApp><pid>.eprof.
To collect Xeprof data for a specified time interval, there are two options:
a. Turn profiling on/off based on a specified time since program start:
$ java -Xeprof:time_on=<start_time>,time_slice=<length_of_collection_time> <yourApp>
b. Turn on/off profiling using signals (for example, sigusr1 and sigusr2):
$ java -Xeprof:time_on=sigusr1,time_slice=sigusr2 <yourApp>
The generated filename will include the time between the start of an application and the start of profiling. For example:
java<pid>_<t>.eprof
NOTE: If you are running JDK 1.5.0.04 or later, the command-line option is not required to capture eprof data. Instead, you can toggle eprof data gathering on and off by sending signals to the currently running Java VM. One log file is produced per sample period; the name for the log file is java<pid>_<startTime>.eprof. The SIGUSR2 signal toggles the recording of eprof data. Use the following process to gather eprof data for specific periods:
Send SIGUSR2 to the Java VM process. The Java VM begins recording eprof data. Send SIGUSR2 to the Java VM process. The Java VM flushes eprof data and closes the log file.
For more information, see Profiling with Zero Preparation in the HPjmeter User's Guide.
2. Run the application to create a data file.
3. Start the HPjmeter console from a local installation on your client machine. For example, here are two different ways:
$ $JAVA_HOME/bin/java <heap_settings> -jar /opt/hpjmeter/lib/HPjmeter.jar
$ /opt/hpjmeter/bin/hpjmeter
4. Click File>Open File to browse for and open the data file.
5. A profile analysis screen opens, displaying a set of tabs containing summary and graphical metric data. The following screen shows an example:
Figure 1 HPjmeter Profile Data
Click among the tabs to view available metrics. Use the Metrics or Estimate menus to select additional metrics to view. Each metric you select opens in a new tab. Hover your mouse
over each category in the cascading menu to reveal the relevant metrics for that category. The following screen shows the available metrics for the threads/locks category: Figure 2 HPjmeter Threads/Locks Metrics
NOTE: If you are running JDK 1.5.0.14, JDK 6.0.06, or later versions, you can start GC data collection without using the Xverbosegc option on the command line (zero-prep Xverbosegc). Instead, you can toggle GC data collection on and off by sending a SIGPROF signal to the currently running Java VM. The GC data will be written to a file named java_<pid>.vgc. To start and stop GC data collection with zero-prep, execute:
kill -PROF <pid> or kill -21 <pid>
See Collecting GC Data with Zero Preparation in the HPjmeter User's Guide for more information.
2. Run the application to create a Xverbosegc data file.
3. Start the HPjmeter console from a local installation on your client machine:
$ java <heap_settings> -jar /opt/hpjmeter/lib/HPjmeter.jar
4. Click File>Open File to browse for and open the data file.
5. A GC viewer screen opens and displays a set of tabs containing metric data. The following figure shows the Garbage Collection Summary Analysis screen:
Figure 3 HPjmeter Garbage Collection Summary Analysis Screen
(Using -XX:+PrintGCDetails and -XX:+PrintHeapAtGC together can cause intermixed output, which cannot be parsed by HPjmeter. We recommend not using these two options together.)
3. Run the application, and view the resulting data file in HPjmeter.
(Your OE may already come with GlancePlus installed.) This section provides brief highlights on using glance/gpm to monitor Java applications. There are three ways to run glance or gpm:
The following figure is an example of the Main panel: Figure 4 GlancePlus (gpm) main panel
The Main Panel presents an overview of the system with a view of CPU, Network, I/O, and Memory. The Reports menu lets you select more detailed information, both global to the system and process-specific. To examine an individual process, select:
Reports>Process list (select the process you want to monitor)
From the process screen, select more details (for example, system calls, memory regions, thread list...)
Detailed on-line help is provided on the Main Panel. Additional examples of using gpm are provided in the section Key Factors Affecting Performance.
To invoke the script:
1. Start up your Java process and get its process id.
2. Start up glance to collect data:
javaGlanceAdviser.ksh <pid> [output-file] [sampling-interval]
e.g.
javaGlanceAdviser.ksh 1613 ga.out.1613 5
This will collect data every 5 seconds for process 1613 and write the output to file ga.out.1613. For very long runs, you can make the interval less frequent, for example every 20 seconds.
Here is an example of partial output from running the glance adviser script:
GBLH: date time swap nfile cpu_tot cpu_sys cpu_usr run_q mem_virt mem_res pageout_rate pageout_rate_cum bufcache_hit_pct disk_io net_in_pkt_rate net_in_err_rate net_out_pkt_rate net_out_err_rate
GBL: 05/28/2010 13:33:00 16 1 50 5 46 0 7995mb 21.45 0 0 99.30 1.60 12471 0 13779 0
PH: proc_name pid cpu cpu_sys cpu_usr vss rss data_vss data_rss threads files io disk_io
P: java 28818 323.60 27.30 296.30 3534380kb 3401580kb 77660 77660 89 173 0.00 0.00
PSH: syscall_name count count_cum rate rate_cum total_time total_time_cum
PS: read 281334 550369 28417.5 18345.6 107.62 209.63
PS: write 34005 65958 3434.8 2198.6 0.09 0.17
PS: ioctl 11559 22372 1167.5 745.7 9.52 18.31
PS: poll 11337 21990 1145.1 733.0 20.02 40.03
PS: send 144508 283858 14596.7 9461.9 1.62 3.17
PS: sched_yield 447103 1151559 45161.9 38385.3 0.40 1.00
PS: ksleep 17466 31275 1764.2 1042.5 403.69 601.73
PS: kwakeup 17392 31131 1756.7 1037.7 0.04 0.06
Other Tools
If lower-level analysis is required, additional tools are available, such as HP Caliper, tusc, and gdb. The standard JDK tool set is available as well. See Java Tools for HP-UX: Quick Start and Migration Guide for a complete list.
HPjconfig can be run in GUI mode or non-GUI (command-line) mode. In either mode, HPjconfig generates a summary of configuration information in a log file. The default name is:
HPjconfig_<hostname>_<date>_<timestamp>.log
5. The following four figures show the System, Application, Patches, and Tunables tabs for the HPjconfig tool in GUI mode: Figure 5 HPjconfig System Tab
usage: HPjconfig [ options ] -gui
       HPjconfig [ options ] <object> <action>
objects: -patches &| -tunables
actions: -listreq | -listmis | -listpres | -apply
options:
  -patches      operate on java-specific patches
  -tunables     operate on java-specific tunables
  -listreq      list all java required patches or tunables that are applicable to this system
  -listmis      list missing java-specific patches or tunables on the system
  -listpres     list applied (installed) java-specific patches or tunables on the system
  -apply        apply (install) missing java-specific patches or tunables on the system
  -javavers s   java versions for selecting patches e.g. 1.2, 1.3, 1.4, 5.0, 6.0
  -data s       local data file with java-specific list of patches and tunables
  -[no]gui      run in GUI mode
  -logfile s    name of log file
  -proxyhost s  HTTP proxy host name for accessing live data
  -proxyport s  HTTP proxy port for accessing live data
  -help         show help string and exit
  -version      show version string
Examples:
List options available in non-gui mode:
$ java -jar ./HPjconfig.jar -nogui -help
List missing patches:
$ java -jar ./HPjconfig.jar -nogui -patches -listmis
List required tunables:
$ java -jar ./HPjconfig.jar -nogui -tunables -listreq
List present patches and tunables:
$ java -jar ./HPjconfig.jar -nogui -tunables -patches -listpres
List present patches and tunables and write to a specified log file:
$ java -jar ./HPjconfig.jar -logfile my.log -nogui -tunables -patches -listpres
The log file produced can be used for remote analysis of your machine.
Java Heap Size and Garbage Collection Behavior
Thread Behavior and Lock Contention
Deployment of Java Instances and Processor Usage
Other Factors (OS Scheduler, HT, ForceMmapReserved, Exceptions, System Components)
New Space for newly created objects
Old Space for longer-lived objects
Perm Space for reflective data or "permanent" objects
The New Space is further divided into three regions: eden and two survivor spaces (to and from). Newly created objects are allocated in the eden space. When eden becomes full, a scavenge GC is triggered. Live objects in the eden space are identified and copied into the survivor space (to), and garbage is removed from eden. The to space then becomes the from space. The process starts again, with newly created objects going into eden and scavenge GCs getting triggered when eden is full. Live objects are then copied from eden and from into to. After objects have survived enough scavenges, they are migrated to old space, or tenured. The tenuring threshold determines how many scavenges an object survives before being migrated to old space. Eventually, when the old space is determined to be too full, a Full GC is performed. A Full GC collects both the new and old spaces, examining all objects in the entire heap. Scavenge GCs are usually quick, whereas Full GCs are very time-consuming. The principle behind generational garbage collection is that many objects in an application are short-lived and will be garbage-collected while in the New Space. Therefore, overall GC time is reduced for the most common case.
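The generational layout described above can be inspected from inside a running application through the standard java.lang.management API. The following is a minimal sketch; the pool names it prints (eden, survivor, old/tenured, and so on) vary by JVM vendor and collector, so the exact strings are not guaranteed:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.util.ArrayList;
import java.util.List;

public class ShowGenerations {
    // Lists the names of the JVM's memory pools. On a generational
    // collector these typically include eden, survivor, and old/tenured
    // spaces, though the exact names vary by JVM and collector.
    static List<String> poolNames() {
        List<String> names = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            names.add(pool.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : poolNames()) {
            System.out.println(name);
        }
    }
}
```

Running this under different GC options (for example, the collectors described below) is a quick way to see which generational pools your JVM actually configures.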
Basic "serial" collector (described in the previous section): This collector is a single-threaded, stop-the-world collector. On the New space, a fast copying collector performs the scavenge GC. On the Old space, a mark-compact collector performs the Full GC. The Full GC requires scanning and processing the entire heap and is very time-consuming.
Parallel scavenge of New space (high-throughput collector): This collector uses multiple threads to perform the scavenge GC on the New space, thereby reducing the total stop-the-world pause time. To enable parallel scavenge, use the JVM option:
-XX:+UseParallelGC
The JVM determines the number of parallel threads using heuristics based on number of CPUs available. You can also explicitly set the number of parallel GC threads with:
-XX:ParallelGCThreads=n
Old Generation Concurrent Mark Sweep (CMS) collector (low-pause collector): This collector runs mostly concurrently with the application. It attempts to clean out the Old Generation before it gets full, to avoid the costly Full GC. There are still two stop-the-world pauses, but they are very short compared to a Full GC. To enable CMS, use the JVM option:
-XX:+UseConcMarkSweepGC
Turning on this option automatically enables the following on the New space:
-XX:+UseParNewGC
UseParNewGC is a parallel scavenge collector specifically intended to work with the low-pause CMS collector. CMS is initiated when Old Space occupancy reaches a specified percentage, determined by CMSInitiatingOccupancyFraction:
-XX:CMSInitiatingOccupancyFraction=<percent>
CMS reduces large GC pause times, but incurs some additional CPU overhead while the application is running.
Old Generation Parallel GC (available in JDK 5.0 and later): This collector uses multiple GC threads to perform a Full GC when the Old space gets full, thereby reducing the very long pause times caused by a Full GC. To enable ParallelOldGC, use the JVM option:
-XX:+UseParallelOldGC
The JVM determines the number of parallel threads using heuristics based on number of CPUs available. You can also explicitly set the number of parallel GC threads with:
-XX:ParallelGCThreads=n
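Both parallel collectors size their default GC thread pools from the processor count the JVM can see, which a program can query directly. A small sketch (the exact sizing formula the JVM applies to this value is version-dependent):

```java
public class GcThreadHint {
    // Number of processors visible to this JVM; the parallel collectors'
    // default GC thread counts are derived from this value by
    // version-dependent heuristics.
    static int availableCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Processors visible to the JVM: " + availableCpus());
    }
}
```

Comparing this value with the thread count you set via -XX:ParallelGCThreads=n helps confirm whether you are overriding or matching the default.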
-XX:+UseConcMarkSweepGC    Enable Concurrent Mark Sweep (CMS) collector
-XX:+UseParallelOldGC      Enable parallel Full GC
-XX:+UseAdaptiveSizePolicy Attempt to resize subspaces to produce more optimal GC behavior
HP-UX. If you are unsure of the settings, or some are not set (and relying on defaults), then collect the GC details from the JVM running on Solaris using:
-Xloggc:loggc.solaris -XX:+PrintHeapAtGC
On HP-UX, set the -Xmx, -Xms, and -Xmn parameters based on the above values from Solaris, and set the GC policy to be the same as on Solaris. Add the following to verify that your settings are comparable:
-Xloggc:loggc.hpux -XX:+PrintHeapAtGC
The PrintHeapAtGC outputs can be viewed in HPjmeter. The following figure shows a sample screenshot of PrintHeapAtGC data in HPjmeter: Figure 9 GC Summary panel from PrintHeapAtGC output
-Xgcpolicy:gencon   Generational and concurrent. Generational collector with a concurrent mark phase.
-Xgcpolicy:subpool  Similar to the default policy optthruput, but employs an allocation strategy intended to perform better on machines with 16 or more processors. (Available on pSeries and zSeries.)
If, on IBM's JVM, your application runs with "optthruput", try the default basic GC policy on HP-UX. If your application runs with "optavgpause", try CMS (UseConcMarkSweepGC) on HP-UX. If your application runs with "gencon", first try the basic GC policy on HP-UX and then compare it to using CMS. If your application runs with "subpool" on a large multi-processor IBM machine, consider the strategies listed in the section "Deployment of Java Instances and Processor Usage". To determine the heap settings used on the IBM JVM, use the options:
-verbose:gc or -Xverbosegclog:<filename>
Note that the -Xverbosegclog option on the IBM JVM produces an XML file and is different from the -Xverbosegc option on the HP JVM. Try to set comparable heap settings on HP-UX. Because there is no direct mapping between GC policies from IBM to HP-UX, some experimentation and iteration will probably be necessary.
The "Heap Usage After GC" panel shows the application running close to the maximum heap size and hence, Full GCs are invoked frequently: Figure 11 Heap Usage After GC Panel
The "Duration" panel shows the very long Full GC times. In contrast, note that the scavenge GCs are extremely fast, as expected: Figure 12 GC Duration Panel
A typical remedy to the problem of Full GCs occurring too often is to increase the size of the heap, in particular the Old space. However, for the above example, increasing the heap size would have been insufficient for solving the problem. Several improvements were made to the above application including reducing the cache timeout so that objects do not live as long, and optimizing the cache implementation. This resulted in substantial performance improvement as seen in Figures 13 and 14. The Summary panel shows that the percentage time spent in GC has been reduced to 3.6%, with Full GCs occurring every 30 minutes instead of every 70 seconds. However, the Full GC durations are still 28 seconds on average.
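As a hypothetical illustration of the "reduce the cache timeout" remedy (this is a generic sketch, not the application's actual code), consider a time-bounded cache whose TTL controls how long cached objects stay reachable. A shorter TTL lets entries die young, so they are reclaimed by cheap scavenge GCs instead of surviving into Old space and feeding expensive Full GCs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical time-bounded cache. Shortening ttlMillis shortens object
// lifetimes, reducing the number of objects tenured into Old space.
public class TtlCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<K, Entry<V>>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    public void put(K key, V value) {
        map.put(key, new Entry<V>(value, System.currentTimeMillis() + ttlMillis));
    }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) {
            return null;
        }
        if (System.currentTimeMillis() >= e.expiresAt) {
            map.remove(key);   // drop expired entry so it becomes garbage quickly
            return null;
        }
        return e.value;
    }
}
```

The trade-off is more cache misses; the GC graphs in HPjmeter are a good way to judge whether a shorter timeout is paying off.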
The caching mechanism was eventually replaced with a more efficient implementation. This cut the GC duration roughly in half, as shown in the following figure:
(Note: When migrating your Java application from another platform, you should not have to modify the application itself. The example above illustrates HPjmeter's functionality; it does not imply that application changes are required.)
Other recommendations:
UseAdaptiveSizePolicy is intended to resize the subspaces of the Java heap while the application is running in order to improve GC behavior. Occasionally, it can cause a ping-pong effect, where every scavenge causes the subspace sizes to bounce back and forth. In such cases, it is best to disable adaptive sizing (-XX:-UseAdaptiveSizePolicy).
When using CMS, make sure that the CMS collection successfully cleans out the Old space before the Old Space gets completely full. If the Old Space fills up, HPjmeter will show an "incomplete CMS" and a regular Full GC will occur. In such cases, try lowering the occupancy at which a CMS cycle is initiated, using -XX:CMSInitiatingOccupancyFraction=<percent>.
For further information on garbage collection in Hotspot and using HPjmeter to optimize GC performance, refer to the following: Tuning Garbage Collection with the 1.4.2 Java Virtual Machine
http://java.sun.com/docs/hotspot/gc1.4.2/index.html
2. Look at the system call output for each interval.
3. High rates of sched_yield, ksleep, or kwakeup indicate lock contention (see the 3rd column). Figure 16 shows partial glance output with some very high sched_yield rates (e.g., 85K per second):
Figure 16 Output from glance adviser script (1 record)
GBLH: date time swap nfile cpu_tot cpu_sys cpu_usr run_q mem_virt mem_res pageout_rate pageout_rate_cum bufcache_hit_pct disk_io net_in_pkt_rate net_in_err_rate net_out_pkt_rate net_out_err_rate
GBL: 05/28/2010 13:33:10 ...
PH: proc_name pid cpu cpu_sys cpu_usr vss rss data_vss data_rss threads files io disk_io
P: java 28818 353.78 29.58 324.20 3534380kb 3403056kb 77660 77660 89 173 0.00 0.00
PSH: syscall_name count count_cum rate rate_cum total_time total_time_cum
PS: read / write / ioctl / poll / send / sched_yield / ksleep / kwakeup [per-call values not recovered in this extract]
In addition, the glance output provides useful data regarding overall system performance. In particular, note the CPU usage of 353% (324% user and 29% system), the process size (vss 3.5GB and rss 3.4GB), and the 89 threads running.
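In addition to glance, the JVM's own management API can report per-thread monitor contention programmatically. A minimal sketch using the standard ThreadMXBean (the blocked count is a rough signal, not a substitute for profiling):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class BlockedCounts {
    // Builds one "name blocked N times" line per live thread; the blocked
    // count is how often the thread waited to enter a synchronized block,
    // which gives a rough lock-contention signal.
    static List<String> report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<String>();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null) {   // a thread may have died since the id snapshot
                lines.add(info.getThreadName() + " blocked "
                        + info.getBlockedCount() + " times");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : report()) {
            System.out.println(line);
        }
    }
}
```

Threads whose blocked counts climb rapidly between samples are good candidates to examine in the HPjmeter Threads Histogram described next.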
Using HPjmeter
To examine lock contention in HPjmeter:
1. Collect an Xeprof profile (see the section Collecting Profile Data for HPjmeter).
2. Open the file in HPjmeter (File->Open File).
Figure 17 shows the HPjmeter Summary screen, including the running time (that is, for how long the profile was collected), and shows that the application has 313 threads:
Figure 17 HPjmeter Summary screen
3. Click the Threads Histogram tab. The threads histogram shows each thread, its lifetime, and a color-coded set of states indicating how the thread spends its time: lock contention, garbage collection, CPU, I/O, sleeping, waiting, and so forth. Red indicates lock contention.
Figure 18 shows this application has a huge amount of lock contention, with every thread showing a large amount of red: Figure 18 Threads Histogram screen
4. Select a thread, and double-click on it. This brings up a pie chart with a breakdown of time spent in each of the states. Figure 19 shows this thread spending 72% of its time in lock contention and 24% in garbage collection, with no CPU time at all. In other words, the thread is not getting any real work done: Figure 19 Threads Histogram with pie chart
5. Determine where the lock contention is occurring. Select Metrics->Threads/Locks->Lock Delay - Method Exclusive. Figure 20 shows that the highest lock delay (time spent waiting to acquire a lock) comes from the method weblogic.utils.classloaders.ChangeAwareClassLoader.loadClass: Figure 20 Lock Delay Method Exclusive
6. Find where the method is called in the call graph tree (Figure 21):
Click on <method name> in the list.
Select Edit->Mark to mark the method for finding later.
Select Metrics->Threads/Locks->Lock Delay - Call Graph Tree (brings up the call graph tree).
Select Edit->Find Immediately (finds the method in the call graph tree).
7. Look at lock delay at the thread level instead of the process level (Figure 22):
Select Scope->Thread (change to thread scope).
Select Metrics->Threads/Locks->Lock Delay - Method Exclusive.
Figure 22 Thread level Lock Delay - Method Exclusive
Tune the thread count in the application (most likely reduce the number of threads)
Tune the deployment of Java (see the next section, "Deployment of Java Instances and Processor Usage")
Make application modifications to reduce lock contention (break up locks or hold locks for a shorter time)
Modify the thread scheduling policy (see the section Other Factors)
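As a hedged illustration of "breaking up" a lock (a generic example, not the profiled application's code): instead of guarding one shared map of counters with a single synchronized block, where every thread contends on one monitor, ConcurrentHashMap internally stripes its locks so updates to different keys rarely contend:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Lock-splitting sketch: per-key AtomicLong counters inside a
// ConcurrentHashMap replace a single synchronized HashMap, so threads
// updating different counters no longer serialize on one monitor.
public class StatsCounters {
    private final Map<String, AtomicLong> counters =
            new ConcurrentHashMap<String, AtomicLong>();

    public void increment(String name) {
        AtomicLong c = counters.get(name);
        if (c == null) {
            AtomicLong fresh = new AtomicLong();
            c = counters.putIfAbsent(name, fresh);
            if (c == null) {
                c = fresh;   // this thread won the race to install the counter
            }
        }
        c.incrementAndGet();
    }

    public long get(String name) {
        AtomicLong c = counters.get(name);
        return c == null ? 0 : c.get();
    }
}
```

The same idea, holding locks for a shorter time and over a smaller scope, is what the Threads Histogram improvements below reflect.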
Figures 23 and 24 show the Threads Histogram and pie chart after reducing the thread count from 300 to 100, improving class loading, and rearchitecting parts of the application to reduce lock contention. Lock contention is substantially reduced with these improvements. Figure 23 Threads Histogram screen (after improvement)
Throughput or performance is not what you expected.
The Java process does not appear to scale with an increased number of cores.
The application is experiencing very high lock contention. Use glance and HPjmeter to analyze lock contention (see the section Thread Behavior and Lock Contention above).
CPU is very underutilized even with the addition of more cores. Use glance to observe.
Your application is experiencing sporadic high GC latencies. Use HPjmeter to look at GC duration.
Running multiple Java instances on larger machines versus a single Java instance has several advantages:
Lock contention is potentially reduced. Rather than all threads waiting on a single lock in a single instance, having multiple instances essentially breaks up the single lock into multiple separate locks, thereby reducing lock contention overall.
The effects of pause times for GCs can be reduced.
o As you try to increase the load on your system with a single Java instance, you would need to keep increasing the size of your heap to accommodate the increased object creation rate. Assuming the basic garbage collector, increasing the total heap size results in longer stop-the-world pause times when GC occurs. On the other hand, with multiple instances of Java, each instance can have a smaller heap size, since multiple instances handle the increased load. The stop-the-world pause time for each GC in a small instance is less than with one huge instance.
o With the basic garbage collector, each time a GC occurs, all threads stop and are brought to a safepoint. Then GC occurs. If you have a single instance of Java, the entire instance is stopped waiting for GC to complete. With multiple instances, one Java instance could be doing GC while the others continue to execute the actual application.
Running multiple Java instances each in its own processor set offers additional advantages of improved locality:
Ensures Java processes use local memory (cell-local or socket-local) for faster allocation and memory access
Reduces cache-to-cache misses by keeping accesses local and preventing the scheduler from moving processes across locality domains
For an application server such as Oracle Weblogic, we recommend assigning one Java instance to a processor set with 2-4 cores. For instructions on using processor sets, refer to the psrset man page (man psrset) or
http://h71028.www7.hp.com/enterprise/us/en/os/hpux11i-prm-processor-sets.html
Figures 25 and 26 below illustrate an example of an application, using the Oracle Weblogic application server, which initially did not scale when moving from 2 cores to 8 cores. After changing to run multiple Java instances, each in a 2-core processor set, the application achieved the desired scaling and performance goal. Figure 25 shows the Threads Histogram for a single instance of Weblogic running on 8 cores. The 8 weblogic.socket.Muxer threads show substantial lock contention; one of the threads is at 88% lock contention. (This is a common scenario.) Figure 26 shows the result after running multiple Weblogic instances, each in a 2-core pset. The number of socket.Muxer threads was also reduced to 3 per instance (instead of 8). The screen shows the Threads Histogram for a single instance. Although there is still some lock contention, it has been noticeably reduced. Figure 25 Threads Histogram (before)
Figure 28 shows one sample glance record after the change to multiple instances/psets.

Figure 28 Sample glance record (after)
...
GBLH: date time swap nfile cpu_tot cpu_sys cpu_usr run_q mem_virt mem_res pageout_rate pageout_rate_cum bufcache_hit_pct disk_io net_in_pkt_rate net_in_err_rate net_out_pkt_rate net_out_err_rate
GBL: 06/04/2010 14:42:32 19 1 27 3 24 0 13529mb 24.53 0 0 100.00 3.00 10326 0 11498 0
PH: proc_name pid cpu cpu_sys cpu_usr vss rss data_vss data_rss threads files io disk_io
P: java 22925 178.80 16.40 162.20 3514828kb 3382284kb 73564 73564 68 172 0.20 0.20
PSH: syscall_name count count_cum rate rate_cum total_time total_time_cum
PS: read 98957 747895 20195.3 5527.6 31.15 227.78
PS: write 11274 87629 2300.8 647.6 0.03 0.23
...
Focusing just on sched_yield rates (grep sched_yield), the following screen is a snippet from the single-instance Java run. Note the sched_yield rates (3rd column) get as high as 90K per second.

Figure 29 sched_yield rates (before)
...
PS: sched_yield 704456 704456 71157.1 35222.8 0.59 0.59
PS: sched_yield 447103 1151559 45161.9 38385.3 0.40 1.00
PS: sched_yield 854681 2006240 85468.1 50156.0 1.41 2.41
PS: sched_yield 3228 2009468 322.8 40109.1 0.01 2.42
PS: sched_yield 427631 2437099 43195.0 40618.3 0.37 2.79
PS: sched_yield 80744 2517843 8074.4 35969.1 0.10 2.90
PS: sched_yield 125310 2643153 12786.7 33080.7 0.12 3.01
PS: sched_yield 10114 2653267 1011.4 29480.7 0.02 3.04
PS: sched_yield 86729 2739996 8672.9 27372.5 0.09 3.13
PS: sched_yield 20274 2760270 2047.8 25093.3 0.05 3.18
PS: sched_yield 6122 2766392 651.2 23149.7 0.03 3.21
PS: sched_yield 4212 2770604 405.0 21312.3 0.02 3.23
PS: sched_yield 480955 3251559 48581.3 23225.4 0.43 3.65
PS: sched_yield 459844 3711403 46448.8 24759.1 0.51 4.17
PS: sched_yield 368195 4079598 36454.9 25497.4 0.34 4.50
PS: sched_yield 9701 4089299 979.8 24054.7 0.05 4.55
PS: sched_yield 107817 4197116 10890.6 23317.3 0.10 4.65
PS: sched_yield 322367 4519483 32236.7 23786.7 0.31 4.96
PS: sched_yield 918902 5438385 92818.3 27205.5 0.78 5.73
PS: sched_yield 3603 5441988 360.3 25914.2 0.02 5.76
...
The following screen is a snippet after switching to multi-instance/psets.

Figure 30 sched_yield rates (after)
...
PS: sched_yield 716 2887 146.1 24.0 0.08 1.45
PS: sched_yield 475 3362 95.0 26.8 0.10 1.56
PS: sched_yield 443 3997 90.4 29.5 0.09 1.67
PS: sched_yield 4217 8214 843.4 58.5 0.11 1.78
PS: sched_yield 6101 14315 1245.1 98.5 0.06 1.85
PS: sched_yield 17678 31993 3535.6 212.8 0.02 1.87
PS: sched_yield 1230 33223 246.0 213.9 0.05 1.91
PS: sched_yield 8189 41412 1671.2 258.5 0.08 1.99
PS: sched_yield 9697 51148 1939.4 300.3 0.08 2.08
PS: sched_yield 26599 77747 5319.8 443.5 0.03 2.11
PS: sched_yield 3372 81119 688.1 450.1 0.00 2.11
PS: sched_yield 335 81575 68.3 428.8 0.06 2.18
PS: sched_yield 9028 90603 1805.6 464.1 0.13 2.31
PS: sched_yield 1821 92424 371.6 461.6 0.01 2.32
PS: sched_yield 1037 93461 207.4 455.2 0.01 2.33
PS: sched_yield 1571 95032 314.2 451.8 0.04 2.37
PS: sched_yield 16607 111639 3389.1 518.7 0.01 2.38
PS: sched_yield 734 112373 149.7 510.3 0.00 2.39
PS: sched_yield 5939 118312 1187.8 525.3 0.08 2.47
PS: sched_yield 3713 122025 757.7 530.0 0.00 2.47
...
As you can see, sched_yield rates are substantially reduced after switching to multiple Java instances running in psets, indicating reduced lock contention. Finally, Figures 31 and 32 compare scavenge durations before and after the change. Even though scavenge GCs are quick and were not causing a major problem, note that the first screenshot shows occasional higher scavenge times, as high as 500ms. In the second screenshot, the scavenges are more uniformly under 150ms, with only a few reaching 250ms.
Figure 31 Scavenge durations (before)

Figure 32 Scavenge durations (after)
Other Factors
This section discusses other factors that can affect the performance of your Java application.
OS Scheduler
The scheduling policies in the OS can have a significant impact on heavily multi-threaded Java applications. By default, the OS scheduling policy is SCHED_HPUX (also known as SCHED_TIMESHARE), in which a thread's priority is decayed over time as the thread consumes processor cycles and boosted while the thread waits for processor cycles. If your application is experiencing performance problems due to heavy lock contention among threads, you might see some improvement by changing the scheduling policy to SCHED_NOAGE. The priority of a thread executing with the SCHED_NOAGE policy is neither decayed nor boosted by the operating system scheduler. To switch the scheduling policy to SCHED_NOAGE, use the following JVM option: -XX:SchedulerPriorityRange=SCHED_NOAGE
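The kind of workload that is sensitive to scheduling policy is one where many threads repeatedly contend on the same monitor. The following minimal sketch (a hypothetical example, not taken from this white paper) reproduces that pattern: every worker thread serializes on one lock, so threads spend much of their time blocked and runnable again, which is exactly where priority decay and boosting come into play.

```java
// ContentionDemo.java - hypothetical sketch of heavy monitor contention.
// All threads serialize on a single lock, the scenario where experimenting
// with SCHED_NOAGE may help on HP-UX.
public class ContentionDemo {
    private static final Object lock = new Object();
    private static long counter = 0;

    public static void main(String[] args) throws InterruptedException {
        int nThreads = 8;
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    synchronized (lock) {   // every increment contends on one monitor
                        counter++;
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();                        // join gives visibility of counter
        }
        System.out.println(counter);         // 8 threads x 100000 increments
    }
}
```

A profile of such a run (for example, the HPjmeter Threads Histogram) would show high lock contention on the shared monitor; whether SCHED_NOAGE actually helps depends on the application and must be verified by experiment.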
Hyper-threading
If you are migrating your application to the Intel(R) Itanium(R) processor 9300 series, you might see some performance gain by enabling hyper-threading. Whereas enabling hyper-threading on the older 9100 series did not yield much performance benefit, improvements to thread-switching decisions in the 9300 series make hyper-threading perform better. Whether or not you see a benefit depends on the application, so experimentation is required. To turn on hyper-threading, do the following:
1. Enable hyper-threading on your machine:
$ setboot -m on
$ reboot
2. Dynamically turn on the hyper-threading tunable:
$ kctune lcpu_attr=1
For more details, refer to the setboot and lcpu_attr man pages.
ForceMmapReserved
Some applications will see improved performance by using the option:
-XX:+ForceMmapReserved
This option tells the JVM to reserve the swap space for all large memory regions used by the JVM (such as the Java heap).
Java Exceptions
Exception handling in Java is very expensive, and processing an excessive number of exceptions will cause non-optimal application performance. When migrating your application from another platform, errors in deployment can cause exceptions to occur. For example,
configuration files may require updating - if a config file points to an old IP address or a missing file, an exception will occur. Excessive exceptions can also be caused by non-optimal programming, where exceptions are used for control logic. You can use HPjmeter to see the number of exceptions and where they are occurring:
1. Collect -Xeprof profile data.
2. Open the file in HPjmeter (File->Open File).
3. Select Estimate->Exceptions Thrown. This pops up an "Exceptions Thrown" box showing the number of exceptions and the methods in which they occur.
4. Click on the <method name>, and click the "Mark to Find" button.
5. Bring up the Call Graph Tree: Select Metrics->Code/CPU->Call Graph Tree (Call Count).
6. Find the method in the Call Graph Tree: Select Edit->Find Immediately.
7. Expand the method (click +) to see the type of exception being thrown.
If these exceptions are the result of an unexpected deployment error, then you can fix the error. If it is not possible to remove the source of the exception, then use the following option to suppress the filling in of the stack trace by the JVM when an exception occurs, thereby reducing some of the performance hit from exception handling:
-XX:-StackTraceInThrowable
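The "exceptions as control logic" anti-pattern can be made concrete with a small sketch. The example below is hypothetical (the class and method names are not from this white paper): the first method uses NumberFormatException to classify input, paying for exception object creation and stack-trace fill-in on every miss, while the second validates characters up front so the common path throws nothing.

```java
// ExceptionControlFlow.java - hypothetical illustration of exceptions used
// for control logic versus an up-front check that avoids them.
public class ExceptionControlFlow {

    // Anti-pattern: an exception is thrown and caught for every non-numeric input.
    static boolean isNumericViaException(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;        // exception creation + stack-trace fill-in paid here
        }
    }

    // Cheaper alternative: validate the characters; no exception on any path.
    // (A sketch only - it ignores overflow, unlike parseInt.)
    static boolean isNumericViaCheck(String s) {
        if (s == null || s.isEmpty()) return false;
        int start = (s.charAt(0) == '-' && s.length() > 1) ? 1 : 0;
        for (int i = start; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String[] inputs = { "42", "-7", "abc" };
        for (String in : inputs) {
            System.out.println(in + " " + isNumericViaException(in)
                    + " " + isNumericViaCheck(in));
        }
    }
}
```

In an HPjmeter Call Graph Tree, the exception-based variant shows a high call count under the NumberFormatException constructor; the check-based variant does not appear in the exception estimate at all.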
System Components
There are other system components that could be causing bottlenecks in overall system performance, thereby hiding application server performance. These include the database, network, and file system. You can use HPjmeter and glance as an indicator of whether these components are affecting performance. Then, use other tools to monitor the database, network, and file system, and tune them if necessary.
o Java heap (mmap region) (-Xms, -Xmx) (see discussion of defaults in the Key Factors Affecting Performance section)
o Permanent space (mmap region) (-XX:PermSize, -XX:MaxPermSize) (defaults: 16m, 64m)
o Code cache (mmap region) - contains runtime compiled code (-XX:ReservedCodeCacheSize) (IA default: 64m, PA default: 32m)
o C heap (data region) - contains JVM C/C++ data structures
o Java application thread stacks (mmap region) [multiply by number of threads] (-Xss) (IA default: 512k**, PA default: 512k) ** see clarification in the Threads in the Java Process section below
o main stack (stack region)
o text (text region)
o shared libraries
o internal JVM thread stacks (mmap region) - see the Threads in the Java Process section below
Columns 6 and 7 of the P: record show the VSS (3.5GB) and RSS (3.4GB) of the entire process. Columns 8 and 9 show the VSS (77MB) and RSS (77MB) of the data segment (C heap). From this data, you can observe the memory usage over time and whether the process stabilizes. You can also see how much of the memory usage is due to the C heap (JVM structures) versus the Java heap and other memory-mapped regions. To look at the memory regions in more detail, use gpm:
1. Start up gpm.
2. Select Reports->Process List (brings up the process screen).
3. Select the java process from the process screen.
4. From the process screen, select Reports->Memory Regions.
5. Sort the memory regions by VSS: Select Configure->Sort Fields, click on VSS, click on the left-most column header, and click "done".
6. The top half of the display gives a summary; the bottom half lists all memory regions. The Java heap regions will appear towards the top (if you sorted by VSS).
If you are migrating from Solaris or AIX, you may be accustomed to using the pmap (Solaris) or procmap (AIX) commands to look at the sizes of the memory regions. The pmap command is also available on HP-UX 11.31.
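Alongside the external view from gpm or pmap, you can watch the Java heap from inside the JVM using the standard java.lang.Runtime API. The snippet below is a hypothetical helper (not part of this white paper) that prints the heap figures; note that Runtime reports only the Java heap, so the C heap, thread stacks, and code cache are visible only through gpm/pmap.

```java
// HeapWatch.java - hypothetical sketch: observe Java heap usage from inside
// the JVM, complementing the per-region view that gpm provides.
public class HeapWatch {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxMb   = rt.maxMemory()   / (1024 * 1024); // upper bound, roughly -Xmx
        long totalMb = rt.totalMemory() / (1024 * 1024); // currently committed heap
        long freeMb  = rt.freeMemory()  / (1024 * 1024); // free space within committed heap
        long usedMb  = totalMb - freeMb;
        System.out.println("max=" + maxMb + "MB committed=" + totalMb
                + "MB used=" + usedMb + "MB");
    }
}
```

Logging these figures periodically from a long-running server shows whether the committed heap stabilizes, the same question the glance data above answers for the whole process.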
On PA, the default is 512k. On Itanium, the stack region contains two parts - the memory stack and the register stack. The default stack region is 512k, of which 256k goes to the memory stack. The -Xss parameter specifies the size of the memory stack only. When -Xss<size> is specified, the JVM doubles <size>, so the actual stack region will be 2*<size>. For example, -Xss512k results in a stack region of 1M.
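The practical effect of the thread stack size is how deep call chains can go before a StackOverflowError. This hypothetical probe (not from the white paper) measures the reachable recursion depth in a fresh thread; running it with different -Xss values shows how the limit moves.

```java
// StackDepth.java - hypothetical sketch: measure roughly how deep recursion
// can go before the thread stack (sized with -Xss) overflows.
public class StackDepth {
    static int depth = 0;

    static void recurse() {
        depth++;
        recurse();               // recurse until the stack is exhausted
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            try {
                recurse();
            } catch (StackOverflowError e) {
                // Catching StackOverflowError is acceptable in a throwaway probe.
                System.out.println("overflowed at depth " + depth);
            }
        });
        t.start();
        t.join();
    }
}
```

Comparing the reported depth under, say, -Xss256k and -Xss1m is a quick way to confirm how the platform interprets the option (recall that on Itanium the JVM doubles the specified size).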
Internal JVM threads include the following: VM thread, compiler threads, parallel GC threads, watcher thread, and so on. To print the default sizes of the thread stacks, use the option:
-XX:+ThreadPriorityVerbose
On Java 5.0 and Java 6:
o the default VM thread stack size is 1M
o the default compiler thread stack size is 4M
In a production environment, for a long-running server application, we generally do not recommend modifying the internal JVM thread stack sizes. For testing purposes, if you want to experiment with reduced thread stack sizes, see the next section, "Reducing Starting Footprint".
For a large, long-running server application, the starting footprint or memory usage is not as important as the longer-term steady-state memory usage. However, sometimes an application deployment uses many small Java processes, all running on one machine. In such cases, the starting footprint of the Java process becomes critical. To lower the footprint of your Java process on Itanium, you can experiment with these options:
o Minimize the -Xms parameter; allow -Xmx to be larger to accommodate the few processes that will require a larger heap.
o Lower PermSize to a minimum (enough to accommodate startup).
o Reduce the code cache size. For example: -XX:ReservedCodeCacheSize=32m
o Reduce the Java thread stack size to the minimum possible; watch for stack overflow: -Xss200k
o Reduce the number of threads in the application.
o Reduce the VM thread stack size to 512k: -XX:VMThreadStackSize=512
o Reduce the CompilerThread stack size to 1m: -XX:CompilerThreadStackSize=1024
As mentioned in the Key Factors Affecting Performance section, defaults on Solaris or AIX may differ from those on HP-UX. If there is a large difference in footprint when moving to HP-UX, most likely the settings of the above parameters differ widely between the platforms and need to be adjusted.
39
Call to Action
HP welcomes your input. Please give us comments about this white paper, or suggestions for HP-UX Java or related documentation, through our technical documentation feedback website:
http://www.hp.com/bizsupport/feedback/ww/webfeedback.html
2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. 59000594, March 2010