Modern JVMs perform many so-called ‘adaptive’ optimizations based on the current workload. This means that performance data from short tests (1 hour or less) tends to be an unreliable predictor of long-term application behavior.
The solution is to run longer tests (more than 8 hours) and to discard the first hour of test results. It is also good practice to report 95th percentile numbers rather than arithmetic means. The arithmetic mean (average) does not tell us what the majority of users are experiencing, and it is skewed by very large and very small results (outliers). Shared services are advised to use 99th percentile numbers or higher (Amazon is reported to measure its Dynamo infrastructure at the 99.9th percentile).
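To make the difference between the mean and a percentile concrete, here is a minimal sketch in Java. The class name `LatencySummary`, its method names, and the sample data are all made up for illustration; they are not part of the benchmark tooling. The sketch assumes the nearest-rank definition of a percentile (other definitions interpolate between samples) and assumes each observation carries the minute of the test at which it was taken.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical helper for illustration; not part of any benchmark tool.
public class LatencySummary {

    /** Arithmetic mean; returns NaN when there are no samples. */
    public static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    public static double percentile(double[] values, double p) {
        if (values.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    /** Keeps only the elapsed times observed at or after the warm-up cutoff. */
    public static double[] discardWarmup(double[] minuteOfTest,
                                         double[] elapsedSeconds,
                                         double warmupMinutes) {
        return IntStream.range(0, elapsedSeconds.length)
                .filter(i -> minuteOfTest[i] >= warmupMinutes)
                .mapToDouble(i -> elapsedSeconds[i])
                .toArray();
    }

    public static void main(String[] args) {
        // Made-up data: one sample every 15 minutes, not real benchmark results.
        double[] minute = new double[24];
        double[] elapsed = new double[24];
        for (int i = 0; i < 24; i++) {
            minute[i] = i * 15.0;
            elapsed[i] = i < 4 ? 1.5 : 2.0; // faster samples during the first hour
        }
        elapsed[23] = 30.0; // a single outlier late in the test

        // Drop the first hour, then summarize what is left.
        double[] steady = discardWarmup(minute, elapsed, 60.0);
        System.out.println("samples kept: " + steady.length);
        System.out.println("mean:         " + mean(steady));
        System.out.println("95th pct:     " + percentile(steady, 95));
        System.out.println("99th pct:     " + percentile(steady, 99));
    }
}
```

With this made-up data, a single 30-second outlier pulls the mean well above the 2 seconds that 19 of the 20 retained samples actually took, while the 95th percentile stays at 2 seconds. The 99th percentile does pick up the outlier, which is the point of using it for shared services.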
Below is the scatter plot of elapsed times for the eTech LiveCycle/ADEP Benchmark short-lived orchestration during the first hour of a 12-hour test with 10 concurrent users. About 20 minutes into the test, observations clearly stabilized around 1.5 seconds. They then crept up towards 2 seconds and tended to stabilize there. This behavior is explained by the ramp-up: the 10 concurrent users were added at the rate of 1 additional concurrent user every 5 minutes, so all of them were operating by the 50-minute mark.
These results are from a 64-bit Oracle HotSpot 1.6.0_26 JVM. IBM’s J9 JVM is known to exhibit similar behavior.