Monday, June 8, 2015

Common performance issues and drill down

Every time we run a performance test we want to catch any performance issue in the application by monitoring it. Very often we rely on reports generated after execution to analyze the performance of the application.
If you are an experienced software performance testing professional you will agree that we cannot conclude where a performance bottleneck is from just one or two result metrics. We usually need to correlate a large number of metrics and results to form an opinion about a bottleneck.
Since there are many performance metrics (the exact set depends on the scope of testing and the complexity of the application), it is often unclear how to analyze the results.
Following are some of the approaches and techniques we can use to find the root cause of a performance bottleneck in a software application.
Symptom – Periodic High response time
1)      Correlate with Available thread count
If Available thread count – Low 
Reasons may be –
 
Frequent GC leaks
Backend choke point 
So if the available thread count is low, it means the threads that are executing are taking longer than usual to complete processing, and hence the free threads are getting used up.
Two of the prominent reasons for threads taking more time to execute than normal are frequent GC leaks (threads which take longer for processing are not getting released, or there are deadlock issues on shared resources which prevent these threads from being freed) and a slow backend (there may be a database server or a third-party server which is taking time to process).
We can confirm the above point by looking at the thread metrics of any performance monitoring tool.
The following snapshot from dynaTrace can be helpful.
We can check whether the number of busy threads is increasing over time.
 


 *Threads – Threads are lightweight processes which share memory and code among themselves and hence are lower on resource consumption compared to processes. A thread pool is a collection of threads of a process used to service a particular functionality. The more threads in the pool, the more calls to that functionality the application can service, but as this number increases response time also increases, since these threads compete for shared resources. Hence there should be an optimized number of threads on a server. The metrics we generally look for are Pool size, Free pool size and Waiting thread count (a small monitoring sketch follows these definitions).
Pool size – Total number of threads allowed to be created in the pool.
Free pool size – Number of threads which are idle, i.e. the number of available threads.
Waiting thread count – Number of threads which are in execution, i.e. the number of threads in use.
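The same three numbers can also be watched from code. Below is a minimal sketch, assuming a plain java.util.concurrent.ThreadPoolExecutor rather than WebSphere's managed pools (which expose the equivalent counters through PMI); the pool size and sampling interval are made-up values.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolMonitor {

    public static void main(String[] args) {
        // Hypothetical worker pool standing in for an app-server thread pool.
        ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(50);

        // Sample the pool every 5 seconds, the way a monitoring agent would.
        ScheduledExecutorService sampler = Executors.newSingleThreadScheduledExecutor();
        sampler.scheduleAtFixedRate(() -> {
            int poolSize = pool.getPoolSize();       // threads currently created
            int active   = pool.getActiveCount();    // threads busy with tasks
            int free     = poolSize - active;        // roughly the "free pool size"
            int queued   = pool.getQueue().size();   // work waiting for a free thread
            System.out.printf("poolSize=%d active=%d free=%d queued=%d%n",
                    poolSize, active, free, queued);
        }, 0, 5, TimeUnit.SECONDS);
    }
}
```

If the free count stays near zero while the queued count keeps growing, it confirms the symptom described above: requests are waiting because all threads are busy.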
Typically there are the following types of thread pools on IBM WebSphere Application Server –
a)      WebContainer – For HTTP requests
b)      ORB – RMI/IIOP requests
c)       Messaging thread pool
 There are two kinds of threads based on the work they do –
a)      Worker threads – These are the code-processing threads and mainly need processor time and operating system services.
b)      I/O threads – These are threads which need device driver services.
 
2)      Correlate with CPU utilization
If CPU utilization – Low
Reason may be –
Internal chokepoint
There are two kinds of processes in any application.
CPU bound – In these kinds of processes, processor services are required to process the code.
I/O bound – In these kinds of processes, device driver services are required to move data into or out of the system.
If more of the processes in the application are CPU bound, then CPU utilization tends to be higher. If more of the processes do I/O activities, then average CPU utilization will be low, since the CPU is idle in those cases.
Now if we see low average CPU utilization but response time spikes, this may be because of I/O-heavy processes or threads. There may be issues in the application code which are making the application wait for non-CPU-bound work to complete; a thread-state sketch is shown below.
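One quick way to see whether threads are stuck on non-CPU work is to look at their states in a thread dump. The sketch below is a rough illustration using the standard ThreadMXBean (not any specific vendor tool): it counts how many threads are RUNNABLE versus BLOCKED or WAITING. A large blocked/waiting share together with low CPU utilization points to an I/O or locking choke point.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSnapshot {

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // Count threads per state for the current JVM.
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }

        // Many BLOCKED/WAITING threads alongside low CPU utilization suggests
        // the bottleneck is I/O or lock contention, not the processor.
        counts.forEach((state, n) -> System.out.println(state + " = " + n));
    }
}
```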
Symptom – Consistent High response time
Correlate with – Low available thread count
Reasons may be –
Inefficient code
Overuse of external system
Slow backend
Too many layers
If we observe high response times consistently and also see that the available thread count is low, it means threads are taking more time to process, which indicates a bottleneck somewhere in the downstream systems. There are many possible reasons for a low available thread count. One is inefficient code: developers may have kept the number of threads low anticipating fewer service calls to a particular functionality, or threads are not released after waiting too long for an event to occur. Another reason may be overuse of an external downstream system, which may have a capacity issue. A slow database server is another reason that keeps threads busy for longer; threads may be waiting longer for a database connection due to database or network issues. Or there may be too many layers in the application design. We need to drill down into these reasons further before attributing the issue to any particular one; a simple timing sketch is shown below.
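To decide which of these reasons to chase, it helps to time the calls into each downstream system separately. The sketch below is a simplified illustration of that idea for a database call; the JDBC URL, credentials, query and threshold are all hypothetical, and a real test would wrap every backend call this way.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SlowQueryLogger {

    private static final long THRESHOLD_MS = 500; // assumed acceptable backend time

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details, for illustration only.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://dbhost:3306/orders", "app", "secret");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT COUNT(*) FROM orders WHERE status = ?")) {

            ps.setString(1, "OPEN");

            long start = System.nanoTime();
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            if (elapsedMs > THRESHOLD_MS) {
                // A consistently slow backend keeps worker threads tied up
                // and pushes response times up across the whole application.
                System.out.println("Slow backend call: " + elapsedMs + " ms");
            }
        }
    }
}
```

If the database side of the timing stays healthy while overall response time is high, the drill-down moves to the other suspects: code, external systems or layering.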
Shown below is a high-level diagram of a 3-tier application.
 
 
Another reason for consistently high response time may be a network bottleneck.
If everything on the server side looks good, this is the remaining culprit that needs to be looked at.
The following diagram shows network time vs. transaction response time.
Now the above diagram is a simplified view of request and response. An actual request and response is a collection of data packets (in the case of TCP), since, as you know, data on networks travels as packets.
The request which the client sends to a server may consist of more than one packet, and the response from the server may also be made up of more than one packet of data. In that case we have to break down our generalized diagram above further, into something as shown below.
 
Now the client wants all the request packets to reach the server, the server to process them, and the complete response to reach back to it quickly. So for the client, the time that matters starts when it sends the first packet of the request and ends when it receives the last packet of the response.
But for the server, the time between the last request packet received and the first response packet sent is what matters, because by then it has finished the service it was asked for, and it is now the network's responsibility to deliver the complete response to the client. The sketch below splits these phases roughly from the client side.
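From the client alone we cannot perfectly separate server time from network time, but splitting a request into connect, wait and download phases gives a useful first cut. The sketch below is an approximation; the URL is a placeholder, and the "wait" phase still includes the time to send the request over the network.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ResponseTimeSplit {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://app-server.example.com/health"); // placeholder URL

        long start = System.nanoTime();
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.connect();                                   // TCP connection established
        long connected = System.nanoTime();

        try (InputStream in = con.getInputStream()) {    // returns once response headers arrive
            long firstByte = System.nanoTime();
            while (in.read() != -1) { /* drain the rest of the response */ }
            long lastByte = System.nanoTime();

            System.out.printf("connect=%d ms, wait(request+server)=%d ms, download=%d ms%n",
                    (connected - start) / 1_000_000,
                    (firstByte - connected) / 1_000_000,
                    (lastByte - firstByte) / 1_000_000);
        }
    }
}
```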
 
The time taken by request or response packets to reach the server and the client respectively depends on many factors, such as –
1) Bandwidth – How much data, in bits, can be transferred to and fro on a network medium. In today's networks bandwidth is typically measured in millions of bits per second. It is a physical-layer property of network models. For example, if we say our computer is connected to the network using a 56 kbps dial-up modem, this means that the bandwidth of this medium is 56 kbps.
2) Latency – Latency is how much time it takes for a packet of data to get from one designated point to another. In some cases the round-trip time (packet sent and the reply received at the source) is taken as the latency.
You can get a rough idea of latency using the ping command in Windows (a code-level sketch follows after this list).
3) Load – How much traffic the network is already carrying; a heavily loaded link adds queuing delay on top of latency.
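Besides ping, a rough latency number can also be taken from code. The sketch below (the host name is a placeholder) times InetAddress.isReachable, which depending on privileges sends an ICMP echo or a TCP probe, so it only approximates the round-trip time.

```java
import java.net.InetAddress;

public class LatencyProbe {

    public static void main(String[] args) throws Exception {
        InetAddress host = InetAddress.getByName("app-server.example.com"); // placeholder host

        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            boolean reachable = host.isReachable(2000); // 2-second timeout
            long rttMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("attempt " + (i + 1) + ": reachable=" + reachable
                    + ", approx RTT=" + rttMs + " ms");
        }
    }
}
```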
If you see that all the server metrics are healthy and there is no other reason like a slow backend system or a slow database server, then look into the performance of your network. We can analyze the impact of the network on our application using network virtualization tools. One such tool is HP Shunra, with which we can virtualize our network conditions and analyze their impact on application performance.
Symptom – Progressive High response time
1)      Correlate with – Low TPS
Reasons may be –
Memory leak
A memory leak is a condition where the application does not return memory which it will not use any more. There are many bad coding practices through which it can happen. Memory leaks can be detected by looking at memory utilization graphs where utilization keeps increasing over time.
There may be many causes of a memory leak, like non-recovery of objects which lost their reference during execution of the code, cleanup errors, or inappropriate session handling in web applications; a small example is shown below.
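As a concrete illustration of the lost-reference case, the sketch below (class, cache and sizes are all made up) keeps adding entries to a static map and never removes them, so the heap grows with every request in exactly the way the utilization graphs described here would show.

```java
import java.util.HashMap;
import java.util.Map;

public class SessionCache {

    // A static map lives for the life of the JVM; entries put here are never
    // garbage collected unless they are explicitly removed.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public static void handleRequest(String sessionId) {
        // Leak: one entry is added per session, but nothing ever evicts it,
        // so memory utilization climbs steadily under load.
        CACHE.put(sessionId, new byte[10 * 1024]); // ~10 KB retained per session
    }

    public static void main(String[] args) {
        // With a default-sized heap this loop will eventually fail with
        // OutOfMemoryError, mirroring the crash described below.
        for (int i = 0; i < 1_000_000; i++) {
            handleRequest("session-" + i);
        }
        System.out.println("Entries retained: " + CACHE.size());
    }
}
```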
If the memory leakage continues it will result in fewer transactions being processed by the application over time. Eventually, if the system keeps losing memory, the application will crash after a point of time.
In the following graph you will see that both the JVMs crashed over a time span after the system kept losing memory.
We can see the memory on the server using different ways or tools. One simple case is a Windows server, where we can use the built-in Resource Monitor to watch the memory.
Go to Run, type resmon.exe and press Enter. Go to the Memory tab. Click on the Private (KB) column of the Processes table and you will get the processes in decreasing order of their memory consumption. From here we can single out the processes which are taking more memory.
 
 
Processor bottleneck
Check the %Processor time and processor queue length graphs.
%Processor time gives the average utilization of all the processors on a Windows server. If it stays below 70 or 80% it is considered a good value (it varies from application to application, but this is the most common rule of thumb). If %Processor time remains above this consistently, it represents a performance bottleneck and we may need to add new hardware. There is another metric on a Windows server which conveys a CPU performance issue, i.e. Processor queue length.
Processor queue length represents the number of threads waiting for processor service at any particular point of time. Ideally this value should stay below about 2 per processor; if it is consistently higher, it shows that threads are waiting longer than they should for processing due to the burden on the processor. A CPU-sampling sketch is shown below.
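Alongside the Performance Monitor counters, the same numbers can be sampled from inside a JVM. The sketch below uses the com.sun.management extension of OperatingSystemMXBean, which is an optional, HotSpot/OpenJDK-specific API rather than part of the standard interface, to print system and process CPU load once a second.

```java
import java.lang.management.ManagementFactory;

public class CpuSampler {

    public static void main(String[] args) throws InterruptedException {
        // Vendor-specific MXBean; present on HotSpot/OpenJDK, hence the cast.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();

        for (int i = 0; i < 10; i++) {
            // Values are fractions between 0.0 and 1.0 (or negative if unavailable).
            double systemLoad  = os.getSystemCpuLoad();
            double processLoad = os.getProcessCpuLoad();
            System.out.printf("system CPU=%.0f%%  process CPU=%.0f%%%n",
                    systemLoad * 100, processLoad * 100);
            Thread.sleep(1000);
        }
    }
}
```

Sustained high system CPU with a long processor queue confirms the processor bottleneck; low CPU despite high response times points back to the I/O and thread issues discussed earlier.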
There are different ways on different operating systems to see which processes are using more processing time. In a simple case, on a Windows server we can use Resource Monitor to see this.
Go to Run, type resmon.exe and press Enter. Go to the CPU tab. Click on the CPU column of the Processes table and you will get the processes in decreasing order of their CPU consumption. From here we can single out the processes which are taking more CPU. Below is the snapshot -
 