One of the first posts I published on this blog was about acceptable I/O latencies. I recently received a question from a reader on this topic that I would like to share with you. It is a great example of the point I was trying to make in that post.
The reader wrote (I edited the original email to maintain the reader’s privacy):
Hello. I have just read your article about acceptable I/O latencies at http://www.theiostorm.com/whats-an-acceptable-io-latency/ and I think I need your advice.
An Apache2 web server dispatches fairly short streamed audio files from disk and connects to a remote MySQL database. The machine is an Amazon EC2 large instance with one EBS (Elastic Block Store) volume for the system and another for data (all the Apache content).
Apparently, the monitoring tool munin is reporting strangely high disk latencies:
Average of 10.37 ms for the system disk
Average of 156.22 ms for the data disk
I am a bit lost because ‘top’ does not report high I/O wait CPU usage – it averages around 1%.
So, do I believe top and look for the culprit elsewhere, or do I believe munin and conclude that my Amazon EBS volumes are guilty of my suffering?
Thanks in advance for your answers and for your article too.
The reader is understandably unsure whether to trust the munin output showing such high disk latencies. He assumes that if the munin numbers are correct, he has found the problem, and that reducing the disk latencies will solve the performance issue. He asked for my help in verifying whether the disk latencies really are that high.
At first read, the email sounds like a classic disk bottleneck with extreme latencies. I asked the reader to send me iostat and vmstat output. Below is a snippet from the iostat output:
Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
xvdap1     0.00    0.00  0.20   3.80    1.60   30.40      8.00      0.02   5.00   0.60   0.24
xvdf       0.00   29.20  1.00  26.00   49.60  441.60     18.19      1.89  69.87   3.35   9.04
The iostat output confirmed that munin was reporting valid numbers. The latencies are indeed very high (69 ms for the data disk in this sample), well above what is recommended for most systems. But are these high latencies acceptable? The iostat output also revealed that the number of IOPS is very low (1 read/sec and 26 writes/sec) and that the queue length is short. The vmstat output (omitted here) confirmed that I/O wait is low.
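To make this kind of cross-check repeatable, here is a minimal Python sketch that parses extended iostat lines like the ones above and distinguishes a disk that is merely slow from one that is slow and actually being pushed. The column order follows the classic sysstat format shown in the snippet; the thresholds are illustrative assumptions, not official limits.

# Minimal sketch: parse extended iostat lines (classic sysstat column
# order, as in the snippet above) and distinguish a disk that is merely
# slow from one that is slow AND actually being pushed.
# The 20 ms / 100 IOPS / 50% util thresholds are illustrative assumptions.

SAMPLE = """\
xvdap1 0.00 0.00 0.20 3.80 1.60 30.40 8.00 0.02 5.00 0.60 0.24
xvdf 0.00 29.20 1.00 26.00 49.60 441.60 18.19 1.89 69.87 3.35 9.04"""

FIELDS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rsec/s", "wsec/s",
          "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]

for line in SAMPLE.splitlines():
    parts = line.split()
    dev, stats = parts[0], dict(zip(FIELDS, map(float, parts[1:])))
    iops = stats["r/s"] + stats["w/s"]
    slow = stats["await"] > 20.0                  # uncomfortable latency?
    busy = iops > 100 or stats["%util"] > 50.0    # is the disk being pushed?
    verdict = ("bottleneck suspect" if slow and busy
               else "slow but mostly idle" if slow else "fine")
    print(f"{dev}: await={stats['await']:.2f} ms, iops={iops:.1f}, "
          f"util={stats['%util']:.2f}% -> {verdict}")

Run against the snippet above, xvdf comes out as "slow but mostly idle", which is exactly the pattern discussed next.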
This is a classic example of poorly performing storage (compared with high-performance storage such as the Kaminario K2) that is nevertheless not a bottleneck. If an application does not demand I/O from the storage, then a slow storage device is sufficient. It is as if you own a slow car but never drive it – it is not a problem.
I am sure our reader suspected the disk latencies were the bottleneck mainly because there was a performance problem to explain. Storage is indeed the cause of many performance problems, but not in this case. I suspect the slow response time is attributable to the MySQL requests or to something else along the transaction path, not to the storage on this server. If it were a storage I/O bottleneck, we would see a high number of IOPS and/or high throughput. When examining high latency, I always ask myself: "what would we gain if we cut the latency in half?" In the case above, cutting the ~70 ms latency in half would save roughly 30 ms per I/O. With only 1 read/sec, that saves about 30 ms of read time per second. The same math for writes: 26 writes/sec times ~30 ms is roughly 0.78 seconds saved per second over the entire application (spread across many Apache processes). When the bottleneck is storage, I expect to see much larger numbers for time saved.
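To make the "what would we gain" arithmetic concrete, here it is as a few lines of Python. The IOPS and latency figures come straight from the iostat snippet above; the 50% latency improvement (rounded down to ~30 ms per I/O) is the hypothetical.

# Back-of-the-envelope: aggregate I/O wait time recovered per second if
# the data disk's latency were cut in half. IOPS come from the iostat
# snippet above; the ~30 ms saving per I/O is the rounded hypothetical.

reads_per_sec = 1.0
writes_per_sec = 26.0
saved_per_io_ms = 30.0   # roughly half of the ~70 ms await, rounded down

read_saving_ms = reads_per_sec * saved_per_io_ms             # 30 ms per second
write_saving_s = writes_per_sec * saved_per_io_ms / 1000.0   # 0.78 s per second

print(f"reads:  {read_saving_ms:.0f} ms of wait removed per second")
print(f"writes: {write_saving_s:.2f} s of wait removed per second, "
      f"spread across many concurrent Apache processes")

Under a second of aggregate wait time recovered per second, divided among many concurrent Apache processes, is nowhere near enough to explain a user-visible performance problem – which is why the storage was acquitted.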