Troubleshooting Steps For When Your Hard Disk Is Giving You Trouble

When we’re talking about server performance, one of the more difficult issues we run into is troubleshooting disk performance. While CPU load and memory usage can both be monitored quite easily, disk load can peak in ways that are hard to see and that, over time, greatly affect overall server performance. Before you go deeper into your storage system's performance, it's a good idea to look at the basics first: does the server have enough free storage space and free inodes? You can check both with the commands ‘df -h’ and ‘df -ih’ as shown below:

root@serversuit:~# df -h
Filesystem                 Size  Used Avail Use% Mounted on
rootfs                      30G  1.3G   27G   5% /
udev                        10M     0   10M   0% /dev
tmpfs                      101M  144K  101M   1% /run
/dev/disk/by-label/DOROOT   30G  1.3G   27G   5% /
tmpfs                      5.0M     0  5.0M   0% /run/lock
tmpfs                      201M     0  201M   0% /run/shm

root@serversuit:~# df -ih
Filesystem                Inodes IUsed IFree IUse% Mounted on
rootfs                      1.9M   36K  1.9M    2% /
udev                        124K   275  124K    1% /dev
tmpfs                       126K   195  126K    1% /run
/dev/disk/by-label/DOROOT   1.9M   36K  1.9M    2% /
tmpfs                       126K     1  126K    1% /run/lock
tmpfs                       126K     2  126K    1% /run/shm
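If you want to check these numbers from a script rather than by eye, the ‘df’ output is easy to parse. Here is a minimal Python sketch, assuming the six-column layout shown above; the function names and the 90% threshold are illustrative choices, not part of any tool:

```python
# Sample text copied from the `df -h` output above.
SAMPLE_DF = """\
Filesystem                 Size  Used Avail Use% Mounted on
rootfs                      30G  1.3G   27G   5% /
udev                        10M     0   10M   0% /dev
tmpfs                      101M  144K  101M   1% /run
/dev/disk/by-label/DOROOT   30G  1.3G   27G   5% /
tmpfs                      5.0M     0  5.0M   0% /run/lock
tmpfs                      201M     0  201M   0% /run/shm
"""


def usage_by_mount(df_output):
    """Map each mount point to its Use% (or IUse%) as an integer.

    Assumes the percentage is the fifth column and the mount point the
    sixth, as in the output above. Rows without a percentage are skipped.
    """
    usage = {}
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 5)
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        mount = fields[5]
        percent = int(fields[4][:-1])
        # The same mount point can appear twice (rootfs and the real device),
        # so keep the highest value seen.
        usage[mount] = max(usage.get(mount, 0), percent)
    return usage


def nearly_full(df_output, threshold=90):
    """Return the mount points at or above the (hypothetical) threshold."""
    return sorted(m for m, pct in usage_by_mount(df_output).items() if pct >= threshold)


print(usage_by_mount(SAMPLE_DF))
print(nearly_full(SAMPLE_DF))
```

The same functions work on ‘df -ih’ output, since the inode percentage sits in the same column. For live data you could feed in the text returned by running ‘df -h’ through Python's subprocess module.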

ServerSuit also has a widget you can add to track disk space usage.


So, when is it a good time to check on your server's disk performance? If you get sudden lag spikes or high load average numbers on your server, or you see a high I/O wait value (‘wa’) in your ‘top’ output, you should probably check advanced disk performance information. We can start with the ‘iostat’ command at the bash console (you’ll need the ‘sysstat’ package installed). Let’s look at the output:

root@serversuit:~# iostat -xcd
Linux 3.2.0-4-amd64 (serversuit)   04/28/2016      _x86_64_        (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.47    0.00    0.22    0.46    0.02   98.83

Device: rrqm/s  wrqm/s  r/s   w/s   rkB/s  wkB/s  avgrq-sz avgqu-sz await  svctm  %util
vda     20.70   47.44   0.91  12.01 26.18  237.91 40.89    2.35     181.49 7.76   10.03
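The device row is easier to reason about once its columns are named. Here is a minimal Python sketch that splits the row above into fields and derives a few figures from it. It assumes the column layout printed here (newer sysstat releases rename and add columns), and the function names are illustrative:

```python
# Header and device row copied from the iostat output above.
HEADER = "Device: rrqm/s  wrqm/s  r/s   w/s   rkB/s  wkB/s  avgrq-sz avgqu-sz await  svctm  %util"
ROW = "vda     20.70   47.44   0.91  12.01 26.18  237.91 40.89    2.35     181.49 7.76   10.03"


def parse_device_row(header, row):
    """Pair each column name with its numeric value for one device."""
    names = header.split()[1:]  # drop the "Device:" label
    values = row.split()
    stats = dict(zip(names, map(float, values[1:])))
    stats["device"] = values[0]
    return stats


def summarize(stats):
    """Derive a few figures that are not printed directly."""
    iops = stats["r/s"] + stats["w/s"]  # requests issued to the device per second
    kb_per_request = (stats["rkB/s"] + stats["wkB/s"]) / iops if iops else 0.0
    return {
        "iops": iops,
        "kb_per_request": kb_per_request,
        # avgrq-sz is reported in 512-byte sectors, so 1 kB equals 2 sectors.
        "sectors_per_request": kb_per_request * 2,
        "merges_per_read": stats["rrqm/s"] / stats["r/s"] if stats["r/s"] else 0.0,
        "merges_per_write": stats["wrqm/s"] / stats["w/s"] if stats["w/s"] else 0.0,
        "await_ms": stats["await"],
    }


stats = parse_device_row(HEADER, ROW)
for name, value in summarize(stats).items():
    print(f"{name}: {value:.2f}")
```

As a sanity check, the computed request size (total kB/s divided by total requests per second, converted to sectors) should land very close to the avgrq-sz value that iostat printed.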

This output gives you stats averaged since the server booted. That is no silver bullet, but it gives you a basic understanding of what's been going on. To look at current data, run ‘iostat -xcd -t 10’: the trailing 10 sets a 10-second reporting interval, and ‘-t’ stamps each report with the time. The first report still shows the since-boot averages; every report after it covers the preceding 10 seconds.

You should pay attention to at least these metrics: rrqm/s - read requests merged per second before being sent to the device; wrqm/s - write requests merged per second; r/s - read requests actually issued to the storage device per second; w/s - write requests actually issued to the storage device per second (r/s and w/s together give you your IOPS); await - average latency of all requests in milliseconds, including the time they spend waiting in the queue.

If we look at the numbers, we can draw some conclusions. Read requests were merged very effectively: about 20.7 read requests per second were merged, while only 0.91 per second actually went to the device. Write requests were either more random or could not be merged as well: 47.44 merges per second against 12.01 writes per second issued. Average throughput was low (about 26 kB/s read and 238 kB/s written), so there was probably some random IO and a low-end storage device underneath. Average latency is poor at roughly 181 ms, so it looks like we have a problem here.

You’ll need to leave ‘iostat’ running with an interval for some time to collect real-time data about your storage load, so you can see IO peaks too, not just averages. Once you have enough data, you can graph it to make the storage load peaks stand out. What this tool can’t tell you is which services are using the disk. That’s where the ‘iotop’ utility comes in: it shows processes and their IO activity in real time:

Total DISK READ:       0.00 B/s | Total DISK WRITE:      23.42 K/s
  137 be/3 root        0.00 B/s   15.62 K/s  0.00 %  0.05 % [jbd2/vda1-8]
30040 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % ssh 
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init [2]
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
 2052 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % sh /usr/bin/mysqld_safe
    5 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/u:0]
    6 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    7 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
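To rank processes by disk activity in a script, the per-process lines can be parsed in the same spirit. Here is a minimal Python sketch, assuming the column layout shown above (thread ID, priority, user, read rate, write rate, swap-in %, IO %, command); the regular expression and function names are illustrative, and other iotop versions or options may print different columns:

```python
import re

# A few lines copied from the iotop output above.
SAMPLE_IOTOP = """\
Total DISK READ:       0.00 B/s | Total DISK WRITE:      23.42 K/s
  137 be/3 root        0.00 B/s   15.62 K/s  0.00 %  0.05 % [jbd2/vda1-8]
30040 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % ssh
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init [2]
 2052 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % sh /usr/bin/mysqld_safe
"""

# Assumed layout of one process line; summary lines will not match.
LINE = re.compile(
    r"^\s*(?P<tid>\d+)\s+(?P<prio>\S+)\s+(?P<user>\S+)\s+"
    r"(?P<read>[\d.]+)\s+(?P<read_unit>[BKMG])/s\s+"
    r"(?P<write>[\d.]+)\s+(?P<write_unit>[BKMG])/s\s+"
    r"(?P<swapin>[\d.]+)\s+%\s+(?P<io>[\d.]+)\s+%\s+(?P<command>.*)$"
)
UNIT = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_iotop(text):
    """Return one dict per process line, with rates converted to bytes/s."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skips the "Total DISK READ" summary and blank lines
        rows.append({
            "tid": int(m["tid"]),
            "user": m["user"],
            "read_bps": float(m["read"]) * UNIT[m["read_unit"]],
            "write_bps": float(m["write"]) * UNIT[m["write_unit"]],
            "io_pct": float(m["io"]),
            "command": m["command"].strip(),
        })
    return rows


def busiest(text, count=3):
    """Sort by IO wait percentage, then by combined read and write rate."""
    rows = parse_iotop(text)
    rows.sort(key=lambda r: (r["io_pct"], r["read_bps"] + r["write_bps"]), reverse=True)
    return rows[:count]


for row in busiest(SAMPLE_IOTOP, count=1):
    print(row["tid"], row["command"], row["write_bps"])
```

For live input, iotop can print plain text suitable for this kind of parsing when run in batch mode, for example ‘iotop -b -o -n 3’ (batch mode, only processes doing IO, three iterations). It normally needs root privileges.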

Troubleshooting disk performance is relatively difficult for any system administrator, but hopefully this guide gave you some useful options. As mentioned before, ServerSuit has basic disk monitoring available to ease your administrative burden. We're also currently working on a feature set to fully troubleshoot disk performance, so look for that in an upcoming release!

Until next time!

May 03 2016
