If you are a sysadmin, you will sometimes face situations where disk I/O plays the villain in overall system performance (especially on database systems). There are a variety of reasons for this, ranging from disk problems to HBA driver issues, and they cannot always be predicted. Monitoring and analyzing disk performance is therefore a major part of a sysadmin's role in avoiding system performance degradation.
The primary tool used to analyse disk performance issues is iostat. sar -d provides historical performance data, and DTrace is also available on Solaris 10 servers.
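As a quick reference, the commands below show typical ways of invoking these tools; the interval, count, sa file name and the DTrace one-liner are illustrative examples rather than fixed requirements.
*********************************************************************************
# Extended per-device statistics, refreshed every 5 seconds, 10 samples
iostat -xn 5 10

# Historical disk activity recorded by the sa collector
# (the file under /var/adm/sa depends on the day of the month)
sar -d -f /var/adm/sa/sa15

# DTrace (Solaris 10): count disk I/O requests by process name
dtrace -n 'io:::start { @[execname] = count(); }'
*********************************************************************************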
iostat -xn output
device - Disk device name
r/s - Reads per second
w/s - Writes per second
kr/s - Kilobytes read per second
kw/s - Kilobytes written per second
wait - Average number of transactions waiting for service (queue length)
actv - Average number of transactions actively being serviced
svc_t - Average service time in milliseconds
%w - Percentage of time the queue is not empty
%b - Percentage of time the disk is busy
In the above output, if svc_t (service time) is consistently more than 20 ms on disks that are in use, we can consider the performance sluggish. However, with today's disks and arrays that carry large caches, it is advisable to also monitor the service time at intervals even when the disk is not busy. For example, if the reads and writes hitting the cache of a fibre-attached array increase, the service time may rise by about 3-5 ms.
Considering the %b value in the above output, if a disk shows 60% utilization continuously for a period of time, we can consider the disk saturated. Whether the application is actually impacted by this utilization can be evaluated using the service time figure from the same output.
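As a minimal sketch, the snippet below flags disks that cross the 20 ms service time or 60% busy thresholds discussed above in a single live sample. It assumes the iostat -xn column layout (r/s, w/s, kr/s, kw/s, wait, actv, wsvc_t, asvc_t, %w, %b, device); adjust the field numbers if your release prints the columns differently.
*********************************************************************************
# Flag disks whose active service time exceeds 20 ms or whose busy
# percentage exceeds 60% in a 5-second iostat -xn sample
# (the first sample reports averages since boot; the second is the live interval)
iostat -xn 5 2 | awk 'NF == 11 && $1 ~ /^[0-9]/ {
    if ($8 > 20 || $10 > 60)
        printf "%-20s asvc_t=%s ms  %%b=%s\n", $11, $8, $10
}'
*********************************************************************************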
Disk saturation
High disk saturation can be measured from the %w value in the iostat output. High saturation slows system performance because the number of queued processes increases. As a rule of thumb, %w > 5 can be considered high disk saturation. In this case, setting sd_max_throttle to 64 can be helpful (sd_max_throttle determines how many jobs can be queued on a single HBA; its default value is 256). Another cause of a high %w is SCSI device precedence: devices with low SCSI IDs have lower precedence than devices with high SCSI IDs. We also need to check whether the disk I/O pattern is random or sequential. Sequential I/O, which occurs when reading or writing large files or directories, is noticeably faster than random I/O. This behaviour can be analysed with the sar -d command: if (blks/s) / (r+w/s) works out to less than 16 KB per request the I/O is random, and if it is greater than 128 KB per request the I/O is sequential.
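The sketch below works that calculation out from a live sar -d sample. It assumes the standard Solaris sar -d columns (device, %busy, avque, r+w/s, blks/s, avwait, avserv) and that blks/s is reported in 512-byte blocks; treat both as assumptions to verify on your release.
*********************************************************************************
# Rough average I/O size per device from one live sar -d sample.
# Average KB per request = (blks/s / 2) / (r+w/s), since a block is 512 bytes.
sar -d 5 1 | awk 'NF >= 7 && $NF ~ /^[0-9.]+$/ {
    rw = $(NF-3); blks = $(NF-2)
    if (rw > 0) {
        kb = (blks / 2) / rw
        if (kb < 16)       pattern = "random"
        else if (kb > 128) pattern = "sequential"
        else               pattern = "mixed"
        printf "%-12s avg I/O size %.1f KB (%s)\n", $(NF-6), kb, pattern
    }
}'
*********************************************************************************
If the queue-depth tuning mentioned above is needed, sd_max_throttle is normally set with a line such as set sd:sd_max_throttle=64 in /etc/system, which takes effect after a reboot.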
Disk Errors
iostat -eE shows disk error details accumulated since the last reboot; the parameters below should be considered when evaluating disk errors (a quick way to scan these counters is sketched after the sample output).
*********************************************************************************
bash-3.00# iostat -eE
---- errors ---
device s/w h/w trn tot
cmdk0 0 0 0 0
sd0 0 0 0 0
nfs1 0 0 0 0
cmdk0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: VBOX HARDDISK Revision: Serial No: VB4d87fd3f-3f00 Size: 17.18GB <17179803648 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
sd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: VBOX Product: CD-ROM Revision: 1.0 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
*********************************************************************************
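As a minimal sketch, the one-liner below scans the error summary for devices with a non-zero total error count, assuming the column order shown above (device, s/w, h/w, trn, tot).
*********************************************************************************
# Report only devices that have accumulated errors since the last reboot
iostat -e | awk 'NF == 5 && $5 ~ /^[0-9]+$/ && $5 > 0 {
    printf "%s: soft=%s hard=%s transport=%s total=%s\n", $1, $2, $3, $4, $5
}'
*********************************************************************************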
Possible solutions for disk I/O problems are given below:
1. Check the file system kernel parameters to make sure that the inode caches are working properly.
2. Spread the I/O traffic across multiple disks (especially with a RAID setup or ZFS).
3. Redesign the problematic process to reduce the number of disk I/Os (for example with cachefs or an application-level cache).
4. Set a proper write throttle value. For example, with ufs_WRITES set to 1 (the default), writes to a file are suspended once the number of outstanding bytes exceeds ufs_HW and resume when it drops to ufs_LW (see the sketch after this list).
(ufs_WRITES -- if this value is non-zero, the number of bytes outstanding for writes to a file is checked. ufs_HW -- the maximum number of bytes that may be outstanding on a single file. ufs_LW -- when writes complete and the number of outstanding bytes falls below this value, all pending (sleeping) processes are woken up and resume writing.)
5. Database I/O should be done to raw disk partitions (please avoid NFS).
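For reference, the UFS write throttle mentioned in point 4 is normally tuned through /etc/system; the snippet below is a sketch with illustrative byte values, not a recommendation, and a reboot is required for /etc/system changes to take effect.
*********************************************************************************
* /etc/system -- UFS write throttle (illustrative values)
* Enable the per-file write throttle (1 is already the default)
set ufs:ufs_WRITES=1
* Suspend writers on a file once this many bytes are outstanding
set ufs:ufs_HW=16777216
* Wake suspended writers once outstanding bytes fall below this value
set ufs:ufs_LW=8388608
*********************************************************************************
The running values can be checked with mdb -k, for example: echo 'ufs_HW/D' | mdb -k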
File system performance
When considering file system performance, an important factor is file system latency, which directly impacts I/O performance. The main causes of file system latency are listed below; a quick way to observe file operation activity is sketched after the list.
1. Disk I/O wait - This can be as short as 0 in the event of a read cache hit. For synchronous I/O it can be influenced by adjusting the cache parameters.
2. File system cache misses - Misses in the block, buffer, metadata and name lookup caches heavily increase file system latency.
3. File system locking - Most file systems use locking, and this has a major impact with large files such as database files.
4. Metadata updating - Creating, deleting and extending files causes extra latency for file system metadata updates.
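On Solaris 10, fsstat gives a convenient per-file-system view of the operation mix behind these latency sources; the sketch below assumes UFS file systems and illustrative mount points.
*********************************************************************************
# Per-second summary of file operation counts for all UFS file systems
fsstat ufs 1 5

# The same information broken down per mount point
fsstat / /var 1 5
*********************************************************************************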
As mentioned earlier, file system caches play an important role in I/O performance. The major file system caches are listed below; a sketch for checking the DNLC hit rate follows the list.
1. DNLC (Directory Name Lookup Cache) - Caches vnode-to-directory-path lookup information, which avoids performing a directory lookup on every access.
2. Inode Cache - Stores file metadata (such as size and access time).
3. Rnode Cache - Held on NFS clients, storing information about NFS mount points.
4. Buffer Cache - Provides the link between physical metadata (e.g. block placement on the file system) and the logical data held in the other caches.
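A rough way to check whether the DNLC is effective is to compare total name lookups with cache hits; the sketch below uses the vmstat -s summary counters and the dnlcstats kstat, which is the name normally used on Solaris (verify it on your release).
*********************************************************************************
# Overall name lookup statistics since boot (includes the DNLC hit percentage)
vmstat -s | grep 'name lookups'

# Raw DNLC counters from the kernel statistics facility
kstat -n dnlcstats
*********************************************************************************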
Physical disk layout
The on-disk layout of a UFS-style file system includes the following:
1. Boot block
2. Super block
3. Inode list (the number of inodes can be set with the mkfs/newfs command; see the sketch after this list)
4. Data blocks
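For example, on UFS the inode density is chosen when the file system is created via the nbpi (bytes per inode) value; the sketch below is illustrative only, the slice name is a placeholder, and newfs is destructive.
*********************************************************************************
# Create a UFS file system with one inode per 16 KB of data space
# (c0t0d0s4 is a placeholder slice -- destructive, do not run blindly)
newfs -i 16384 /dev/rdsk/c0t0d0s4

# Check inode (file) usage on an existing UFS mount
df -o i /export
*********************************************************************************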
Inode Layout
Each inode contains the following information (several of these fields can be seen with ls, as sketched after the list):
1. File type, permissions, etc.
2. Number of hard links to the file
3. UID
4. GID
5. Size in bytes
6. Array of block addresses
7. Generation number (incremented every time the inode is reused)
8. Access time
9. Modification time
10. Change time
11. Number of sectors
12. Shadow inode location (used with ACLs)
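Several of these fields are visible from standard commands; the listing below is a sketch (the path is a placeholder) showing the inode number, type/permissions, link count, UID, GID, size and the three timestamps.
*********************************************************************************
# Inode number, type/permissions, link count, owner (UID), group (GID),
# size in bytes and modification time
ls -li /etc/passwd

# Show access time or change (inode modification) time instead
ls -liu /etc/passwd     # access time
ls -lic /etc/passwd     # change time
*********************************************************************************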
In a nutshell, overall disk I/O performance depends on many factors, including application tuning, the physical disk setup and the sizes of the various caches, and as sysadmins we need to consider tuning all of them to improve I/O performance.