
Saturday, March 19, 2016

VCS (Veritas Cluster Server)

Veritas Cluster Server (VCS) is a high availability clustering solution for open systems platforms, including UNIX and Microsoft Windows. Depending on the requirement it can be used either as a high availability cluster (HAC), which improves application availability by failing applications over or switching them over within a group of systems, or as a high performance cluster (HPC), which improves application performance by running applications on multiple systems simultaneously.

VCS consists of multiple systems connected to shared storage in different formations. VCS monitors and controls the applications (resources) running on the systems and restarts them, switching them to different nodes where needed, to ensure minimum or zero downtime.

Below are the main VCS terms that will be used throughout this blog.
Resources - Resources are the hardware and software entities managed by VCS, such as IP addresses, disks, scripts and network interfaces. Controlling a resource means starting, stopping and monitoring it as required. Resources are brought online and taken offline in a specific order defined by resource dependencies.

  Resources have 3 categories:
on-off - These resources are brought online or taken offline within the cluster as required. Example: scripts.
on-only - These resources are always on and VCS cannot stop them. Example: NFS file systems.
persistent - Persistent resources cannot be brought online or taken offline by VCS. Example: a network interface.

Service groups - Service groups are logical groups of resources and their dependencies, used for easier management. Each node in the VCS cluster might host multiple service groups, which are monitored separately. If any resource in a service group fails, VCS restarts the service group on the same node or moves it to another node.

Service groups are of 3 types:

Failover Service group - runs on one node at a time

Parallel Service group - runs on multiple nodes at a time

Hybrid Service group - a combination of failover and parallel: within a system zone it behaves as a failover group, and across system zones it behaves as a parallel group.

There is also one more service group, called the ClusterService group, which contains the main cluster resources such as the Cluster Manager, notification, and the wide-area connector (WAC). This service group can switch to any node and it is the first service group to be brought online.
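
As a quick way to see which service groups exist and where they are online, the commands below can be used once the cluster is up (a minimal sketch using the sample group unixchipssg that is configured later in this post):

# hagrp -list                  (list all configured service groups)
# hagrp -state unixchipssg     (state of the group on each node)
# hagrp -dep unixchipssg       (service group dependencies, if any)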

High Availability Daemon (HAD)

HAD is the main daemon running on each VCS system and is responsible for running the cluster according to the cluster configuration. It distributes information when nodes join or leave the cluster, responds to operator input, and takes corrective action when something fails. This daemon is generally known as the VCS engine. HAD uses agents to determine the status of resources, and the daemon running on one system replicates the details to all other nodes in the cluster.
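
As a quick sanity check, HAD can be started, stopped and queried with the commands below (a minimal sketch; the binaries live under the default /opt/VRTSvcs/bin location):

# hastart            (start HAD and its companion hashadow process on this node)
# hastatus -sum      (summary of system and service group states)
# hasys -state       (state of each cluster node as seen by HAD)
# hastop -local      (stop HAD on this node, taking its service groups offline)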

Low Latency Transport (LLT)

LLT provides fast, kernel-to-kernel communication between the cluster nodes. LLT has two functions:

Traffic distribution - LLT load balances cluster communication traffic across the private links and supports a maximum of eight links. If one link goes down, traffic automatically shifts to the remaining links.

Heartbeat - LLT is responsible for sending and receiving heartbeats between the nodes so that system health can be tracked. These heartbeats are analysed by GAB to determine the status of each system.

Main configuration files - /etc/llthosts & /etc/llttab

1. /etc/llthosts - This file maps each host name to its LLT node ID

ex: bash-3.00# cat /etc/llthosts
0 solaris-test2
1 solaris-test3


2. /etc/llttab - This file defines the cluster ID and the LLT network links used by the local node

ex:
bash-3.00# cat /etc/llttab
set-node solaris-test2
set-cluster 100
link e1000g0 /dev/e1000g:0 - ether - -
link e1000g1 /dev/e1000g:1 - ether - -


Group Membership Services/Atomic Broadcast (GAB)

GAB provides a global message order to maintain a synchronised system state, and it also monitors the disk communications required by the VCS heartbeat utility. The main functions of GAB are to maintain cluster membership (by receiving the status of the nodes through LLT) and to provide reliable cluster communication.

Main configuration file /etc/gabtab

This file contains the information needed by the GAB driver and is processed by the gabconfig utility.
ex:
bash-3.00# cat /etc/gabtab
/sbin/gabconfig -c -n2


where 2 is the number of nodes in the cluster.

Configuring disk heartbeat 

This is similar to the qdisk configuration in a RHEL cluster. We require at least 64 KB on a disk for this configuration.

The commands below initialize the disk regions (-s: start block, -S: signature):

#/sbin/gabdiskconf -i /dev/dsk/c1t2d0s2 -s 16 -S 1123

#/sbin/gabdiskconf -i /dev/dsk/c1t2d0s2 -s 144 -S 1124

Adding the disk heartbeat configuration

#/sbin/gabdiskhb -a /dev/dsk/c1t2d0s2 -s 16 -p a -s 1123

#/sbin/gabdiskhb -a /dev/dsk/c1t2d0s2 -s 144 -p h -s 1124

Configure the disks to use the driver and initialize GAB

#/sbin/gabconfig -c -n2

LLT commands

The lltstat -n command shows the LLT link status on each node.

in server1-
******************************************************
bash-3.00# /sbin/lltstat -n
LLT node information:
    Node                 State    Links
   * 0 solaris-test2     OPEN        2
     1 solaris-test3     OPEN        2

**************************************************************
in server2
*****************************************************
bash-3.00# /sbin/lltstat -n
LLT node information:
    Node                 State    Links
     0 solaris-test2     OPEN        2
   * 1 solaris-test3     OPEN        2

***********************************************************
lltstat -nvv | more shows the same information in verbose format
***************************************************
bash-3.00# /sbin/lltstat -nvv
LLT node information:
    Node                 State    Link  Status  Address
   * 0 solaris-test2     OPEN
                                  e1000g0   UP      08:00:27:F2:FD:1B
                                  e1000g1   UP      08:00:27:A4:52:AF
     1 solaris-test3     OPEN
                                  e1000g0   UP      08:00:27:19:48:CB
                                  e1000g1   UP      08:00:27:76:37:96
     2                   CONNWAIT
                                  e1000g0   DOWN
                                  e1000g1   DOWN
     3                   CONNWAIT
                                  e1000g0   DOWN
                                  e1000g1   DOWN

***************************************************
To start and stop LLT, use the commands below
***************************************************
#lltconfig -c

#lltconfig -U  (keep in mind that GAB has to be stopped before running this command)

GAB commands 

1. Define the group membership and atomic broadcast (GAB) configuration

# more /etc/gabtab
/sbin/gabconfig -c -n2

2. start GAB

# sh /etc/gabtab
starting GAB done.

3. Display the GAB details 

bash-3.00# /sbin/gabconfig -a
GAB Port Memberships
===============================================================
Port a gen    f2503 membership 01
Port h gen    f2505 membership 01


VCS configuration 

Configuring VCS means telling the VCS engine about the name and definition of the cluster, the service groups, and the resource dependencies. The VCS configuration files are located in /etc/VRTSvcs/conf/config and are named main.cf and types.cf. The main.cf file defines the entire cluster, while types.cf defines the resource types. The first system to come online reads the configuration file and keeps the entire configuration in memory; systems that come online later derive the configuration from it.

sample main.cf file

********************************************************************************
bash-3.00# cat /etc/VRTSvcs/conf/config/main.cf
include "OracleASMTypes.cf"
include "types.cf"
include "Db2udbTypes.cf"
include "OracleTypes.cf"
include "SybaseTypes.cf"

cluster unixchips (
        UserNames = { admin = dqrJqlQnrMrrPzrLqo }
        Administrators = { admin }
        )

system solaris-test2 (
        )

system solaris-test3 (
        )

group unixchipssg (
        SystemList = { solaris-test2 = 0, solaris-test3 = 1 }
        AutoStartList = { solaris-test2 }
        )

        DiskGroup unixchipsdg (
                Critical = 0
                DiskGroup = unixchipsdg
                )

        Mount unixchipsmount (
                Critical = 0
                MountPoint = "/vcvol1/"
                BlockDevice = "/dev/vx/dsk/unixchipsdg/vcvol1"
                FSType = vxfs
                FsckOpt = "-y"
                )

        Volume unixchipsvol (
                Critical = 0
                Volume = vcvol1
                DiskGroup = unixchipsdg
                )

        unixchipsmount requires unixchipsvol
        unixchipsvol requires unixchipsdg


        // resource dependency tree
        //
        //      group unixchipssg
        //      {
        //      Mount unixchipsmount
        //          {
        //          Volume unixchipsvol
        //              {
        //              DiskGroup unixchipsdg
        //              }
        //          }
        //      }

*********************************************************************************
types.cf file
****************

bash-3.00# cat /etc/VRTSvcs/conf/config/types.cf
type AlternateIO (
        static str AgentFile = "bin/Script51Agent"
        static str ArgList[] = { StorageSG, NetworkSG }
        str StorageSG{}
        str NetworkSG{}
)

type Apache (
        static boolean IntentionalOffline = 0
        static keylist SupportedActions = { "checkconffile.vfd" }
        static str ArgList[] = { ResLogLevel, State, IState, httpdDir, SharedObjDir, EnvFile, PidFile, HostName, Port, User, SecondLevelMonitor, SecondLevelTimeout, ConfigFile, EnableSSL, DirectiveAfter, DirectiveBefore }
        static int ContainerOpts{} = { RunInContainer=1, PassCInfo=0 }
        str ResLogLevel = INFO
        str httpdDir
        str SharedObjDir
        str EnvFile
        str PidFile
        str HostName
        int Port = 80
        str User
        boolean SecondLevelMonitor = 0
        int SecondLevelTimeout = 30
        str ConfigFile
        boolean EnableSSL = 0
        str DirectiveAfter{}
        str DirectiveBefore{}
)
................. output is omitted 
*********************************************************************************

The setup



As per the architecture diagram, we have two Solaris servers and OpenFiler storage for the sample setup. I am not covering the installation part here; it will be explained in a separate post soon. The sample service is a VxFS mount point configured as a resource in the VCS setup.

1. Once the HA installation is completed, both nodes show as RUNNING, as given below:

bash-3.00# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  solaris-test2        RUNNING              0
A  solaris-test3        RUNNING              0

 Now we need to create the disk group and a VxFS file system on it.


1. First check the available disks on the server using the command below (the output is identical on both nodes).

bash-3.00# echo |format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
       1. c0d1 <VBOX HAR-34e30776-506a21a-0001-1.01GB>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@1,0
       2. c1d1 <VBOX HAR-b00f936a-669c3f0-0001-1.01GB>
          /pci@0,0/pci-ide@1,1/ide@1/cmdk@1,0
       3. c2t0d0 <VBOX-HARDDISK-1.0-1.17GB>
          /pci@0,0/pci1000,8000@14/sd@0,0
       4. c2t1d0 <VBOX-HARDDISK-1.0-1.06GB>
          /pci@0,0/pci1000,8000@14/sd@1,0
       5. c3t3d0 <OPNFILER-VIRTUAL-DISK-0 cyl 44598 alt 2 hd 16 sec 9>
          /iscsi/disk@0000iqn.2006-01.com.openfiler%3Atsn.7d10272607c20001,0
       6. c3t4d0 <OPNFILER-VIRTUAL-DISK-0 cyl 63258 alt 2 hd 16 sec 9>
          /iscsi/disk@0000iqn.2006-01.com.openfiler%3Atsn.0a610808dede0001,0


bash-3.00# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
c0d0s2       auto:none       -            -            online invalid
c0d1         auto:ZFS        -            -            ZFS
c1t1d0s2     auto            -            -            error
c2t0d0       auto:ZFS        -            -            ZFS
c2t1d0       auto:ZFS        -            -            ZFS
disk_0       auto:cdsdisk    iscsi1       ----------------  online invalid 
disk_1       auto:cdsdisk    iscsi2       ----------------  online invalid 

In the above output we can see the two iSCSI disks from the storage, shown as online but invalid. We will configure these disks for the disk group.

2. Bring the OpenFiler disks under VxVM control

bash-3.00 # /etc/vx/bin/vxdisksetup -i disk_0
bash-3.00# /etc/vx/bin/vxdisksetup -i disk_1

Now we can see that both disks are in online status:

3. bash-3.00# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
c0d0s2       auto:none       -            -            online invalid
c0d1         auto:ZFS        -            -            ZFS
c1t1d0s2     auto            -            -            error
c2t0d0       auto:ZFS        -            -            ZFS
c2t1d0       auto:ZFS        -            -            ZFS
disk_0       auto:cdsdisk    iscsi1       ----------------  online 
disk_1       auto:cdsdisk    iscsi2       ----------------  online

4. Now we create the disk group with these online disks

bash-3.00#vxdg init unixchipsdg iscsi1=disk_0 iscsi2=disk_1

We can see the disk group (unixchipsdg) has been created with these iSCSI disks:

bash-3.00# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
c0d0s2       auto:none       -            -            online invalid
c0d1         auto:ZFS        -            -            ZFS
c1t1d0s2     auto            -            -            error
c2t0d0       auto:ZFS        -            -            ZFS
c2t1d0       auto:ZFS        -            -            ZFS
disk_0       auto:cdsdisk    iscsi1              unixchipsdg  online 
disk_1       auto:cdsdisk    iscsi2              unixchipsdg   online


5. Next, create the volume in this disk group

bash-3.00#vxassist -g unixchipsdg make vcvol1 2g

6. Format the volume to create the file system

bash-3.00#mkfs -F vxfs /dev/vx/rdsk/unixchipsdg/vcvol1

version 7 layout 6291456 sectors, 
3145728 blocks of size 1024, log size 16384 blocks 
largefiles supported

7. Now create the mount point and mount the file system to confirm it works

bash-3.00#mkdir /vcvol1

bash-3.00#mount -F vxfs /dev/vx/dsk/unixchipsdg/vcvol1 /vcvol1

bash-3.00#df -h /vcvol1
Filesystem                       size   used  avail capacity  Mounted on
/dev/vx/dsk/unixchipsdg/vcvol1   2.0G   18M   1.8G     1%     /vcvol1

8. The next step is to put the disk group, volume and mount point under VCS control. The order of creation is service group, then disk group, volume, and finally the mount resource.

Creating the servicegroup
**********************
# haconf -makerw  (this makes the configuration file, i.e. main.cf, writable)
#hagrp -add unixchipssg
# hagrp -modify unixchipssg SystemList solaris-test2 0 solaris-test3 1
# hagrp -modify unixchipssg AutoStartList solaris-test2
# hagrp -display unixchipssg
#Group       Attribute             System        Value
unixchipssg  AdministratorGroups   global
unixchipssg  Administrators        global
unixchipssg  Authority             global        0
unixchipssg  AutoFailOver          global        1
unixchipssg  AutoRestart           global        1
unixchipssg  AutoStart             global        1
unixchipssg  AutoStartIfPartial    global        1
unixchipssg  AutoStartList         global        solaris-test2
unixchipssg  AutoStartPolicy       global        Order
unixchipssg  ClusterFailOverPolicy global        Manual
unixchipssg  ClusterList           global
unixchipssg  ContainerInfo         global
unixchipssg  DisableFaultMessages  global        0
unixchipssg  Evacuate              global        1
unixchipssg  ExtMonApp             global
unixchipssg  ExtMonArgs            global
unixchipssg  FailOverPolicy        global        Priority
unixchipssg  FaultPropagation      global        1
unixchipssg  Frozen                global        0
unixchipssg  GroupOwner            global

.........output will be omitted 

#haconf -dump ( to update the configuration in main.cf file)
#view /etc/VRTSvcs/conf/config/main.cf

# cat /etc/VRTSvcs/conf/config/main.cf
include "OracleASMTypes.cf"
include "types.cf"
include "Db2udbTypes.cf"
include "OracleTypes.cf"
include "SybaseTypes.cf"

cluster unixchips (
        UserNames = { admin = dqrJqlQnrMrrPzrLqo }
        Administrators = { admin }
        )

system solaris-test2 (
        )

system solaris-test3 (
        )

group unixchipssg (
        SystemList = { solaris-test2 = 0, solaris-test3 = 1 }
        AutoStartList = { solaris-test2 }
        )

Adding the Diskgroup
********************
# hares -add unixchipsdg DiskGroup unixchipssg
# hares -modify unixchipsdg Critical 0
# hares -modify unixchipsdg DiskGroup unixchipsdg
# hares -modify unixchipsdg  Enabled 1
# hares -online unixchipsdg -sys solaris-test2
# hares -state unixchipsdg
#Resource    Attribute             System        Value
unixchipsdg  State                 solaris-test2 ONLINE
unixchipsdg  State                 solaris-test3 OFFLINE

#haconf -dump

Adding the volume resource
***********************
# hares -add unixchipsvol Volume unixchipssg
# hares -modify unixchipsvol Critical 0
# hares -modify unixchipsvol Volume vcvol1  (this is the volume we created earlier)
# hares -modify unixchipsvol DiskGroup unixchipsdg
# hares -modify unixchipsvol Enabled 1
# hares -display unixchipsvol (display the volume status with attributes) 
#Resource    Attribute             System        Value
unixchipsvol Group                 global        unixchipssg
unixchipsvol Type                  global        Volume
unixchipsvol AutoStart             global        1
unixchipsvol Critical              global        0
unixchipsvol Enabled               global        1
unixchipsvol LastOnline            global        solaris-test3
unixchipsvol MonitorOnly           global        0
unixchipsvol ResourceOwner         global
unixchipsvol TriggerEvent          global        0
unixchipsvol ArgListValues         solaris-test2 Volume 1       vcvol1  DiskGroup       1       unixchipsdg
unixchipsvol ArgListValues         solaris-test3 Volume 1       vcvol1  DiskGroup       1       unixchipsdg
unixchipsvol ConfidenceLevel       solaris-test2 0
unixchipsvol ConfidenceLevel       solaris-test3 100
unixchipsvol ConfidenceMsg         solaris-test2
unixchipsvol ConfidenceMsg         solaris-test3
unixchipsvol Flags                 solaris-test2
unixchipsvol Flags                 solaris-test3
unixchipsvol IState                solaris-test2 not waiting
unixchipsvol IState                solaris-test3 not waiting
unixchipsvol MonitorMethod         solaris-test2 Traditional
unixchipsvol MonitorMethod         solaris-test3 Traditional

................................out put is omitted 

# hares -online unixchipsvol -sys solaris-test2
# hares -state unixchipsvol
#Resource    Attribute             System        Value
unixchipsvol State                 solaris-test2 ONLINE
unixchipsvol State                 solaris-test3 OFFLINE
#haconf -dump 

Adding the mount point resource
*******************************
# hares -add unixchipsmount Mount unixchipssg
# hares -modify unixchipsmount Critical 0
# hares -modify unixchipsmount MountPoint /vcvol1
# hares -modify unixchipsmount BlockDevice /dev/vx/dsk/unixchipsdg/vcvol1
# hares -modify unixchipsmount FSType vxfs
# hares -modify unixchipsmount FsckOpt %-y
# hares -modify unixchipsmount Enabled 1
# hares -display unixchipsmount
#Resource      Attribute             System        Value
unixchipsmount Group                 global        unixchipssg
unixchipsmount Type                  global        Mount
unixchipsmount AutoStart             global        1
unixchipsmount Critical              global        0
unixchipsmount Enabled               global        1
unixchipsmount LastOnline            global        solaris-test2
unixchipsmount MonitorOnly           global        0
unixchipsmount ResourceOwner         global
unixchipsmount TriggerEvent          global        0
unixchipsmount ArgListValues         solaris-test2 MountPoint   1       /vcvol1/        BlockDevice     1       /dev/vx/dsk/unixchipsdg/vcvol1  FSType  1       vxfs    MountOpt        1       ""      FsckOpt 1       -y      SnapUmount      1       0       CkptUmount      1       1       OptCheck        1       0       CreateMntPt     1       0       MntPtPermission 1       ""      MntPtOwner      1       ""      MntPtGroup      1       ""      AccessPermissionChk     1       0       RecursiveMnt    1       0       VxFSMountLock   1       1
unixchipsmount ArgListValues         solaris-test3 MountPoint   1       /vcvol1/        BlockDevice     1       /dev/vx/dsk/unixchipsdg/vcvol1  FSType  1       vxfs    MountOpt        1       ""      FsckOpt 1       -y      SnapUmount      1       0       CkptUmount      1       1       OptCheck        1       0       CreateMntPt     1       0       MntPtPermission 1       ""      MntPtOwner      1       ""      MntPtGroup      1       ""      AccessPermissionChk     1       0       RecursiveMnt    1       0       VxFSMountLock   1       1
...............................................output will be omitted

# hares -online unixchipsmount -sys solaris-test2
# hares -state unixchipsmount
#Resource      Attribute             System        Value
unixchipsmount State                 solaris-test2 ONLINE
unixchipsmount State                 solaris-test3 OFFLINE

Linking the resources (setting the resource dependencies)
********************************
# hares -link unixchipsvol unixchipsdg
# hares -link unixchipsmount  unixchipsvol


9. Testing the resources 

checking the status

# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  solaris-test2        RUNNING              0
A  solaris-test3        RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  unixchipssg     solaris-test2        Y          N               ONLINE
B  unixchipssg     solaris-test3        Y          N               OFFLINE

checking the service status ( in this case mount point)

# df -h |grep vcvol1
/dev/vx/dsk/unixchipsdg/vcvol1   2.0G    19M   1.9G     1%    /vcvol1

switching the service to another node

# hagrp -switch unixchipssg -to solaris-test3


# hastatus -sum

-- SYSTEM STATE
-- System               State                Frozen

A  solaris-test2        RUNNING              0
A  solaris-test3        RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  unixchipssg     solaris-test2        Y          N               OFFLINE
B  unixchipssg     solaris-test3        Y          N               ONLINE


(Note: the default location of the VCS commands is /opt/VRTSvcs/bin/; to avoid typing the full path on every execution, export the path in /etc/profile.)

PATH=$PATH:/usr/sbin:/opt/VRTS/bin
export PATH

Thank you for reading the article. Please feel free to post your comments and suggestions.

Monday, February 1, 2016

Solaris ZFS

ZFS is a combined file system and logical volume manager designed by Sun Microsystems. Its features include high scalability, strong data integrity, drive pooling and support for multiple
RAID levels. ZFS uses the concept of storage pools to manage physical storage: instead of a separate volume manager, ZFS aggregates devices into a storage pool. The storage pool describes the physical characteristics of the storage and acts as an arbitrary data store from which file systems can be created. You also don't need to predefine the size of a file system; file systems inside a ZFS pool grow automatically within the disk space allocated to the storage pool.
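
As a small illustration of that point, file systems are created inside a pool without giving them a size, and an upper limit can be applied later with a quota if needed (the pool name unixpool and the file system names below are only examples):

bash-3.00# zfs create unixpool/data             (new file system, no size specified)
bash-3.00# zfs create unixpool/logs
bash-3.00# zfs set quota=500m unixpool/logs     (optional upper limit for this file system)
bash-3.00# zfs list -r unixpool                 (both file systems share the pool's free space)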

High scalability 

ZFS is a 128-bit file system capable of storing zettabytes of data (1 zettabyte = 1 billion terabytes); no matter how many hard drives you have, ZFS is able to manage them.

Maximum Integrity 

All data inside ZFS is stored with a checksum, which ensures its integrity. You can be confident that your data will not suffer silent corruption.

Drive pooling

ZFS storage behaves a bit like system RAM: whenever you need more space you just add a new disk and it becomes part of the pool. There is no headache of formatting, initialising or partitioning.
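
For example, growing an existing pool with one more disk is a single command (the disk name c2t2d0 is only an illustration, and zpool add is permanent, so double-check the device before running it):

bash-3.00# zpool add unixpool c2t2d0      (the new capacity is usable immediately)
bash-3.00# zpool list unixpool            (SIZE now reflects the added disk)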

Capability of different RAID levels 

We can configure multiple RAID levels using ZFS, and performance-wise it is on par with hardware RAID.

Configuration Part

                                                                        Striped pool

In this case data is striped across multiple disks and there is no redundancy. Data access speed is high in this configuration.

1. In our server we have below HDD's attached 

bash-3.00# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
       1. c0d1 <VBOX HAR-34e30776-506a21a-0001-1.01GB>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@1,0
       2. c1d1 <DEFAULT cyl 513 alt 2 hd 128 sec 32>
          /pci@0,0/pci-ide@1,1/ide@1/cmdk@1,0

2. Create the striped pool

bash-3.00# zpool create unixpool c0d1 c1d1

3. Check the pool status

bash-3.00# zpool status unixpool
  pool: unixpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        unixpool    ONLINE       0     0     0
          c0d1         ONLINE       0     0     0
          c1d1         ONLINE       0     0     0

errors: No known data errors

bash-3.00# zpool list unixpool
NAME       SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
unixpool  1.98G    78K  1.98G     0%  ONLINE  -
bash-3.00# zfs list unixpool
NAME       USED  AVAIL  REFER  MOUNTPOINT
unixpool  73.5K  1.95G    21K  /unixpool

                                                                  Mirrored Pool

As the name suggests, this is a mirrored configuration, so the data has a redundant copy. Read speed is high in this case, but write speed is lower.

1. Creating the mirrored pool

bash-3.00# zpool create unixpool mirror c0d1 c1d1

2. Verifying the pool status 

bash-3.00# zpool status unixpool
  pool: unixpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        unixpool     ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
             c0d1       ONLINE       0     0     0
            c1d1        ONLINE       0     0     0

errors: No known data errors

bash-3.00# zpool list unixpool
NAME       SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
unixpool  1016M   108K  1016M     0%  ONLINE  -

bash-3.00# zfs list unixpool
NAME       USED  AVAIL  REFER  MOUNTPOINT
unixpool  73.5K   984M    21K  /unixpool

Mirroring multiple disks 
*****************
1. Creating the mirrored pool

bash-3.00# zpool create unixpool2m mirror c0d1 c1d1 mirror c2t0d0 c2t1d0

2. verifying the pool status 

bash-3.00# zpool status unixpool2m
  pool: unixpool2m
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        unixpool2m  ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            c0d1          ONLINE       0     0     0
            c1d1          ONLINE       0     0     0
          mirror-1      ONLINE       0     0     0
            c2t0d0       ONLINE       0     0     0
            c2t1d0       ONLINE       0     0     0

errors: No known data errors

bash-3.00# zpool list unixpool2m
NAME         SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
unixpool2m  2.04G    81K  2.04G     0%  ONLINE  -

bash-3.00# zfs list unixpool2m
NAME         USED  AVAIL  REFER  MOUNTPOINT
unixpool2m  76.5K  2.01G    21K  /unixpool2m 


                                                           Raid 5 (RaidZ) pool

This needs a minimum of 3 disks and can sustain a single disk failure. If the disks are of different sizes we need to use the -f option while creating the pool.

1. Creating the raidz

bash-3.00# zpool create -f  unixpoolraidz raidz c0d1 c1d1 c2t0d0

2. Checking the status

  bash-3.00# zpool status unixpoolraidz
  pool: unixpoolraidz
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        unixpoolraidz  ONLINE       0     0     0
          raidz1-0         ONLINE       0     0     0
            c0d1             ONLINE       0     0     0
            c1d1             ONLINE       0     0     0
            c2t0d0          ONLINE       0     0     0

errors: No known data errors

bash-3.00# zpool list unixpoolraidz
NAME            SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
unixpoolraidz  2.98G   150K  2.98G     0%  ONLINE  -

bash-3.00# zfs list unixpoolraidz
NAME            USED  AVAIL  REFER  MOUNTPOINT
unixpoolraidz  93.9K  1.96G  28.0K  /unixpoolraidz


                                                     Raid 6 (RaidZ2)

This needs a minimum of 4 disks and can sustain 2 disk failures.

1. Creating the raidz2

bash-3.00# zpool create -f unixpoolraidz2 raidz2 c0d1 c1d1 c2t0d0 c2t1d0

2. Checking the status 

bash-3.00# zpool status unixpoolraidz2
  pool: unixpoolraidz2
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        unixpoolraidz2  ONLINE       0     0     0
          raidz2-0           ONLINE       0     0     0
            c0d1              ONLINE       0     0     0
            c1d1              ONLINE       0     0     0
            c2t0d0           ONLINE       0     0     0
            c2t1d0           ONLINE       0     0     0

errors: No known data errors

bash-3.00# zpool list unixpoolraidz2
NAME             SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
unixpoolraidz2  3.97G   338K  3.97G     0%  ONLINE  -

bash-3.00# zfs list unixpoolraidz2
NAME             USED  AVAIL  REFER  MOUNTPOINT
unixpoolraidz2   101K  1.95G  31.4K  /unixpoolraidz2


                                                         Destroying a zpool 

We can destroy a zpool even while it is in a mounted state.

bash-3.00# zpool destroy unixpoolraidz2

bash-3.00# zpool list unixpoolraidz2
cannot open 'unixpoolraidz2': no such pool

                                                   

                                         Importing and exporting the pool


As a UNIX administrator you may face storage migration situations. For that case we have the option of migrating a storage pool from one system to another.
1. We have a pool called unixpool now

bash-3.00# zpool status unixpool
  pool: unixpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        unixpool    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0

errors: No known data errors
2. Exporting the unixpool

bash-3.00# zpool export unixpool
3. Checking the status 
bash-3.00# zpool status unixpool
cannot open 'unixpool': no such pool

Once the pool is exported it is no longer available on this system. Now we need to import the pool on the new system.

4. Check which pools are available for import using the zpool import command

 bash-3.00# zpool import
  pool: unixpool
    id: 13667943168172491796
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        unixpool    ONLINE
          mirror-0  ONLINE
            c0d1    ONLINE
            c1d1    ONLINE
          mirror-1  ONLINE
            c2t0d0  ONLINE
            c2t1d0  ONLINE

5. Import the pool with: zpool import unixpool

bash-3.00# zpool status unixpool
  pool: unixpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        unixpool    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0

errors: No known data errors

We have successfully imported the unixpool now.

             Zpool scrub option for integrity check and repair

ZFS has an option for pool integrity checking and repair, similar to fsck in a conventional UNIX file system; zpool scrub is used to achieve this. In the output below you can see the scrub completion status and timing.

bash-3.00# zpool scrub unixpool
bash-3.00# zpool status unixpool
  pool: unixpool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Mon Feb  1 18:30:31 2016
config:

        NAME        STATE     READ WRITE CKSUM
        unixpool    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0

errors: No known data errors



Thank you. Please post your comments and suggestions.

Tuesday, January 26, 2016

Performance analysis in unix - Disk I/O

If you are a sysadmin, you will sometimes face situations where disk I/O plays the villain in overall system performance (especially on DB systems). There is a variety of reasons for this, from disk issues to HBA driver issues, which we cannot always predict. Monitoring and analysing disk performance is therefore a major part of a sysadmin's role in avoiding any degradation of system performance.

The primary tool used to analyse disk performance issues is iostat; sar -d provides historical performance data, and DTrace is available on Solaris 10 servers for deeper analysis.

iostat -xn output
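
The extended statistics can be produced with an invocation along the lines of the one below; the interval and count are arbitrary (report every 5 seconds, 3 times, and ignore the first report since it is the average since boot):

bash-3.00# iostat -xn 5 3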

device- disk details of the server
r/s - read per second
w/s - write per second
kr/s - kbytes read per second
kw/s- kbytes written per second
wait - Average number of transactions that are waiting for the service (queue length)
actv - Average number of transactions that are actively being served .
svc_t - Average service time in milliseconds
%w - Percentage of time that queue is not empty
%b - Percentage of time that the disk is busy.

In the above output, if the svc_t (service time) value is more than 20 ms on disks that are in use, we can consider the performance sluggish.
Nowadays, for disks with large amounts of cache, it is advisable to monitor the service time even when the disk is not busy; for example,
if the reads and writes served from a fibre-attached disk array's cache increase, the service time can rise by 3-5 ms.

Considering the %b value in the above output, if a disk shows 60% utilization continuously for a period of time, we can consider the disk saturated. Whether the application is really impacted by this disk utilization can be evaluated using the service time from the same output.

Disk saturation 

High disk saturation can be measured from the %w value in the iostat output. High disk saturation slows the system down as the number of queued processes increases. As a rule of thumb, %w > 5 can be considered high disk saturation. In this case, setting sd_max_throttle to 64 can help (sd_max_throttle determines how many commands can be queued to a single device, and its default value is 256). Another reason for a high %w is SCSI device precedence: low SCSI ID devices have lower precedence than high SCSI ID devices.
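
If lowering sd_max_throttle is appropriate, it is normally set in /etc/system and takes effect after a reboot. A sketch of the entry is below (the value 64 is only an example; validate it against your storage vendor's recommendation first):

* /etc/system entry - limit the commands queued to a single device to 64
set sd:sd_max_throttle=64
* for fibre channel disks handled by the ssd driver the equivalent tunable is:
* set ssd:ssd_max_throttle=64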

We also need to check whether the disk I/O behaviour is random or sequential. Sequential I/O occurs while reading or writing large files or directories and is faster than random I/O. This behaviour can be analysed using the sar -d command: for example, if (blks/s) / (r+w/s) corresponds to less than about 16 KB per transfer the I/O is random, and if it is more than about 128 KB per transfer the behaviour is sequential (a rough sketch of this calculation follows).
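
The same average-transfer-size idea can be approximated from iostat -xn output, whose data lines have a fixed layout (r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device). The sketch below prints the average KB moved per I/O for each active device, which can then be compared against the rough 16 KB / 128 KB thresholds mentioned above (assumptions: 11-column output, non-zero I/O rate):

bash-3.00# iostat -xn 5 2 | awk 'NF == 11 && $1 + $2 > 0 { printf "%-20s %8.1f KB per I/O\n", $11, ($3 + $4) / ($1 + $2) }'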



Disk Errors 

iostat -eE shows the disk error counters accumulated since the last reboot; the parameters below need to be considered when looking at disk errors.
*********************************************************************************
bash-3.00# iostat -eE
           ---- errors ---
device  s/w h/w trn tot
cmdk0   0   0   0   0
sd0     0   0   0   0
nfs1    0   0   0   0
cmdk0     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: VBOX HARDDISK   Revision:  Serial No: VB4d87fd3f-3f00 Size: 17.18GB <17179803648 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
sd0       Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: VBOX     Product: CD-ROM           Revision: 1.0  Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
*********************************************************************************
Possible solutions for disk I/O problems are given below:

1. Check the file system kernel parameters to make sure the inode caches are working properly.
2. Spread the I/O traffic across multiple disks (especially with a RAID setup or ZFS).
3. Redesign the problematic process to reduce the number of disk I/Os (for example with CacheFS or an application-level cache).
4. Set a proper write throttle value. For example, if ufs_WRITES is set to 1 (the default) and the number of outstanding write bytes on a file exceeds ufs_HW, writes are suspended until the outstanding bytes drop back to ufs_LW (see the sketch after this list).
(ufs_WRITES - if this value is non-zero, the number of outstanding write bytes per file is checked. ufs_HW - maximum number of outstanding bytes allowed on a single file. ufs_LW - when writes complete and the outstanding bytes fall below this value, all pending (sleeping) processes are woken up and resume writing.)
5. Database I/O should be done to raw disk partitions (please avoid NFS).
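
If the UFS write throttle from point 4 needs adjusting, the tunables are normally set in /etc/system and require a reboot; the values below are purely illustrative, not recommendations:

* /etc/system - UFS write throttle (illustrative values only)
set ufs:ufs_WRITES=1
set ufs:ufs_HW=16777216
set ufs:ufs_LW=8388608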

File system performance

When considering file system performance, an important factor is file system latency, which directly impacts I/O performance. Below are the main causes of file system latency:

1. Disk I/O wait - This can be as short as 0 in the event of a read cache hit. For synchronous I/O it can be influenced by adjusting the cache parameters.
2. File system cache misses - Misses in the block, buffer, metadata and name lookup caches significantly increase file system latency.
3. File system locking - Most file systems use locking, and this can have a major impact with large files such as database files.
4. Metadata updating - Creating, deleting and extending files causes extra latency for file system metadata updates.


    As mentioned earlier, file system caches play an important role in I/O performance. The major file system caches are listed below:

1. DNLC (Directory Name Lookup Cache) - Caches vnode-to-directory-path lookup information, which avoids a directory lookup on every access (see the quick check after this list).
2. Inode Cache - Stores file metadata information (such as size and access time).
3. Rnode Cache - Maintained on NFS clients; it stores information about NFS mount points.
4. Buffer Cache - The bridge between physical metadata (e.g. block placement on the file system) and the logical data stored in the other caches.
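
The DNLC hit rate, for example, can be checked quickly from the kernel statistics; a hit percentage well above 90% is normally what we want (a minimal sketch):

bash-3.00# vmstat -s | grep 'name lookups'     (total name lookups and the cache hit percentage)
bash-3.00# kstat -n dnlcstats                  (detailed DNLC counters: hits, misses, purges)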

Physical disk Layout

The disk layout for a hard drive includes the following:

1. Boot block
2. Super block
3. Inode list (the number of inodes is fixed when the file system is created, via the newfs/mkfs options - see the example after this list)
4. Data blocks
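
For point 3, the inode density on UFS is decided when the file system is created, via the bytes-per-inode option of newfs (the device name below is only an example; a smaller value means more inodes):

bash-3.00# newfs -i 8192 /dev/rdsk/c0t0d0s6     (one inode per 8192 bytes of data space)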

Inode Layout

Each inode contains below information

1. File type, permissions, etc.
2. Number of hard links to the file
3. UID
4. GID
5. Byte size
6. Array of block addresses
7. Generation number (incremented every time the inode is reused)
8. access time
9. modification time
10. change time
11. Number of sectors 
12. Shadow inode location ( which can be used with ACL) 

Overall, disk I/O performance has many dependencies, including application tuning, the physical disk setup and the various cache sizes; in a nutshell, as sysadmins we need to consider tuning all of these factors to improve I/O performance.









Wednesday, December 23, 2015

Performance Analysis in Solaris 10 - Memory

In the real world, memory performance issues play a major role in overall system performance. UNIX system memory is of two types: physical memory, which is attached to the DIMM modules of the hardware, and swap space, which is dedicated space on disk that the OS treats as memory (since disk I/O is much slower than memory I/O, we generally prefer to use swap space as little as possible).

Swap space is used only when physical memory is too small to accommodate the system's memory requirements. At that point, space is freed in physical memory by paging (moving) pages out to swap; keep in mind that an increase in paging to swap space will also degrade CPU performance.

vmstat command
***********






vmstat reports virtual memory statistics covering kernel threads, virtual memory, disks, traps and CPU activity. Note that on multi-CPU systems the output shows the average across the CPUs (a sample invocation follows the field descriptions below).

The details of the command is given below 

kthr - This indicates the kernel threads details in 3 states . 
    r - The number of kernel threads in run queue
    b - The number of blocked threads which are waiting for I/O paging 
    w - Number of swapped out lightweight processes (LWP - processes which is running under same kernel thread and shares its system resources and addresses with other lwp's)  that are waiting for resources to finish 

memory - Report the usage of real and virtual memory 
    swap - available swap space ( in kb)
    free - size of the free list (in kb)

page - Report about the page faults and paging activity . Details of this section is given below 
    re - page reclaims 
    mf - minor faults 
    pi - kilobytes paged in 
    po - kilobytes paged out
    fr - kilobytes freed
    de - anticipated short term memory shortfall ( in KB)
    sr- pages scanned by clock algorithm 

disk - Reports the number of disk operations per second. There are slots for up to 4 disks, each labelled with a letter and a number (the letter indicates the disk type, e.g. scsi or ide, and the number is the logical unit number).

faults - Reports trap/interrupt rates 
    in - interrupts
    sy - system calls
    cs- cpu context switches 

cpu - Breakdown of CPU time usage. On multiprocessor systems this is the average across all CPUs.

    us - user time
    sys - system time
    id - idle time
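
A typical interactive invocation samples at a fixed interval; the first line of output is the average since boot and is usually ignored (the interval and count below are arbitrary):

bash-3.2# vmstat 5 5     (report every 5 seconds, 5 times)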

swap usage analysis
****************
For swap analysis we need to use the two commands shown below.

bash-3.2# swap -s
total: 267032k bytes allocated + 86184k reserved = 353216k used, 964544k available

bash-3.2# swap -l
swapfile dev swaplo blocks free
/dev/zvol/dsk/rpool/swap 181,1 8 2097144 2097144

There is a major difference between these two commands. In the first, we are using 353216k out of (964544 + 353216)k, i.e. about 26% is in use. In the second, all 2097144 blocks are shown as free, i.e. 0% used. The difference is that swap -s also includes the portion of physical memory that is being used for swap. In general, if you are tracking swap usage over time you can use swap -s (as long as system performance is good), but if performance is degraded you need to look more closely at how swap usage is changing and what is causing the change. Also keep in mind that swap -l reports in 512-byte blocks while swap -s reports in 1024-byte (1 KB) units.
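
As a quick cross-check of those units, the 2097144 free blocks reported by swap -l above can be converted to megabytes (512-byte blocks):

bash-3.2# echo $((2097144 * 512 / 1024 / 1024))     (512-byte blocks to MB)
1023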

If the system runs out of swap space it will show error messages like the ones below, and we may then think about expanding swap by creating a swap file. As a general rule, the swap size should be about half of the system's physical memory;

for example, if the system memory is 8GB, the ideal swap size would be 4GB.

***********************************************************
application is out of memory

malloc error O

messages.1:Sep 21 20:52:11 mars genunix: [ID 470503 kern.warning]
WARNING: Sorry, no swap space to grow stack for pid 100295 (myprog)
***********************************************************

Creating the swap file

1. Log in as superuser
2. Create the swap file using mkfile <size>[k|m|g] <filename>
3. Activate the swap file using /usr/sbin/swap -a /path/filename
4. Add the entry in /etc/vfstab (as shown in the sketch below)

   /path/filename - - swap - no -

5. Verify the swap file using /usr/bin/swap -l
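
Putting those steps together, a minimal sketch for a 1 GB swap file (the path /export/swapfile is only an example):

bash-3.2# mkfile 1g /export/swapfile
bash-3.2# /usr/sbin/swap -a /export/swapfile
bash-3.2# grep swapfile /etc/vfstab
/export/swapfile   -   -   swap   -   no   -
bash-3.2# /usr/bin/swap -l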

In a nutshell, while configuring swap please keep the points below in mind:

  • Never allocate swap smaller than 30% of RAM.
  • Determine whether large applications (such as compilers and databases) will be using the /tmp directory. If one or more of your applications have a large demand for swap space, use the swap -s command to monitor swap resources on a similar existing system to get an estimate of the actual requirement.

Cache

If we check the free -m output on a Linux box, we can see that a major portion of memory appears in the cached column. So what does this cache mean, and is that memory currently in use by the system?

[root@testserver ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         15976      15195        781          0        167       9153
-/+ buffers/cache:       5874      10102
Swap:         2000          0       1999


In this case you can see that 9GB is cached. This is the page cache, which acts as a temporary buffer for read and write operations; pages modified in memory are called dirty pages. During writes, the contents of these dirty pages are periodically flushed to system storage. Up to kernel version 2.6.31, a process called pdflush ensured that the data was transferred to storage and the dirty pages were cleaned periodically. After that kernel version, there is a flush thread per backing device (e.g. sda/sdb) that handles this mechanism.

root@pc:~# ls -l /dev/sda
brw-rw---- 1 root disk 8, 0 2011-09-01 10:36 /dev/sda
root@pc:~# ls -l /dev/sdb
brw-rw---- 1 root disk 8, 16 2011-09-01 10:36 /dev/sdb
root@pc:~# ps -eaf | grep -i flush
root       935     2  0 10:36 ?        00:00:00 [flush-8:0]
root       936     2  0 10:36 ?        00:00:00 [flush-8:16]



The same mechanism applies to reading: file blocks are brought from disk into the page cache when read. For example, if you access a 100MB file twice, the second access will be faster because the data is fetched from the cache. If Linux needs more memory for applications than is currently available, areas of the page cache that are no longer in use are automatically freed.

Log files and database dump (data) files in particular tend to accumulate in the page cache because they are accessed continuously. Configuring proper log rotation, or compressing them periodically, lets the page cache be released when the memory is really needed, which helps system performance.


Tuesday, December 15, 2015

Performance analysis in Solaris 10 - CPU

Performance analysis is one of the key tasks for every system admin and an important factor in system availability (especially for production systems with SLAs). We should periodically check the various system parameters and ensure nothing is going wrong in a way that hampers the normal operation of the production systems.

The main factors we should consider for system performance analysis are disk I/O, CPU, memory & swap, network and zones (I am omitting other service components such as name services, NFS and kernel tuning, which can be discussed separately).

CPU Loading

Load average is the average, over time, of the number of processes in the run queue. It is used to represent the load on the CPU, and is reported as three numbers covering 1-, 5- and 15-minute intervals. Typically the load average divided by the number of CPU cores gives the load per CPU, and a load above 1 per CPU is considered full utilisation. A general rule of thumb is that a load average of 4 times the number of CPUs results in sluggish performance.

The load average can be monitored with the uptime command, or by monitoring the run queue of the processors using the sar -q command.

uptime
*******
bash-3.2# uptime
4:29pm  up 34 day(s), 14:45,  2 users,  load average: 0.45, 0.49, 0.54

The last 3 numbers are the load averages over 1-, 5- and 15-minute intervals. So what does this load metric mean? It is the average number of processes that are running or queued to run (including the currently running ones). For example, a 1-minute load average of 0.50 means that for half of the last minute the CPU was idle with no runnable processes. As another example, a 1-minute load average of 2.50 on a single-CPU system means that on average 1.5 processes were waiting in the run queue while one was running, i.e. the CPU was overloaded by 150%.
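
To put the numbers in the context of the CPU count, a quick sketch (psrinfo prints one line per CPU that Solaris can see; the 0.45 figure is the 1-minute value from the uptime output above, and a 2-CPU box is assumed for illustration):

bash-3.2# psrinfo | wc -l               (number of CPUs/cores visible to the OS)
bash-3.2# echo "0.45 / 2" | bc -l       (1-minute load average divided by the CPU count)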

The load average can also be examined by analysing the run queue length and the amount of time the queue is occupied, using the sar -q command.

Using the sar -q command we get to know the following information:
1. The average queue length while the queue is occupied
2. The percentage of time that the queue is occupied

A sample sar -q output, including its header, is shown below:

SunOS testsolaris 5.10 Generic_144488-05 sun4v    12/15/2015

00:00:00 runq-sz %runocc swpq-sz %swpocc
01:00:01     1.0       1     0.0       0
02:00:00     1.0       1     0.0       0
03:00:01     1.0       1     0.0       0
04:00:00     1.0       1     0.0       0
05:00:00     1.1       1     0.0       0
06:00:00     1.0       1     0.0       0
07:00:00     1.0       1     0.0       0

08:00:01     1.1       5     0.0       0
............................................

Average      1.0       3     0.0       0

runq-sz - The number of kernel threads in memory waiting for a CPU. Normally this should be less than 2; if it is consistently higher, the CPUs are fully utilised (consider adding more CPU).

%runocc - The run queue (dispatch queue) occupancy. Consistently high run queue occupancy indicates CPU saturation.

swpq-sz - The average number of swapped-out processes.

%swpocc - The percentage of time in which processes are swapped out.

So overall, if %runocc is greater than 90 and runq-sz is consistently greater than 2, we should consider adding more CPU to keep system performance consistent.

prstat
***********
This is one of the most widely used system utilities, typically for the cases below:

1. How much of the system is utilised in terms of CPU and memory
2. Utilisation of the system broken down by zone, user or process
3. How the processes/threads are utilising the system (user bound, I/O bound)

The fields in the default prstat output are described below.
PID: the process ID of the process.


USERNAME: the real user (login) name or real user ID.


SIZE: the total virtual memory size of the process, including all mapped files and devices, in kilobytes (K), megabytes (M), or gigabytes (G).


RSS: the resident set size of the process (RSS), in kilobytes (K), megabytes (M), or gigabytes (G).


STATE: the state of the process (cpuN/sleep/wait/run/zombie/stop).


PRI: the priority of the process. Larger numbers mean higher priority.


NICE: nice value used in priority computation. Only processes in certain scheduling classes have a nice value.


TIME: the cumulative execution time for the process.


CPU: The percentage of recent CPU time used by the process. If executing in a non-global zone and the pools facility is active, the percentage will be that of the processors in the processor set in use by the pool to which the zone is bound.


PROCESS: the name of the process (name of executed file).


NLWP: the number of lwps in the process.

You can also sort the prstat output in ascending (-S option) or descending (-s option) order using the keys below (see the example after this list):

cpu - Sort by CPU usage (this is the default)
pri - Sort by process priority
rss - Sort by resident set size
size - Sort by size of the process image
time - Sort by execution time
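
For example, the top consumers can be listed as below; -n limits the number of lines shown, and the trailing interval/count are arbitrary:

bash-3.2# prstat -s rss -n 10 5 1      (top 10 processes by resident set size, one 5-second sample)
bash-3.2# prstat -s size -n 10 5 1     (top 10 by virtual size)
bash-3.2# prstat -S time -n 10 5 1     (ascending sort by cumulative CPU time)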

If you want the utilisation report broken down by zone, use prstat -Z; it shows the global zone and testzone separately.
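
The zone report comes from an invocation along the lines of the one below; the summary lines at the bottom of the output show per-zone CPU and memory usage:

bash-3.2# prstat -Z 5 1     (per-process listing plus a per-zone summary, one 5-second sample)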

One more option in prstat is microstate accounting (prstat -m), which shows CPU latency, system time, user time and so on per process.
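
Microstate accounting is usually most useful per LWP (thread); a typical invocation is sketched below, where columns such as USR, SYS, SLP and LAT show where each thread spends its time:

bash-3.2# prstat -mL -n 15 5 1     (microstate columns, per LWP, top 15 lines, one 5-second sample)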

In a nutshell, we can suspect CPU performance problems when:

1. The number of processes in the run queue is greater than the number of CPUs in the system
2. The run queue is 4 times larger than the number of available CPUs
3. The CPU idle time is 0 and system time is double the user time; in that case the system is facing a serious CPU bottleneck

We also have deeper analysis tools such as DTrace, which will be discussed separately on another occasion.