Wednesday, March 5, 2008

OS Watcher

Download information is available in Metalink note 301137.1.

Data collectors exist for top, vmstat, iostat, mpstat, netstat and ps, plus an optional collector for tracing private networks. To turn on data collection for private networks, create an executable file named private.net in the osw directory. An example of what this file should look like is provided as Exampleprivate.net in the osw directory. Either edit that file and rename it to private.net, or create a new file named private.net. The file contains entries that run the traceroute command to verify RAC private networks.

Start OSW

startOSW.sh interval_in_sec number_of_hours_to_archive_the_data
If no arguments are given, data is collected every 10 seconds and 48 hours of data are archived.
The archive is a moving window: with the default of 48 hours, the first hour's data is deleted in the 49th hour, leaving 48 hours' worth of data on the system.
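
To make the defaults and the moving window concrete, here is a minimal sketch. It is not taken from startOSW.sh; the function names osw_settings and osw_snapshots_retained are hypothetical, and the defaults are the ones stated above.

```shell
#!/bin/sh
# Sketch only: mirrors the defaults described in the text, not the real script.

# Echo the effective interval (seconds) and retention (hours).
osw_settings() {
    interval=${1:-10}   # seconds between snapshots
    hours=${2:-48}      # hours of data kept in the archive
    echo "$interval $hours"
}

# Number of snapshots each collector takes within the retention window.
osw_snapshots_retained() {
    interval=${1:-10}
    hours=${2:-48}
    echo $(( hours * 3600 / interval ))
}

osw_settings                    # no arguments: 10 48
osw_settings 60 10              # 60 10
osw_snapshots_retained 60 10    # 10 hours at one snapshot per minute: 600
```

Shorter intervals multiply the number of snapshots kept, so the archive directory grows accordingly.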

Use nohup ./startOSW.sh 60 10 & to start OSW in the background (here, one snapshot every 60 seconds with 10 hours of data kept).

Stop OSW

To stop OSW, run stopOSW.sh.

OSW Output

OSW output is available in the OSW_HOME/archive directory, which has a separate subdirectory for each OS utility. Output file names have the following format: node_name_OS_utility_YY.MM.DD.HH24.dat
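
As an illustration of the naming pattern, the sketch below builds a file name from a node name and a utility name. The function name osw_archive_name and the host dbhost1 are hypothetical, and the date format is only my reading of YY.MM.DD.HH24; actual file names may differ between OSW versions.

```shell
#!/bin/sh
# Sketch: build a name of the form node_name_OS_utility_YY.MM.DD.HH24.dat
# The optional third argument (a pre-formatted stamp) is there only to make
# the example repeatable; by default the current date and hour are used.
osw_archive_name() {
    node=$1
    utility=$2
    stamp=${3:-$(date +%y.%m.%d.%H)}
    echo "${node}_${utility}_${stamp}.dat"
}

osw_archive_name dbhost1 iostat 08.03.05.14   # dbhost1_iostat_08.03.05.14.dat
osw_archive_name "$(uname -n)" vmstat         # name for this host, current hour
```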

oswiostat

Field Description
r/s :- Shows the number of reads/second
w/s :- Shows the number of writes/second
kr/s :- Shows the number of kilobytes read/second
kw/s :- Shows the number of kilobytes written/second
wait :- Average number of transactions waiting for service (queue length)
actv :- Average number of transactions actively being serviced
wsvc_t :- Average service time in wait queue, in milliseconds
asvc_t :- Average service time of active transactions, in milliseconds
%w :- Percent of time there are transactions waiting for service
%b :- Percent of time the disk is busy
device :- Device name

What to look for

  1. Average service times (asvc_t) greater than 20 ms sustained over a long period.
  2. High average wait times.
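
A rough way to scan an oswiostat file for the first condition is an awk filter on the asvc_t column. This is a sketch that assumes the eleven-column layout listed above, with asvc_t in column 8 and the device name in column 11. The function name slow_devices is hypothetical and the sample lines are made up.

```shell
#!/bin/sh
# Sketch: print device and asvc_t for every sample above a threshold.
# Assumed column order (from the field list above):
#   r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
slow_devices() {
    threshold=${1:-20}
    awk -v limit="$threshold" '
        # Skip headers and timestamp lines: require 11 fields and a numeric asvc_t.
        NF == 11 && $8 ~ /^[0-9.]+$/ && $8 + 0 > limit { print $11, $8 }
    '
}

# Made-up sample data, for illustration only.
slow_devices 20 <<'EOF'
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.2    3.4   10.0   27.5  0.0  0.1    0.0    4.2   0   2 c0t0d0
   85.0   40.1  680.3  320.9  0.0  3.9    0.0   31.4   0  97 c1t2d0
EOF
```

The filter flags individual samples. Whether a device is slow "for long duration" still has to be judged by how often the same device appears across consecutive snapshots.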


oswmpstat

Field Description
cpu Processor ID
minf Minor faults
mjf Major faults
xcal Processor cross-calls (when one CPU wakes up another by interrupting it).
intr Interrupts
ithr Interrupts as threads (except clock)
csw Context switches
icsw Involuntary context switches
migr Thread migrations to another processor
smtx Number of times a CPU failed to obtain a mutex
srw Number of times a CPU failed to obtain a read/write lock on the first try
syscl Number of system calls
usr Percentage of CPU cycles spent on user processes
sys Percentage of CPU cycles spent on system processes
wt Percentage of CPU cycles spent waiting on an event (I/O wait)
idl Percentage of unused CPU cycles or idle time when the CPU is basically doing nothing

What to look for

  1. Involuntary context switches (icsw). This is probably the most relevant statistic when examining performance issues.
  2. Number of times a CPU failed to obtain a mutex (smtx). Values consistently greater than 200 per CPU cause system time to increase.
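  3. xcal is very important: a high cross-call rate means CPUs are frequently interrupting one another. Thread migrations between processors are reported separately under migr.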
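
The smtx check can be sketched the same way with awk. This assumes the sixteen-column layout in the field list above, with the CPU id in column 1, icsw in column 8 and smtx in column 10. The function name busy_mutex_cpus is hypothetical and the sample lines are made up.

```shell
#!/bin/sh
# Sketch: report CPUs whose smtx value exceeds a threshold in a sample.
# Assumed column order (from the field list above):
#   CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
busy_mutex_cpus() {
    threshold=${1:-200}
    awk -v limit="$threshold" '
        # Skip headers and timestamp lines: require 16 fields and a numeric CPU id.
        NF == 16 && $1 ~ /^[0-9]+$/ && $10 + 0 > limit {
            print "cpu " $1 " smtx=" $10 " icsw=" $8
        }
    '
}

# Made-up sample data, for illustration only.
busy_mutex_cpus 200 <<'EOF'
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   12   0  310   420  210  650   45   30  120    0  2100   35  10   0  55
  1    8   0  290   380  190  720  130   42  415    2  2600   48  30   0  22
EOF
```

As with iostat, a single sample over the threshold means little. The pattern to look for is the same CPU being reported across many consecutive snapshots.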

OSW graph (OSWg) comes bundled with OSW v2.0.0 and higher. Documentation is available in Metalink note 461053.1.

It requires Java version 1.4 or higher. To use it, simply run:

java -jar -Xmx512M OSWg.jar -i /home/osw/archive


where /home/osw/archive is the path to the directory containing the output subdirectories.


To monitor the private network in a RAC environment, copy the provided Exampleprivate.net to private.net in the same directory. Remove all commands in private.net except those that match the current OS, then replace the node names in private.net with the actual node names of the system.
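
The copy step can be sketched as below. The function name install_private_net and the path /home/osw are hypothetical. The sketch only copies the example and sets the executable bit; the OS-specific cleanup and the node name edits still have to be done by hand.

```shell
#!/bin/sh
# Sketch: copy the shipped example to private.net and make it executable.
# Assumes Exampleprivate.net lives in the given OSW directory.
install_private_net() {
    osw_dir=$1
    cp "$osw_dir/Exampleprivate.net" "$osw_dir/private.net" &&
        chmod +x "$osw_dir/private.net"
}

# Usage (hypothetical path):
#   install_private_net /home/osw
#   vi /home/osw/private.net    # keep only this OS's commands, set real node names
```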