A! Help: Gathering stats for gc lost blocks

Tuesday, May 31, 2011

Gathering stats for gc lost blocks

In a two-node RAC cluster, gc cr block lost appeared second in the top 5 wait events list. This was happening on only one of the nodes; the other node had almost no waits on this event. Stats plotted with ADMon, taken from statspack base tables:

[Graph: Waits]

[Graph: Wait Times]

The note gc lost blocks diagnostics [ID 563566.1] describes what to look for when diagnosing this wait event. Among the items listed are statistics on dropped and fragmented packets. Comparing the netstat -s output of the two nodes, the following could be observed. On the problem node

 netstat -s
Ip:
   650337997 total packets received
   145 with invalid addresses
   0 forwarded
   0 incoming packets discarded
   422623145 incoming packets delivered
   990924957 requests sent out
   1718163 fragments dropped after timeout
   271652117 reassemblies required
   43937411 packets reassembled ok
   4233888 packet reassembles failed
   42533293 fragments received ok
   259374063 fragments created

On the non-problem node

Ip:
   2896602613 total packets received
   4445 with invalid addresses
   0 forwarded
   0 incoming packets discarded
   1690637959 incoming packets delivered
   2164503370 requests sent out
   4433 outgoing packets dropped
   4 fragments dropped after timeout
   1430346149 reassemblies required
   224385940 packets reassembled ok
   4 packet reassembles failed
   239685318 fragments received ok
   52 fragments failed
   1386980139 fragments created
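The difference between the two nodes is easier to see as a ratio than as raw counters. The following is a minimal sketch, not part of the original diagnosis: `reasm_fail_pct` is a hypothetical helper name, and dividing failed reassemblies by the sum of successful and failed ones is an assumption about what makes a useful metric.

```shell
# Hypothetical helper: reads `netstat -s` output on stdin and prints the
# percentage of reassembly attempts that failed, using the Ip: section only.
reasm_fail_pct() {
    awk '
        /^Ip:/      { in_ip = 1; next }   # start of the Ip section
        /^[A-Za-z]/ { in_ip = 0 }         # any other section header ends it
        in_ip && /packets reassembled ok/    { ok = $1 }
        in_ip && /packet reassembles failed/ { failed = $1 }
        END {
            total = ok + failed
            if (total == 0) { print "n/a"; exit }
            printf "%.4f\n", 100 * failed / total
        }'
}

# Example (not run here):
#   netstat -s | reasm_fail_pct
```

Fed the problem node's counters shown above, this prints 8.7892, while the other node's counters round to 0.0000.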

The usual solution for this is to increase the reassembly buffer space (ipfrag_low_thresh, ipfrag_high_thresh) and the reassembly timeout (ipfrag_time), but this didn't help in this case. As per the above metalink note "In most cases, gc buffer lost has been attributed to (a) A missing OS patch (b) Bad network card (c) Bad cable (d) Bad switch (e) One of the network settings.".
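Before changing the reassembly settings it helps to record what they currently are. The sketch below assumes a Linux host where the values live under /proc/sys/net/ipv4; `show_ipfrag` and its optional directory argument are hypothetical, added only for illustration, and the numbers in the commented sysctl lines are placeholders rather than recommendations.

```shell
# Print the current IP fragment reassembly settings.
# The directory argument is an assumption added for illustration;
# on Linux the live values are under /proc/sys/net/ipv4.
show_ipfrag() {
    dir=${1:-/proc/sys/net/ipv4}
    for p in ipfrag_low_thresh ipfrag_high_thresh ipfrag_time; do
        if [ -r "$dir/$p" ]; then
            printf 'net.ipv4.%s = %s\n' "$p" "$(cat "$dir/$p")"
        else
            printf 'net.ipv4.%s = (not readable)\n' "$p"
        fi
    done
}

# To raise them (as root), something like the following could be used;
# the values are placeholders only:
#   sysctl -w net.ipv4.ipfrag_high_thresh=16777216
#   sysctl -w net.ipv4.ipfrag_low_thresh=15728640
#   sysctl -w net.ipv4.ipfrag_time=60
```

Running `show_ipfrag` before and after a change makes it easy to confirm that the new values actually took effect.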

The following shell script can be used to store the IP packet stats relating to the above wait event in a database table.

SQL>create table ipfrag(stat_time timestamp default systimestamp, fragments_dropped number, reassemblies_required number, reassembled_ok number, reassemble_failed number);

$ netstat -s | grep 'Ip' -A 12 | grep -E 'reassembl|dropped' | awk '
    $2 ~ /^fragments/    { dropped = $1 }        # fragments dropped after timeout
    $2 ~ /^reassemblies/ { res_required = $1 }   # reassemblies required
    $2 ~ /^packets/      { res_ok = $1 }         # packets reassembled ok
    $3 ~ /^reassembles/  { res_fail = $1         # packet reassembles failed (last line of interest)
        printf "insert into ipfrag (fragments_dropped,reassemblies_required,reassembled_ok,reassemble_failed) values(%d,%d,%d,%d);\n", dropped, res_required, res_ok, res_fail }
' | sqlplus -s username/pw
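netstat -s leaves out some counters while they are zero (the problem node's output above has no "outgoing packets dropped" line, for example), so a sampler that waits for a particular line may emit nothing on a healthy node. The following is a minimal sketch of a variant that always produces one row per sample; `ipfrag_insert_sql` is a hypothetical name, and writing missing counters as 0 is an assumption.

```shell
# Hypothetical helper: turns `netstat -s` text on stdin into one INSERT for
# the ipfrag table. Counters whose lines are missing are written as 0.
ipfrag_insert_sql() {
    awk '
        /^Ip:/      { in_ip = 1; next }   # start of the Ip section
        /^[A-Za-z]/ { in_ip = 0 }         # any other section header ends it
        in_ip && /fragments dropped after timeout/ { dropped = $1 }
        in_ip && /reassemblies required/           { required = $1 }
        in_ip && /packets reassembled ok/          { ok = $1 }
        in_ip && /packet reassembles failed/       { failed = $1 }
        END {
            # %.0f avoids integer overflow in awk versions with 32-bit %d
            printf "insert into ipfrag (fragments_dropped,reassemblies_required,reassembled_ok,reassemble_failed) values(%.0f,%.0f,%.0f,%.0f);\n", dropped, required, ok, failed
        }'
}

# Example (not run here), e.g. from a script called by cron every few minutes:
#   netstat -s | ipfrag_insert_sql | sqlplus -s username/pw
```

Because the counters are cumulative since boot, the interesting figure in the table is the difference between consecutive samples rather than the stored values themselves.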

Update on 06-06-2011
These servers are hosted in a hosting company's data centre, and the problem turned out to be a mismatch in a switch setting. Once this was fixed (speed and duplex were set to auto/auto in this case), the gc cr block lost waits went away on the problem node as well. Graphs taken with ADMon after the fix.
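A speed or duplex mismatch of this kind can often be spotted from the server side as well, since a port that fails to negotiate typically falls back to half duplex. The sketch below assumes ethtool is available; `check_link` is a hypothetical helper and eth1 is a placeholder for the interconnect interface.

```shell
# Hypothetical helper: summarises speed, duplex and auto-negotiation from
# `ethtool <interface>` output on stdin and warns if the link is not full duplex.
check_link() {
    awk -F': *' '
        /^[ \t]*Speed:/            { speed = $2 }
        /^[ \t]*Duplex:/           { duplex = $2 }
        /^[ \t]*Auto-negotiation:/ { autoneg = $2 }
        END {
            printf "speed=%s duplex=%s autoneg=%s\n", speed, duplex, autoneg
            if (duplex != "Full") print "WARNING: link is not full duplex"
        }'
}

# Example (not run here); eth1 stands in for the private interconnect NIC:
#   ethtool eth1 | check_link
```

The switch port settings still have to be checked on the switch itself, which in a hosted environment usually means asking the hosting company.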
