Tuesday, May 31, 2011

Gathering stats for gc lost blocks

In a two-node RAC, gc cr block lost appeared second in the top 5 wait events list. This was happening on only one of the nodes, whereas the other node had almost no waits on this event. The stats were plotted with ADMon from the statspack base tables.

The Metalink note gc lost blocks diagnostics [ID 563566.1] describes what to look for when diagnosing this wait event. Among the items are stats related to dropped and fragmented packets. Comparing the two nodes, the following could be observed. On the problem node:
 netstat -s
650337997 total packets received
145 with invalid addresses
0 forwarded
0 incoming packets discarded
422623145 incoming packets delivered
990924957 requests sent out
1718163 fragments dropped after timeout
271652117 reassemblies required
43937411 packets reassembled ok
4233888 packet reassembles failed

42533293 fragments received ok
259374063 fragments created
On the non-problem node:
2896602613 total packets received
4445 with invalid addresses
0 forwarded
0 incoming packets discarded
1690637959 incoming packets delivered
2164503370 requests sent out
4433 outgoing packets dropped
4 fragments dropped after timeout
1430346149 reassemblies required
224385940 packets reassembled ok
4 packet reassembles failed

239685318 fragments received ok
52 fragments failed
1386980139 fragments created
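To make the two outputs easier to compare, the failed reassemblies can be expressed as a percentage of all completed reassembly attempts. The sketch below is only an illustration: the function name reassembly_fail_pct is made up, and treating "packets reassembled ok" plus "packet reassembles failed" as the total number of attempts is an assumption about how to read these counters.

```shell
#!/bin/sh
# Hypothetical helper: percentage of reassembly attempts that failed.
#   $1 = "packets reassembled ok" counter
#   $2 = "packet reassembles failed" counter
reassembly_fail_pct() {
    awk -v ok="$1" -v failed="$2" 'BEGIN {
        total = ok + failed
        if (total == 0) { print "0.00"; exit }
        printf "%.2f\n", 100 * failed / total
    }'
}

reassembly_fail_pct 43937411 4233888    # problem node counters from above
reassembly_fail_pct 224385940 4         # non-problem node counters from above
```

With the figures above, the problem node comes out at close to 9 percent failed reassemblies, while the non-problem node is effectively at zero.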
The usual fix for failed reassemblies is to increase the reassembly buffer space (ipfrag_low_thresh, ipfrag_high_thresh and ipfrag_time), but this did not help in this case. As per the above Metalink note: "In most cases, gc buffer lost has been attributed to (a) A missing OS patch (b) Bad network card (c) Bad cable (d) Bad switch (e) One of the network settings.".
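For reference, the current values of these three kernel parameters can be read from /proc on Linux. This is a minimal, read-only sketch: the function name show_ipfrag and its optional directory argument are made up for illustration, and the values in the commented sysctl lines are placeholders, not a recommendation from the note.

```shell
#!/bin/sh
# Hypothetical helper: print the IP fragment reassembly settings.
# The optional argument overrides the directory holding the parameter files.
show_ipfrag() {
    dir="${1:-/proc/sys/net/ipv4}"
    for p in ipfrag_low_thresh ipfrag_high_thresh ipfrag_time; do
        if [ -r "$dir/$p" ]; then
            printf '%s = %s\n' "$p" "$(cat "$dir/$p")"
        else
            printf '%s = (not readable)\n' "$p"
        fi
    done
}

show_ipfrag

# Changing a value at runtime needs root; the numbers are placeholders only:
#   sysctl -w net.ipv4.ipfrag_high_thresh=16777216
#   sysctl -w net.ipv4.ipfrag_low_thresh=15728640
#   sysctl -w net.ipv4.ipfrag_time=60
```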

The following shell script could be used to store the IP packet stats related to the above wait event in a database table.
SQL>create table ipfrag(stat_time timestamp default systimestamp, fragments_dropped number, reassemblies_required number, reassembled_ok number, reassemble_failed number);

$ netstat -s | grep 'Ip' -A 12 | grep -E 'reassembl|dropped' | awk '
# pick up each counter from the first field of its line
$2~/^fragments/ {dropped = $1} $2~/^reassemblies/ {res_required = $1} $2~/^packets/ {res_ok = $1}
# "packet reassembles failed" is the last counter needed, so emit the insert statement here
$3~/^reassembles/ {res_fail = $1; printf "insert into ipfrag (fragments_dropped,reassemblies_required,reassembled_ok,reassemble_failed) values(%d,%d,%d,%d);\n",dropped,res_required,res_ok,res_fail}' | sqlplus -s username/pw
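Since the netstat -s counters are cumulative, the rows stored in ipfrag are more useful when viewed as differences between consecutive samples. The sketch below shows one possible way to do that with the LAG analytic function. The function name ipfrag_delta_sql and the column aliases are made up for illustration; the table and column names are the ones created above.

```shell
#!/bin/sh
# Hypothetical helper: print a query that reports the change in each
# failure counter between consecutive samples in the ipfrag table.
ipfrag_delta_sql() {
    cat <<'EOF'
select stat_time,
       fragments_dropped - lag(fragments_dropped) over (order by stat_time) as dropped_delta,
       reassemble_failed - lag(reassemble_failed) over (order by stat_time) as failed_delta
from ipfrag
order by stat_time;
EOF
}

ipfrag_delta_sql
# To run it against the database:
#   ipfrag_delta_sql | sqlplus -s username/pw
```

The first row has no previous sample, so its deltas are null.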
Update on 06-06-2011
These servers are hosted in a hosting company's data centre, and the problem turned out to be a mismatch in a switch setting. Once this was fixed (speed and duplex were set to auto/auto in this case), the gc cr block lost waits went away on the problem node as well. Graphs taken with ADMon after the fix.
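The mismatch here was on the switch side, but the settings the host NIC ended up with can also be checked from the server using ethtool. This is a minimal sketch, assuming the usual "Speed:", "Duplex:" and "Auto-negotiation:" lines in ethtool output. The function name link_settings and the interface name eth1 are assumptions, so substitute the actual interconnect interface.

```shell
#!/bin/sh
# Hypothetical helper: summarise speed, duplex and auto-negotiation
# from `ethtool <interface>` output read on stdin.
link_settings() {
    awk -F': *' '
        /^[ \t]*Speed:/            {speed = $2}
        /^[ \t]*Duplex:/           {duplex = $2}
        /^[ \t]*Auto-negotiation:/ {autoneg = $2}
        END {printf "speed=%s duplex=%s autoneg=%s\n", speed, duplex, autoneg}'
}

# Usage (eth1 is an assumed interface name):
#   ethtool eth1 | link_settings
```

Running this on both nodes and comparing the result gives a quick first check for a speed or duplex mismatch before the hosting provider is asked to look at the switch port.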