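[Graphs: Waits, Wait Times]
The Metalink note "gc lost blocks diagnostics" [ID 563566.1] describes what to look for when diagnosing this wait event. Among the things to check are the statistics for dropped and fragmented packets. Comparing the netstat -s output of the two nodes, the following could be observed: the first Ip: block below is from the problem node and the second is from the non-problem node.
$ netstat -s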
Ip:
650337997 total packets received
145 with invalid addresses
0 forwarded
0 incoming packets discarded
422623145 incoming packets delivered
990924957 requests sent out
1718163 fragments dropped after timeout
271652117 reassemblies required
43937411 packets reassembled ok
4233888 packet reassembles failed
42533293 fragments received ok
259374063 fragments created
Ip:
2896602613 total packets received
4445 with invalid addresses
0 forwarded
0 incoming packets discarded
1690637959 incoming packets delivered
2164503370 requests sent out
4433 outgoing packets dropped
4 fragments dropped after timeout
1430346149 reassemblies required
224385940 packets reassembled ok
4 packet reassembles failed
239685318 fragments received ok
52 fragments failed
1386980139 fragments created
The problem node shows far more fragments dropped after timeout and failed reassemblies (1718163 and 4233888, against 4 and 4 on the non-problem node). The solution for this is to increase the reassembly buffer space and fragment timeout (ipfrag_low_thresh, ipfrag_high_thresh and ipfrag_time), but this didn't help either. As per the above Metalink note, "In most cases, gc buffer lost has been attributed to (a) A missing OS patch (b) Bad network card (c) Bad cable (d) Bad switch (e) One of the network settings."
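The gap between the two nodes is easier to see as a ratio. The following is a minimal sketch, assuming the Linux net-tools wording of the netstat -s lines shown above. The function name reasm_fail_pct is made up for illustration, and dividing failed reassemblies by reassemblies required is only a rough indicator, not a metric taken from the Metalink note.

```shell
#!/bin/sh
# Print the percentage of failed IP reassemblies from `netstat -s` style
# text read on stdin. Assumes the counter wording shown in the output above.
reasm_fail_pct() {
    awk '
        /reassemblies required/     { required = $1 }
        /packet reassembles failed/ { failed = $1 }
        END {
            if (required > 0)
                printf "%.2f\n", 100 * failed / required
            else
                print "n/a"   # counter line missing or zero
        }
    '
}

# Usage on a live node:
#   netstat -s | reasm_fail_pct
```

Fed the figures above, this gives about 1.56 for the problem node and 0.00 for the non-problem node.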
The following table and shell command could be used to store the IP packet stats relating to the above wait event in a database table.
SQL> create table ipfrag(stat_time timestamp default systimestamp, fragments_dropped number, reassemblies_required number, reassembled_ok number, reassemble_failed number);
$ netstat -s | grep 'Ip' -A 12 | grep -E 'reassembl|dropped' | awk '
# Each rule keys on the wording of one netstat counter line; $1 is its value.
$2 ~ /^fragments/    { dropped = $1 }       # N fragments dropped after timeout
$2 ~ /^reassemblies/ { res_required = $1 }  # N reassemblies required
$2 ~ /^packets/      { res_ok = $1 }        # N packets reassembled ok
$3 ~ /^reassembles/  {                      # N packet reassembles failed
    res_fail = $1
    # Last counter seen: emit a single insert statement and stop.
    printf "insert into ipfrag (fragments_dropped,reassemblies_required,reassembled_ok,reassemble_failed) values(%d,%d,%d,%d);\n", dropped, res_required, res_ok, res_fail
    exit
}' | sqlplus -s username/pw
Update on 06-06-2011
These servers are hosted in a hosting company's data centre, and the problem turned out to be a mismatch in a switch setting. Once this was fixed (speed and duplex were set to auto/auto in this case), the gc cr block lost waits went away on the problem node as well. Graphs taken with ADMon after the fix.
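Since the root cause was a speed/duplex mismatch, it can help to see what the server side of the link has negotiated before asking the hosting company to check the switch port. The following is a rough sketch, assuming a Linux host where ethtool is available. The function name link_summary and the interface name eth1 are hypothetical, and this check was not part of the original diagnosis.

```shell
#!/bin/sh
# Summarise speed, duplex and auto-negotiation from `ethtool <iface>` output
# read on stdin, so the server side can be compared with the switch port.
link_summary() {
    awk -F': *' '
        /^[ \t]*Speed:/            { speed = $2 }
        /^[ \t]*Duplex:/           { duplex = $2 }
        /^[ \t]*Auto-negotiation:/ { autoneg = $2 }
        END { printf "speed=%s duplex=%s autoneg=%s\n", speed, duplex, autoneg }
    '
}

# Usage (eth1 is a hypothetical interconnect interface name):
#   ethtool eth1 | link_summary
```

If the reported duplex or auto-negotiation state differs from what the switch port is configured for, that points to the kind of mismatch described above.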