Thursday, March 24, 2011

Stuck Archiver Processes and FAL Gap Resolution Not Working

On a RAC-to-RAC Data Guard configuration, waits in the network wait class and waits on LGWR-LNS seemed to appear out of the blue. The system has been running for quite a number of years and there have not been any changes to the network or any other hardware (NICs, cables, etc.). Output from the emconsole:
Drilling down into the "Other" wait class showed the following LGWR-LNS wait.
The wait histogram showed a constant value of 16 ms.
The following was observed in the network wait class drill-down.
Metalink note Data Guard Wait Events (233491.1) describes these wait events as follows.
ARCH wait on ATTACH - This wait event monitors the amount of time spent by all archive processes to spawn an RFS connection.
LGWR-LNS wait on channel - This wait event monitors the amount of time spent by the log writer (LGWR) process or the network server processes waiting to receive messages on KSR channels. It applies to standby destinations configured with either the LGWR ASYNC or LGWR SYNC=PARALLEL attributes.

During this time there was a considerable archive gap between the primary and the standby, and FAL gap resolution seemed unable to resolve it (FAL used to work fine).
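To make "archive gap" concrete: on the standby, the V$ARCHIVE_GAP view reports missing ranges per redo thread as THREAD#, LOW_SEQUENCE# and HIGH_SEQUENCE#. The sketch below is only an illustration of the same idea in plain Python, assuming you have already collected the log sequence numbers received on the standby for each thread; the function name and input shape are hypothetical and not part of any Oracle tool.

```python
from typing import Dict, Iterable, List, Tuple


def find_sequence_gaps(received: Dict[int, Iterable[int]]) -> List[Tuple[int, int, int]]:
    """Return (thread, low_sequence, high_sequence) for every missing range.

    `received` maps a redo thread number to the log sequence numbers that
    have arrived on the standby for that thread (hypothetical input shape).
    """
    gaps: List[Tuple[int, int, int]] = []
    for thread in sorted(received):
        # De-duplicate and sort so neighbours can be compared pairwise.
        seqs = sorted(set(received[thread]))
        for prev, cur in zip(seqs, seqs[1:]):
            if cur - prev > 1:
                # Everything strictly between prev and cur is missing.
                gaps.append((thread, prev + 1, cur - 1))
    return gaps
```

Note that this only finds holes between sequences that have arrived; logs missing beyond the highest received sequence have to be compared against the primary.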

The idea of changing transport-related values in Oracle Net (send and receive buffer sizes) was set aside since there had not been any hardware changes.
Metalink gave nothing beyond the definitions of these waits.
Googling yielded a forum post that mentioned metalink note Bug 5576816 - FAL gap resolution does not work with max_connection set in some scenario. This bug was not applicable here, but the recommendation in that post was to kill all the archive processes on the primary (they get restarted as soon as they are killed). The reason given was that these archive processes were "stuck" and needed a restart. Looking at the emconsole, it was also visible that the waits were happening on the archive processes.

Tried to kill the archive processes the "proper way" by changing log_archive_max_processes to 1, but this didn't kill any of them. Even after setting it to 1, all the archive processes were still running. A rolling shutdown and startup of the primary then resolved the issue.

Unfortunately the issue was back after a few days on one of the nodes. This time, killing the Oracle database sessions of the archive processes waiting on these wait events resolved it. It seems the archive processes being "stuck" is a symptom and the cause could be something else.
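For picking out which archiver sessions to look at, here is a minimal Python sketch. It assumes rows have been fetched from GV$SESSION with a query like the one in the string constant and simply filters them down to archiver processes sitting on the two wait events named above. The row type, the function name and the 60 second threshold are hypothetical choices for illustration; the sketch only lists candidates and does not kill anything.

```python
from typing import Iterable, List, NamedTuple

# The two wait events discussed in this post.
WATCHED_EVENTS = {"ARCH wait on ATTACH", "LGWR-LNS wait on channel"}

# Assumed query for collecting the rows; run it with whatever client you use.
FIND_ARCHIVER_SESSIONS_SQL = """
SELECT inst_id, sid, serial#, program, event, seconds_in_wait
  FROM gv$session
 WHERE program LIKE '%(ARC%'
"""


class SessionRow(NamedTuple):
    """One row of the query above (hypothetical container type)."""
    inst_id: int
    sid: int
    serial: int
    program: str
    event: str
    seconds_in_wait: int


def stuck_archivers(rows: Iterable[SessionRow],
                    min_wait_seconds: int = 60) -> List[SessionRow]:
    """Return archiver sessions waiting on a watched event for a long time.

    min_wait_seconds is an arbitrary illustrative threshold, not an
    Oracle recommendation.
    """
    return [
        r for r in rows
        if "(ARC" in r.program
        and r.event in WATCHED_EVENTS
        and r.seconds_in_wait >= min_wait_seconds
    ]
```

The returned rows carry inst_id, sid and serial#, which is what is needed to locate the session on the right RAC node before deciding how to deal with it.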
Blog post will be updated ...