Monday, October 1, 2012

Log File Sync and Log File Parallel Write Waits

On a routine database health check it was discovered the that the log file sync waits events have increased 10-20 times than the base line values. Graphs generated with ADMon
Log File Sync Waits
Metalink note 34592.1 list the usual causes for high log file sync waits. Assuming all else (meaning hardware related parameters) remains the same and constants the main reason is the high user commit rates. But in this case the user commit rates were consistent throughout the problem period.
User Commits
Same was true for user rollbacks (the drop is due to instance restart) and transaction rollbacks.
User Rollbacks
Transaction Rollback
DB CPU was also consistent throughout the problem period which indicated there was no increase in the system load.
Metalink note 34592.1 show how the log file sync wait time could be broken down. Two metrics related to this case are the "redo write time" metric which is the time taken by lgwr process gathering redo to be written and issuing IO. The second statistic is the log file parallel write (refer 34583.1) wait time which is the "time taken for all the outstanding I/O requests to complete". Log file parallel write wait time graph was similar to the log file sync wait time graph.
Log File Parallel Write
34592.1 also mention several statistics that are related to lgwr performance. These are redo writes,redo blocks written,redo write time,redo wastage and redo size. Definitions of these statistics are from the Oracle Reference guide.
Redo writes is the total number of writes by LGWR to the redo log files and it was also consistent throughout the problem period.
Redo Writes
Redo blocks written is the total number of redo blocks written and it is also consistent and hold the same pattern as before the problem period.
Redo Blocks Written
Redo write time is the total elapsed time of the write from the redo log buffer to the current redo log file in 10s of milliseconds. There a increase in redo write time corresponding to log file sync and log file parallel write graphs indicating more time spent on writing redo to log file.
Redo Write Time
Redo wastage is the number of bytes wasted because redo blocks needed to be written before they are completely full. This statistic was also also consistent throughout the problem period.
Redo Wastage
The last is the redo size which is the total amount of redo generated in bytes. This statistics is also an indication of the system load, if there are more DML then amount of redo size would also increase. However in this case the redo size is also consistent throughout the problem period.
Redo Size
Next redo log space and redo log buffer statistics were looked at. Redo buffer allocation retries are the total number of retries necessary to allocate space in the redo buffer. Though there hasn't been any reduction in redo buffer size the amount of redo buffer allocation retries have gone up indicating lgwr is not clearing the log buffer fast enough.
Redo Buffer Allocation Retries
Redo Buffers
Increase in redo log space requests and redo log space wait time indicates that the wait is not memory bound but IO (or disk) bound. Redo log space requests is the number of times the active log file is full and Oracle must wait for disk space to be allocated for the redo log entries. Increase in this means slow IO. Redo log space wait time is the total elapsed waiting time for "redo log space requests" in 10s of milliseconds. Both these statistics have increased in value over the problem period.
Redo Log Space Requests
Redo Log Space Wait Time
Considering all of the above it seems the traffic is the same (system load, redo generation) but some of the lanes (IO bandwidth) "might be" closed. Looking at the IO time on the OS statistics showed a increased IO time which corresponds with log file sync increase. So the assumption that "all else remains the same" may not be correct.
IO Time
Therefore at this point assumption was that cause for increase in these wait events might be due to "something" in the IO sub system rather than the database or application. The system was a two node cluster using a SAN with RAID 5. Both 34583.1 and 34592.1 advise against placing redo logs on RAID 5 but in this case even with RAID 5 the log file *(sync/parallel write) waits were low. So RAID 5 is not the root cause here.
On further investigating the SAN it was discovered the it had two RAID controllers and battery on one of them has failed (thus closing one lane of traffic).
Replacing the dead battery and bringing the RAID controller online (opening up the lane) made the waits to go back to previous level.
Log File Parallel Write

Below is a different system's log file *(sync/parallel write) waits and in this case the increase in waits is caused by a burst of user commits.
Log File Sync
Log File Parallel Write
User Commits
Increase in these two wait events could happen due to user activity as well as due to change in IO related hardware (partial failure as in this case). Having a base line taken when the performance is optimal, to compare with the problem period could help in diagnosing and identifying the reason for the increase.

Useful metalink notes
WAITEVENT: "log file sync" Reference Note [ID 34592.1]
WAITEVENT: "log file parallel write" Reference Note [ID 34583.1]