Wednesday, January 2, 2013

enq: US - contention and row cache lock waits

High waits on enq: US - contention and row cache lock were seen during a load test on a two-node cluster. In the statspack reports these two wait events were among the top 5 timed events:
Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
enq: US - contention                             7,498       4,035    538   21.9
row cache lock                                   8,486       1,240    146    6.7
During normal operation there were no waits on enq: US - contention, and row cache lock waits were between 0 and 4. The high waits appeared only during the load test, when the system was stressed.
The first peaks on the following graphs correspond to the high waits on the above events observed during the initial load test.
[Graph: enq: US - contention waits]

[Graph: row cache lock waits]
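Outside of statspack, the same two events can be checked from the cumulative system wait statistics. A minimal sketch against gv$system_event (event names as they appear in the statspack report):

```sql
-- Cluster-wide cumulative totals for the two wait events of interest
SELECT inst_id, event, total_waits, time_waited
FROM   gv$system_event
WHERE  event IN ('enq: US - contention', 'row cache lock')
ORDER  BY inst_id, event;
```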

Note 1332738.1 suggested this is related to undo segments and could be observed in the dc_rollback_segments rowcache statistics. However, there was not much difference in this metric between the problem period and a good period. Below is the problem period:
Cache                         Requests   Miss    Reqs  Miss     Reqs      Usage
------------------------- ------------ ------ ------- ----- -------- ----------
dc_objects                     105,703    0.1       0              0      5,389
dc_rollback_segments            38,780    0.3       0            250        514
dc_segments                      4,234    5.0       0             12      2,719
dc_tablespaces                 165,248    0.0       0              0         21
dc_users                       178,080    0.0       0              0        222
The good period:
Cache                         Requests   Miss    Reqs  Miss     Reqs      Usage
------------------------- ------------ ------ ------- ----- -------- ----------
dc_objects                     284,414    0.5       0              8      2,753
dc_rollback_segments            22,307    0.0       0              0        515
dc_segments                     17,724    7.9       0             10      1,790
dc_tablespaces                 142,346    0.0       0              0         21
dc_users                       158,440    0.0       0              0        116
Comparing the above two shows only a slight difference, but the GES statistics show the following for the problem period:
Cache                         Requests    Conflicts     Releases
------------------------- ------------ ------------ ------------
dc_objects                          84            2            0
dc_rollback_segments               511          133            0
dc_segments                        352            6            0
and there were no requests or conflicts for dc_rollback_segments during the good period.
The datafiles assigned to the undo tablespaces have autoextend on, and there is enough free space on disk to extend them. Therefore notes 420525.1 and 413732.1 weren't much help.
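That check can be sketched with a query along these lines (assuming the undo tablespaces are named UNDOTBS1 and UNDOTBS2, as elsewhere in this post):

```sql
-- Verify the undo datafiles can autoextend and how far
SELECT tablespace_name,
       file_name,
       autoextensible,
       bytes    / 1024 / 1024 AS size_mb,
       maxbytes / 1024 / 1024 AS max_mb
FROM   dba_data_files
WHERE  tablespace_name IN ('UNDOTBS1', 'UNDOTBS2');
```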
Notes 742035.1 and 7291739.8 mention bug 7291739, which materializes as high contention on the above two wait events when auto-tuned undo retention is in use. Therefore the patch for bug 7291739 was applied, and the parameter _first_spare_parameter was set to the run length of the longest running query found in v$undostat (as this is 11.1; other versions may require _highthreshold_undoretention, refer to the above-mentioned notes). Running the load test again didn't show any improvement, and high waits could still be seen (second peak on the above graphs).
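The run length of the longest running query is exposed in the MAXQUERYLEN column of v$undostat (in seconds); a sketch of how the value could be picked:

```sql
-- Longest query duration (seconds) seen across the v$undostat history
SELECT MAX(maxquerylen) AS longest_query_secs
FROM   v$undostat;
```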

Raised an SR. Oracle couldn't determine why the patch was not effective in reducing the high wait events and suggested another hidden parameter, _rollback_segment_count (also mentioned as a workaround in 1332738.1). It was recommended to set this parameter to 1.5 times the number of online undo segments.
SQL> select TABLESPACE_NAME,count(*) from DBA_ROLLBACK_SEGS where status='ONLINE' group by tablespace_name;

TABLESPACE_NAME   COUNT(*)
--------------- ----------
UNDOTBS1               323
SYSTEM                   1
UNDOTBS2               300
According to Oracle the value set is for the "entire instance not for undo tablespace", which I would imagine means per database and not per instance. This value acts as the "lower limit for the number of undo segments online at a given time". Setting it doesn't make the database proactively online the number of undo segments specified; it is the minimum number of undo segments to keep online, and it only comes into play once the number of online undo segments goes beyond the value specified. So going by the above statistics, the value to set would be (323 + 300) x 1.5 = 934.5, rounded up to 935. One more thing: this parameter is not dynamic and requires a restart (not an ideal workaround for a busy production system).
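Putting the recommendation together, the change would look roughly like this (hidden parameters must be double-quoted; since the parameter is static, a restart is needed):

```sql
-- (323 + 300) x 1.5 = 934.5, rounded up to 935 online undo segments
ALTER SYSTEM SET "_rollback_segment_count" = 935 SCOPE = SPFILE SID = '*';
-- Restart the instances for the new value to take effect
```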
After the above value was set, running the load test did not result in any enq: US - contention waits. It should be noted that the patch was still in place even with this parameter set, but it is highly unlikely that it contributed to resolving the high waits. It is possible that _rollback_segment_count alone is responsible for reducing them; this will be verified once the patch is rolled back later on.

Useful metalink notes
Full UNDO Tablespace In 10gR2 [ID 413732.1]
Contention Under Auto-Tuned Undo Retention [ID 742035.1]
Automatic Tuning of Undo_retention Causes Space Problems [ID 420525.1]
How to correct performance issues with enq: US - contention related to undo segments [ID 1332738.1]
Bug 7291739 - Contention with auto-tuned undo retention or high TUNED_UNDORETENTION [ID 7291739.8]