Monday, April 26, 2021

Tracing ASYNC Redo Transport When Data Guard Has Far Sync Instance

One way to find the cause of transport lags in a data guard configuration is to trace the redo transport related events. For async transport refer to the white paper here. However, trace event 16421 at level 1 did not produce any trace files when the data guard configuration had a far sync instance sending redo in async mode to several standby instances. The data guard configuration is given below.
DGMGRL> show configuration

Configuration - db_dg

  Protection Mode: MaxAvailability
  Members:
  db2  - Primary database
    db1  - Physical standby database
    db3  - Physical standby database
      db4  - Physical standby database (receiving current redo)
    fs1  - Far sync instance
      db5  - Physical standby database
      db6  - Physical standby database
      db7  - Physical standby database
      db8  - Physical standby database
      db9  - Physical standby database
      db10 - Physical standby database

  Members Not Receiving Redo:
  fs2  - Far sync instance (alternate of fs1)

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 57 seconds ago)
The only async transport in this configuration was happening between db3 -> db4 and fs1 [or fs2] -> {db5-db10}. The white paper mentioned above says to add the trace event to all instances, both primary and standby. Applying 16421 at level 1 to all instances (though some of them do SYNC transport), including the far sync instances, didn't result in any trace files being generated.
After an SR was raised, Oracle support instructed applying event 16410 at level 16 together with the 16421 event that was already in place. That too didn't generate any trace files related to async transport.
Next it was asked to set log_archive_trace=3360 and add two additional events, 10724 at level 1 and 10293 at level 1, while the above two events also remained in place. This combination too failed to generate any trace files related to async transport.
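As a sketch, that combination could be applied with statements along these lines, run on each member (the values shown are the ones requested by Oracle support):
alter system set log_archive_trace=3360 scope=both;
alter system set events '10724 trace name context forever, level 1';
alter system set events '10293 trace name context forever, level 1';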
Following this, Oracle support asked to increase the level on event 16421: first to 2, and if that didn't produce any result, then to 3. As such, 16421 at level 2 was set on all instances. This produced trace output similar to what is mentioned in the white paper. Trace files were generated on all the source instances of async transport. In this case two trace files were generated: one on db3, which does async transport to db4, and another on fs1, the far sync instance which does async transport to db5-db10.
As this produced results, 16421 at level 3 was not tried.
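For reference, the event that finally produced trace files can be enabled and later disabled with statements along these lines (a sketch; as noted above it needs to be set on all members):
alter system set events '16421 trace name context forever, level 2';
-- and once tracing is done
alter system set events '16421 trace name context off';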

The generated trace file has stats for all async destinations (denoted by log_archive_dest_n parameters).
krsb_end: Begin stats dump for T-1.S-57745
  max number of buffers in use             1
  Operation elapsed time (micro seconds)   2990220
  File transfer time (micro seconds)       2990215
  Network Statistics
   LOG_ARCHIVE_DEST_3 : OCI REQUEST          
    Total count  : OCI REQUEST             6
    Total time   : OCI REQUEST             306754
    Average time : OCI REQUEST             51125
   LOG_ARCHIVE_DEST_4 : OCI REQUEST          
    Total count  : OCI REQUEST             6
    Total time   : OCI REQUEST             210057
    Average time : OCI REQUEST             35009
   LOG_ARCHIVE_DEST_5 : OCI REQUEST          
    Total count  : OCI REQUEST             6
    Total time   : OCI REQUEST             739457
    Average time : OCI REQUEST             123242
   LOG_ARCHIVE_DEST_6 : OCI REQUEST          
    Total count  : OCI REQUEST             6
    Total time   : OCI REQUEST             593044
    Average time : OCI REQUEST             98840
   LOG_ARCHIVE_DEST_7 : OCI REQUEST          
    Total count  : OCI REQUEST             6
    Total time   : OCI REQUEST             433191
    Average time : OCI REQUEST             72198
   LOG_ARCHIVE_DEST_8 : OCI REQUEST          
    Total count  : OCI REQUEST             6
    Total time   : OCI REQUEST             534170
    Average time : OCI REQUEST             89028
   All TARGETS : OCI REQUEST          
    Total count  : OCI REQUEST             36
    Total time   : OCI REQUEST             2816673
    Average time : OCI REQUEST             78240
   LOG_ARCHIVE_DEST_3 : NETWORK SEND
    Total count  : NETWORK SEND            4
    Total time   : NETWORK SEND            207
    Average time : NETWORK SEND            51
    Total data buffers queued              2
    Total data buffers completed           2
    Total bytes written                    12288
    Total bytes completed synchronously    12288
    Average network send size (blocks)     0
    Average network send buffers           0.50
    Average buffer turnaround time         70
    Throughput (MB/s)                      0.00
   LOG_ARCHIVE_DEST_4 : NETWORK SEND
    Total count  : NETWORK SEND            4
    Total time   : NETWORK SEND            131
    Average time : NETWORK SEND            32
    Total data buffers queued              2
    Total data buffers completed           2
    Total bytes written                    12288
    Total bytes completed synchronously    12288
    Average network send size (blocks)     0
    Average network send buffers           0.50
    Average buffer turnaround time         26
    Throughput (MB/s)                      0.00
   LOG_ARCHIVE_DEST_5 : NETWORK SEND
    Total count  : NETWORK SEND            4
    Total time   : NETWORK SEND            138
    Average time : NETWORK SEND            34
    Total data buffers queued              2
    Total data buffers completed           2
    Total bytes written                    12288
    Total bytes completed synchronously    12288
    Average network send size (blocks)     0
    Average network send buffers           0.50
    Average buffer turnaround time         24
    Throughput (MB/s)                      0.00
   LOG_ARCHIVE_DEST_6 : NETWORK SEND
    Total count  : NETWORK SEND            4
    Total time   : NETWORK SEND            182
    Average time : NETWORK SEND            45
    Total data buffers queued              2
    Total data buffers completed           2
    Total bytes written                    12288
    Total bytes completed synchronously    12288
    Average network send size (blocks)     0
    Average network send buffers           0.50
    Average buffer turnaround time         24
    Throughput (MB/s)                      0.00
   LOG_ARCHIVE_DEST_7 : NETWORK SEND
    Total count  : NETWORK SEND            4
    Total time   : NETWORK SEND            129
    Average time : NETWORK SEND            32
    Total data buffers queued              2
    Total data buffers completed           2
    Total bytes written                    12288
    Total bytes completed synchronously    12288
    Average network send size (blocks)     0
    Average network send buffers           0.50
    Average buffer turnaround time         23
    Throughput (MB/s)                      0.00
   LOG_ARCHIVE_DEST_8 : NETWORK SEND
    Total count  : NETWORK SEND            4
    Total time   : NETWORK SEND            219
    Average time : NETWORK SEND            54
    Total data buffers queued              2
    Total data buffers completed           2
    Total bytes written                    12288
    Total bytes completed synchronously    12288
    Average network send size (blocks)     0
    Average network send buffers           0.50
    Average buffer turnaround time         97
    Throughput (MB/s)                      0.00
   All Targets : NETWORK SEND
    Total count  : NETWORK SEND            24
    Total time   : NETWORK SEND            1006
    Average time : NETWORK SEND            41
    Total data buffers queued              12
    Total data buffers completed           12
    Total bytes written                    73728
    Total bytes completed synchronously    73728
    Average network send size (blocks)     0
    Average network send buffers           0
    Throughput (MB/s)                      0.02
   Total network layer time                2817679
   Percentage of time in network           94.23
  Disk Statistics
    Total count  : DISK READ               2
    Total time   : DISK READ               590
    Average time : DISK READ               295
    Disk stall blocks                      2
    Total count  : BUFFER RELEASE          1
    Total time   : BUFFER RELEASE          1
    Average time : BUFFER RELEASE          1
   Total disk layer time                   591
   Percentage of time in disk layer        0.02
  Data Guard Processing Statistics
   Total DG layer time                     171945
   Percentage of time in DG layer          5.75
  Remote Server-Side Network Statistics
   LOG_ARCHIVE_DEST_3 : NETWORK GET
    Total count  : NETWORK GET             4
    Total bytes  : NETWORK GET             4096
    Total time   : NETWORK GET             73
    Average time : NETWORK GET             18
   LOG_ARCHIVE_DEST_4 : NETWORK GET
    Total count  : NETWORK GET             4
    Total time   : NETWORK GET             63
    Average time : NETWORK GET             15
   LOG_ARCHIVE_DEST_5 : NETWORK GET
    Total count  : NETWORK GET             4
    Total bytes  : NETWORK GET             4096
    Total time   : NETWORK GET             61
    Average time : NETWORK GET             15
   LOG_ARCHIVE_DEST_6 : NETWORK GET
    Total count  : NETWORK GET             4
    Total time   : NETWORK GET             122
    Average time : NETWORK GET             30
   LOG_ARCHIVE_DEST_7 : NETWORK GET
    Total count  : NETWORK GET             4
    Total bytes  : NETWORK GET             4096
    Total time   : NETWORK GET             95
    Average time : NETWORK GET             23
   LOG_ARCHIVE_DEST_8 : NETWORK GET
    Total count  : NETWORK GET             4
    Total bytes  : NETWORK GET             4096
    Total time   : NETWORK GET             72
    Average time : NETWORK GET             18
   All TARGETS : NETWORK GET
    Total count  : NETWORK GET             24
    Total bytes  : NETWORK GET             16384
    Total time   : NETWORK GET             486
    Average time : NETWORK GET             20
   LOG_ARCHIVE_DEST_3 : NETWORK STALL REAP
    Total count  : NETWORK STALL REAP      1
    Total time   : NETWORK STALL REAP      1002872
    Average time : NETWORK STALL REAP      1002872
   LOG_ARCHIVE_DEST_4 : NETWORK STALL REAP
    Total count  : NETWORK STALL REAP      2
    Total bytes  : NETWORK STALL REAP      4096
    Total time   : NETWORK STALL REAP      926514
    Average time : NETWORK STALL REAP      463257
   LOG_ARCHIVE_DEST_5 : NETWORK STALL REAP
    Total count  : NETWORK STALL REAP      1
    Total time   : NETWORK STALL REAP      676984
    Average time : NETWORK STALL REAP      676984
   LOG_ARCHIVE_DEST_6 : NETWORK STALL REAP
    Total count  : NETWORK STALL REAP      2
    Total bytes  : NETWORK STALL REAP      4096
    Total time   : NETWORK STALL REAP      384819
    Average time : NETWORK STALL REAP      192409
   LOG_ARCHIVE_DEST_7 : NETWORK STALL REAP
    Total count  : NETWORK STALL REAP      1
    Total time   : NETWORK STALL REAP      214786
    Average time : NETWORK STALL REAP      214786
   All TARGETS : NETWORK STALL REAP
    Total count  : NETWORK STALL REAP      7
    Total bytes  : NETWORK STALL REAP      8192
    Total time   : NETWORK STALL REAP      3205975
    Average time : NETWORK STALL REAP      457996
   Total server-side network layer time    3206461
   Percentage of time in network           107.23
  Remote Server-Side Disk Statistics
   LOG_ARCHIVE_DEST_3 : DISK WRITE
    Total count  : DISK WRITE              1
    Total time   : DISK WRITE              29
    Average time : DISK WRITE              29
   LOG_ARCHIVE_DEST_4 : DISK WRITE
    Total count  : DISK WRITE              1
    Total time   : DISK WRITE              64
    Average time : DISK WRITE              64
   LOG_ARCHIVE_DEST_5 : DISK WRITE
    Total count  : DISK WRITE              1
    Total time   : DISK WRITE              33
    Average time : DISK WRITE              33
   LOG_ARCHIVE_DEST_6 : DISK WRITE
    Total count  : DISK WRITE              1
    Total time   : DISK WRITE              53
    Average time : DISK WRITE              53
   LOG_ARCHIVE_DEST_7 : DISK WRITE
    Total count  : DISK WRITE              1
    Total time   : DISK WRITE              44
    Average time : DISK WRITE              44
   LOG_ARCHIVE_DEST_8 : DISK WRITE
    Total count  : DISK WRITE              1
    Total time   : DISK WRITE              21
    Average time : DISK WRITE              21
   All TARGETS : DISK WRITE
    Total count  : DISK WRITE              6
    Total time   : DISK WRITE              244
    Average time : DISK WRITE              40
   LOG_ARCHIVE_DEST_3 : DISK NOSTALL REAP
    Total count  : DISK NOSTALL REAP       2
    Total time   : DISK NOSTALL REAP       39
    Average time : DISK NOSTALL REAP       19
   LOG_ARCHIVE_DEST_4 : DISK NOSTALL REAP
    Total count  : DISK NOSTALL REAP       2
    Total time   : DISK NOSTALL REAP       51
    Average time : DISK NOSTALL REAP       25
   LOG_ARCHIVE_DEST_5 : DISK NOSTALL REAP
    Total count  : DISK NOSTALL REAP       2
    Total time   : DISK NOSTALL REAP       37
    Average time : DISK NOSTALL REAP       18
   LOG_ARCHIVE_DEST_6 : DISK NOSTALL REAP
    Total count  : DISK NOSTALL REAP       2
    Total time   : DISK NOSTALL REAP       38
    Average time : DISK NOSTALL REAP       19
   LOG_ARCHIVE_DEST_7 : DISK NOSTALL REAP
    Total count  : DISK NOSTALL REAP       2
    Total time   : DISK NOSTALL REAP       21
    Average time : DISK NOSTALL REAP       10
   LOG_ARCHIVE_DEST_8 : DISK NOSTALL REAP
    Total count  : DISK NOSTALL REAP       1
    Total time   : DISK NOSTALL REAP       4
    Average time : DISK NOSTALL REAP       4
   All TARGETS : DISK NOSTALL REAP
    Total count  : DISK NOSTALL REAP       11
    Total time   : DISK NOSTALL REAP       190
    Average time : DISK NOSTALL REAP       17
   LOG_ARCHIVE_DEST_3 : BUFFER GET
    Total count  : BUFFER GET              1
    Total time   : BUFFER GET              5
    Average time : BUFFER GET              5
   LOG_ARCHIVE_DEST_4 : BUFFER GET
    Total count  : BUFFER GET              1
    Total time   : BUFFER GET              4
    Average time : BUFFER GET              4
   LOG_ARCHIVE_DEST_5 : BUFFER GET
    Total count  : BUFFER GET              1
    Total time   : BUFFER GET              4
    Average time : BUFFER GET              4
   LOG_ARCHIVE_DEST_6 : BUFFER GET
    Total count  : BUFFER GET              1
    Total time   : BUFFER GET              5
    Average time : BUFFER GET              5
   LOG_ARCHIVE_DEST_7 : BUFFER GET
    Total count  : BUFFER GET              1
    Total time   : BUFFER GET              5
    Average time : BUFFER GET              5
   LOG_ARCHIVE_DEST_8 : BUFFER GET
    Total count  : BUFFER GET              1
    Total time   : BUFFER GET              4
    Average time : BUFFER GET              4
   All TARGETS : BUFFER GET
    Total count  : BUFFER GET              6
    Total time   : BUFFER GET              27
    Average time : BUFFER GET              4
   Total server-side disk layer time       461
   Percentage of time in disk layer        0.02
  Remote Server-Side Data Guard Processing Statistics
   LOG_ARCHIVE_DEST_3 : SLEEP TIME
    Total count  : SLEEP TIME              1
    Total time   : SLEEP TIME              10097
    Average time : SLEEP TIME              10097
   LOG_ARCHIVE_DEST_4 : SLEEP TIME
    Total count  : SLEEP TIME              1
    Total time   : SLEEP TIME              10166
    Average time : SLEEP TIME              10166
   LOG_ARCHIVE_DEST_5 : SLEEP TIME
    Total count  : SLEEP TIME              1
    Total time   : SLEEP TIME              10151
    Average time : SLEEP TIME              10151
   LOG_ARCHIVE_DEST_6 : SLEEP TIME
    Total count  : SLEEP TIME              1
    Total time   : SLEEP TIME              10086
    Average time : SLEEP TIME              10086
   LOG_ARCHIVE_DEST_7 : SLEEP TIME
    Total count  : SLEEP TIME              1
    Total time   : SLEEP TIME              10097
    Average time : SLEEP TIME              10097
   All TARGETS : SLEEP TIME
    Total count  : SLEEP TIME              5
    Total time   : SLEEP TIME              50597
    Average time : SLEEP TIME              10119
   LOG_ARCHIVE_DEST_3 : PUBLISH RTA BOUNDARY
    Total count  : PUBLISH RTA BOUNDARY    1
    Total time   : PUBLISH RTA BOUNDARY    6
    Average time : PUBLISH RTA BOUNDARY    6
   LOG_ARCHIVE_DEST_4 : PUBLISH RTA BOUNDARY
    Total count  : PUBLISH RTA BOUNDARY    1
    Total time   : PUBLISH RTA BOUNDARY    8
    Average time : PUBLISH RTA BOUNDARY    8
   LOG_ARCHIVE_DEST_5 : PUBLISH RTA BOUNDARY
    Total count  : PUBLISH RTA BOUNDARY    1
    Total time   : PUBLISH RTA BOUNDARY    5
    Average time : PUBLISH RTA BOUNDARY    5
   LOG_ARCHIVE_DEST_6 : PUBLISH RTA BOUNDARY
    Total count  : PUBLISH RTA BOUNDARY    1
    Total time   : PUBLISH RTA BOUNDARY    5
    Average time : PUBLISH RTA BOUNDARY    5
   LOG_ARCHIVE_DEST_7 : PUBLISH RTA BOUNDARY
    Total count  : PUBLISH RTA BOUNDARY    1
    Total time   : PUBLISH RTA BOUNDARY    4
    Average time : PUBLISH RTA BOUNDARY    4
   All TARGETS : PUBLISH RTA BOUNDARY
    Total count  : PUBLISH RTA BOUNDARY    5
    Total time   : PUBLISH RTA BOUNDARY    28
    Average time : PUBLISH RTA BOUNDARY    5
   LOG_ARCHIVE_DEST_3 : VALIDATE BUFFER
    Total count  : VALIDATE BUFFER         1
    Total time   : VALIDATE BUFFER         1
    Average time : VALIDATE BUFFER         1
   LOG_ARCHIVE_DEST_4 : VALIDATE BUFFER
    Total count  : VALIDATE BUFFER         1
    Total time   : VALIDATE BUFFER         2
    Average time : VALIDATE BUFFER         2
   LOG_ARCHIVE_DEST_5 : VALIDATE BUFFER
    Total count  : VALIDATE BUFFER         1
    Total time   : VALIDATE BUFFER         1
    Average time : VALIDATE BUFFER         1
   LOG_ARCHIVE_DEST_6 : VALIDATE BUFFER
    Total count  : VALIDATE BUFFER         1
    Total time   : VALIDATE BUFFER         2
    Average time : VALIDATE BUFFER         2
   LOG_ARCHIVE_DEST_7 : VALIDATE BUFFER
    Total count  : VALIDATE BUFFER         1
    Total time   : VALIDATE BUFFER         1
    Average time : VALIDATE BUFFER         1
   LOG_ARCHIVE_DEST_8 : VALIDATE BUFFER
    Total count  : VALIDATE BUFFER         1
    Total time   : VALIDATE BUFFER         1
    Average time : VALIDATE BUFFER         1
   All TARGETS : VALIDATE BUFFER
    Total count  : VALIDATE BUFFER         6
    Total time   : VALIDATE BUFFER         8
    Average time : VALIDATE BUFFER         1
   Total Server-Side DG layer time         18446744073709334909
   Percentage of time in DG layer          616903601029120.00
krsb_end: End stats dump


The trace file also gives a summary for all targets. For example, the total time for all OCI requests
All TARGETS : OCI REQUEST          
    Total time   : OCI REQUEST             2816673
comes from summing the OCI REQUEST total time values of each log archive destination:
   LOG_ARCHIVE_DEST_3 : OCI REQUEST          
    Total time   : OCI REQUEST             306754
   LOG_ARCHIVE_DEST_4 : OCI REQUEST          
    Total time   : OCI REQUEST             210057
   LOG_ARCHIVE_DEST_5 : OCI REQUEST          
    Total time   : OCI REQUEST             739457
   LOG_ARCHIVE_DEST_6 : OCI REQUEST          
    Total time   : OCI REQUEST             593044
   LOG_ARCHIVE_DEST_7 : OCI REQUEST          
    Total time   : OCI REQUEST             433191
   LOG_ARCHIVE_DEST_8 : OCI REQUEST          
    Total time   : OCI REQUEST             534170
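That is, 306754 + 210057 + 739457 + 593044 + 433191 + 534170 = 2816673 microseconds, which matches the All TARGETS total.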


For granular timing information on the disk and DG layers the "Remote Server-Side *" sections of the trace output must be referred to. This is something missing from the white paper, and therein lies the problem with this tracing. The section before the "Remote Server-Side *" sections (the one with output similar to what is shown in the white paper) has percentage values that add up to 100.
    Percentage of time in network           94.23
    Percentage of time in disk layer        0.02
    Percentage of time in DG layer          5.75
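In this trace the three layer times add up exactly to the file transfer time (2817679 + 591 + 171945 = 2990215 microseconds), so each percentage appears to be simply that layer's total time as a fraction of the file transfer time, e.g. 2817679 / 2990215 ≈ 94.23 for the network layer.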
As per the white paper, it is these values (from the source of the ASYNC transport) that are used to tune transport related issues. The remote sections, which seem to relate to the standby instances that receive the redo, are not considered for tuning.
The odd part is that the percentages under the "Remote Server-Side *" sections don't add up to 100.
Remote Server-Side Network Statistics
   Percentage of time in network           107.23
Remote Server-Side Disk Statistics
   Percentage of time in disk layer        0.02
Remote Server-Side Data Guard Processing Statistics
   Percentage of time in DG layer          616903601029120.00
These odd statistics were observed multiple times during tracing. Oracle support acknowledged that this is a bug, but as the original SR was about "how to trace" and that was resolved, they were eager to close the SR. As such there was no further investigation into the matter.
On a separate trace test case involving just one primary and one standby, the remote sections did produce percentage values that made sense.
krsb_end: Begin stats dump for T-1.S-5521

  max number of buffers in use             10
  Operation elapsed time (micro seconds)   229354206
  File transfer time (micro seconds)       229354197

  Network Statistics

   LOG_ARCHIVE_DEST_2 - OCI REQUEST
    Total count  - OCI REQUEST             7820
    Total time   - OCI REQUEST             137552
    Average time - OCI REQUEST             17

   LOG_ARCHIVE_DEST_2 - NETWORK SEND
    Total count  - NETWORK SEND            7817
    Total time   - NETWORK SEND            252474
    Average time - NETWORK SEND            32

    Total data buffers queued              7816
    Total data buffers completed           7816
    Total bytes written                    10035712
    Total bytes completed synchronously    9645056
    Average network send size (blocks)     2
    Average network send buffers           1.00
    Average buffer turnaround time         770
    Throughput (MB/s)                      0.04

   LOG_ARCHIVE_DEST_2 - NETWORK NO-STALL REAP
    Total count  - NETWORK NO-STALL REAP   36
    Total bytes  - NETWORK NO-STALL REAP   359424
    Total time   - NETWORK NO-STALL REAP   370
    Average time - NETWORK NO-STALL REAP   10

   LOG_ARCHIVE_DEST_2 - NETWORK STALL REAP
    Total count  - NETWORK STALL REAP      3
    Total bytes  - NETWORK STALL REAP      31232
    Total time   - NETWORK STALL REAP      82695
    Average time - NETWORK STALL REAP      27565

   Total network layer time                473091
   Percentage of time in network           0.21

  Disk Statistics

    Total count  - DISK READ               7816
    Total time   - DISK READ               53465
    Average time - DISK READ               6

    Log buffer blocks                      19600
    Total count  - BUFFER RELEASE          7815
    Total time   - BUFFER RELEASE          7972
    Average time - BUFFER RELEASE          1

   Total disk layer time                   61437
   Percentage of time in disk layer        0.03

  Data Guard Processing Statistics

    Total count  - SLEEP                   7814
    Total time   - SLEEP                   228139616
    Average time - SLEEP                   29196

   Total DG layer time                     228819669
   Percentage of time in DG layer          99.77

  Remote Server-Side Network Statistics

   LOG_ARCHIVE_DEST_2 - NETWORK GET
    Total count  - NETWORK GET             23449
    Total bytes  - NETWORK GET             8112128
    Total time   - NETWORK GET             86691
    Average time - NETWORK GET             3

   LOG_ARCHIVE_DEST_2 - NETWORK STALL REAP
    Total count  - NETWORK STALL REAP      6450
    Total bytes  - NETWORK STALL REAP      1923584
    Total time   - NETWORK STALL REAP      224942932
    Average time - NETWORK STALL REAP      34874

   Total server-side network layer time    225029623
   Percentage of time in network           98.11

  Remote Server-Side Disk Statistics

   LOG_ARCHIVE_DEST_2 - DISK WRITE
    Total count  - DISK WRITE              7816
    Total time   - DISK WRITE              3977689
    Average time - DISK WRITE              508

   LOG_ARCHIVE_DEST_2 - DISK NOSTALL REAP
    Total count  - DISK NOSTALL REAP       7816
    Total blocks - DISK NOSTALL REAP       19601
    Total time   - DISK NOSTALL REAP       3088
    Average time - DISK NOSTALL REAP       0

   LOG_ARCHIVE_DEST_2 - BUFFER GET
    Total count  - BUFFER GET              7816
    Total time   - BUFFER GET              3059
    Average time - BUFFER GET              0

   LOG_ARCHIVE_DEST_2 - BUFFER RELEASE
    Total count  - BUFFER RELEASE          7816
    Total time   - BUFFER RELEASE          2001
    Average time - BUFFER RELEASE          0

   Total server-side disk layer time       3985837
   Percentage of time in disk layer        1.74

  Remote Server-Side Data Guard Processing Statistics

   LOG_ARCHIVE_DEST_2 - PUBLISH RTA BOUNDARY
    Total count  - PUBLISH RTA BOUNDARY    7816
    Total time   - PUBLISH RTA BOUNDARY    25116
    Average time - PUBLISH RTA BOUNDARY    3

   LOG_ARCHIVE_DEST_2 - VALIDATE BUFFER
    Total count  - VALIDATE BUFFER         7816
    Total time   - VALIDATE BUFFER         3041
    Average time - VALIDATE BUFFER         0

   Total Server-Side DG layer time         338737
   Percentage of time in DG layer          0.15

krsb_end: End stats dump


In this case the percentages on each layer were
Percentage of time in network           0.21
Percentage of time in disk layer        0.03
Percentage of time in DG layer          99.77
For the remote sections
Remote Server-Side Network Statistics
   Percentage of time in network           98.11
Remote Server-Side Disk Statistics
   Percentage of time in disk layer        1.74
Remote Server-Side Data Guard Processing Statistics
   Percentage of time in DG layer          0.15
So the statistics related issue appears to occur when tracing redo that is sent to multiple destinations.
In summary, if async redo transport is to be traced when a far sync instance is present, use event 16421 with level 2, and check that the values generated in the trace file make sense.

Monday, April 19, 2021

Enable/Disable ASM Rebalance Compat Phase

The rebalance operation in ASM consists of three phases, one of which is compact. As per 1902001.1, the compact phase moves data closer to the outer tracks of the disk. This could be observed while adding or removing a disk from an ASM diskgroup.
INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
         1            1 REBAL COMPACT   RUN           1          1       1378          0        991           0                                                       0
         1            1 REBAL REBALANCE DONE          1          1       3930       3930          0           0                                                       0
         1            1 REBAL REBUILD   DONE          1          1          0          0          0           0                                                       0



   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
         1            1 REBAL COMPACT   REAP          1          1       1400          0          0           0                                                       0
         1            1 REBAL REBALANCE DONE          1          1       3930       3930          0           0                                                       0
         1            1 REBAL REBUILD   DONE          1          1          0          0          0           0                                                       0
As seen above, the COMPACT phase runs, does some work and comes to a conclusion (REAP status).
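Output like the above can be obtained by querying gv$asm_operation while the operation is in progress, for example:
select inst_id, group_number, operation, pass, state, power, actual,
       sofar, est_work, est_rate, est_minutes, error_code, con_id
from gv$asm_operation;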

However, the track concept is not valid for SSDs (see here and here). As such the compact phase could be skipped entirely to save time on rebalance. This could be done by setting "_DISABLE_REBALANCE_COMPACT"=TRUE on the ASM instance (pre-12c systems) or by setting '_rebalance_compact'='FALSE' on the ASM diskgroup (12c and up). The key difference (besides the different names) is that one is set at the ASM instance level while the other is a diskgroup attribute.
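A sketch of the two approaches (the diskgroup name DATA is illustrative; as these are underscore settings, confirming with Oracle support before production use is advisable):
-- 12c and up: diskgroup attribute
alter diskgroup data set attribute '_rebalance_compact'='FALSE';
-- pre-12c: ASM instance parameter (may need scope=spfile and an ASM instance restart)
alter system set "_disable_rebalance_compact"=true;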



When this is done the rebalance operation skips the compact phase and brings it to a conclusion straight away. The same operation as above, with _rebalance_compact set to FALSE on the diskgroup, shows the following.
   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
         1            1 REBAL COMPACT   WAIT          1          1          0          0          0           0                                                       0
         1            1 REBAL REBALANCE RUN           1          1       3892       3930       1857           0                                                       0
         1            1 REBAL REBUILD   DONE          1          1          0          0          0           0                                                       0



  INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
         1            1 REBAL COMPACT   REAP          1          1          0          0          0           0                                                       0
         1            1 REBAL REBALANCE DONE          1          1       3930       3930          0           0                                                       0
         1            1 REBAL REBUILD   DONE          1          1          0          0          0           0                                                       0
So this looks like a best practice to have on systems using flash disks. But this is a hidden parameter, and consulting Oracle support on its merit didn't inspire confidence. Oracle support's comment was "we don't have official information from development that compact phase can be disabled as per the Best practices". There may be some validity to this, as 2198773.1 shows that setting _rebalance_compact to false could cause issues with the rebalance operation completing (it doesn't mention what type of disks were used). MOS doc 1573768.1, on the other hand, advises setting _DISABLE_REBALANCE_COMPACT=TRUE. However, testing the above scenarios on 19c (multiple times) didn't result in any issues.
All in all, there's no conclusive advice on the matter. There are plenty of Oracle community posts recommending setting _rebalance_compact to false, while official Oracle support and documents state otherwise.

Related Metalink notes
WARNING: rebalance not completed for group [ID 2198773.1]
What is ASM rebalance compact Phase and how it can be disabled [ID 1902001.1]
Can ASM "_DISABLE_REBALANCE_COMPACT=TRUE" Be Used With NetApp SAN Environment? [ID 1573768.1]

Monday, April 12, 2021

Plugging in a PDB in Data Guard Configuration

An earlier post shows how to use the STANDBY_PDB_SOURCE_FILE_DIRECTORY parameter to reuse the data files of a standby non-CDB when the primary non-CDB is plugged in as a PDB. This post shows how the same parameter could help when plugging an unplugged PDB into the primary. It is assumed the PDB has already been unplugged.
If a PDB is plugged into the primary and the standby is unable to find the data files mentioned, redo apply on the standby will stop. Primary plugin command
 create pluggable database devpdb using '/opt/installs/pdbarc/devpdb.xml';
On standby alert log
2021-02-17T14:28:48.909759+00:00
Errors in file /opt/app/oracle/diag/rdbms/testfs2/testfs2/trace/testfs2_pr00_27573.trc:
ORA-01274: cannot add data file that was originally created as '/opt/data/TESTFS/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_system_j2tb048o_.dbf'
ORA-01565: error in identifying file '/opt/data/TESTFS/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_system_j2tb048o_.dbf'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 7
2021-02-17T14:28:50.031565+00:00
Background Media Recovery process shutdown (testfs2)
This situation must be manually resolved, either one file at a time with a command similar to
run{
 set newname for datafile 63 to new;
 restore datafile 63 from service testfstns;
 switch datafile all;
 }
or at the PDB level
run {
 set newname for pluggable database devpdb to new;
 restore pluggable database devpdb from service PRODCDBTNS;
 switch datafile all;
}
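Once the missing files are restored, redo apply on the standby needs to be restarted. A sketch, using the standby name testfs2 from the alert log above:
DGMGRL> edit database testfs2 set state='APPLY-ON';
or directly with SQL*Plus on the standby
SQL> alter database recover managed standby database disconnect from session;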
This manual work could be avoided if the data files were made available to the standby instance prior to the plugin operation. Copy the data files of the PDB being plugged in to a location the standby has access to and update the STANDBY_PDB_SOURCE_FILE_DIRECTORY parameter with that location.
$ pwd
/opt/installs/pdbfiles
$ ls
system.323.1064753757     sysaux.321.1064753757    
users.312.1064753751      undotbs1.322.1064753757     
oradbaudit.328.1064753751 ...

alter system set standby_pdb_source_file_directory='/opt/installs/pdbfiles';


Execute the plugin command on the primary
create pluggable database devqa using '/opt/installs/pdbarc/devpdb.xml' copy;
Observe the alert log on the standby for detection and creation of the data files related to the PDB.
2021-02-17T15:29:19.457346+00:00
Recovery created pluggable database DEVQA
Recovery attempting to copy datafiles for pdb-DEVQA from           source dir-/opt/installs/pdbfiles
2021-02-17T15:29:25.313016+00:00
Recovery copied files for tablespace SYSTEM
Recovery successfully copied file /opt/data/TESTFS2/BB6378EF1A471937E053F00C1FAC6AAD/datafile/o1_mf_system_j2tfkm7v_.dbf from /opt/installs/pdbfiles/system.323.1064753757
DEVQA(4):Datafile 64 added to flashback set
DEVQA(4):Successfully added datafile 64 to media recovery
DEVQA(4):Datafile #64: '/opt/data/TESTFS2/BB6378EF1A471937E053F00C1FAC6AAD/datafile/o1_mf_system_j2tfkm7v_.dbf'
2021-02-17T15:29:30.214394+00:00
Recovery copied files for tablespace SYSAUX
Recovery successfully copied file /opt/data/TESTFS2/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_sysaux_j2tfkm7x_.dbf from /opt/installs/pdbfiles/sysaux.321.1064753757
DEVQA(4):Datafile 65 added to flashback set
DEVQA(4):Successfully added datafile 65 to media recovery
DEVQA(4):Datafile #65: '/opt/data/TESTFS2/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_sysaux_j2tfkm7x_.dbf'
2021-02-17T15:29:36.450080+00:00
Recovery copied files for tablespace UNDOTBS1
Recovery successfully copied file /opt/data/TESTFS2/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_undotbs1_j2tfkm7y_.dbf from /opt/installs/pdbfiles/undotbs1.322.1064753757
DEVQA(4):Datafile 66 added to flashback set
DEVQA(4):Successfully added datafile 66 to media recovery
DEVQA(4):Datafile #66: '/opt/data/TESTFS2/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_undotbs1_j2tfkm7y_.dbf'
Recovery copied files for tablespace USERS
Recovery successfully copied file /opt/data/TESTFS2/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_users_j2tfkm7z_.dbf from /opt/installs/pdbfiles/users.312.1064753751
DEVQA(4):Datafile 67 added to flashback set
DEVQA(4):Successfully added datafile 67 to media recovery
DEVQA(4):Datafile #67: '/opt/data/TESTFS2/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_users_j2tfkm7z_.dbf'
Recovery copied files for tablespace ORADBAUDIT
Recovery successfully copied file /opt/data/TESTFS2/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_oradbaud_j2tfkm80_.dbf from /opt/installs/pdbfiles/oradbaudit.328.1064753751
DEVQA(4):Datafile 68 added to flashback set
DEVQA(4):Successfully added datafile 68 to media recovery
DEVQA(4):Datafile #68: '/opt/data/TESTFS2/BAFDC660C6050F62E053060C1FAC3C30/datafile/o1_mf_oradbaud_j2tfkm80_.dbf'


The steps and the result are the same whether active data guard is used or not.

However, this method doesn't work when the PDB is plugged in using a compressed archive file (.pdb file). Making the .pdb file available in the location referred to by STANDBY_PDB_SOURCE_FILE_DIRECTORY did not result in the standby uncompressing it and creating the required data files.

Related Posts
Plugging in non-CDB to CDB and reusing the non-CDB Datafiles in Standby
Creating a PDB in a Data Guard Configuration

Monday, April 5, 2021

LOG_FILE_NAME_CONVERT is not defined, stop clearing ORLs

The LOG_FILE_NAME_CONVERT parameter is used to correct the file location of log files when a database is being created from another database and the source database had a different directory structure. It is used in a data guard configuration when the standby and primary have different file structures. However, there are a few limitations to using this parameter, and one of them is that it cannot be used with Oracle managed files, or OMF (refer 1367014.1).
However, an odd behaviour was observed in a data guard configuration where, even though OMF was in use, the database was looking for LOG_FILE_NAME_CONVERT and, when it was not set, stopped the automatic logfile clearing.
The databases (both primary and standby) had the following OMF related parameters
NAME                                 TYPE        VALUE
------------------------------------ ----------- -------
db_create_file_dest                  string      +DATA
db_create_online_log_dest_1          string
db_create_online_log_dest_2          string
db_create_online_log_dest_3          string
db_create_online_log_dest_4          string
db_create_online_log_dest_5          string
db_recovery_file_dest                string      +FRA
The *convert* parameters were not set.
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------
db_file_name_convert                 string
log_file_name_convert                string
pdb_file_name_convert                string
During the active database duplication, the duplication process detected the use of OMF, as seen from the output below.
...
sql statement: alter system set  local_listener =  ''LISTENER_DGTEST2'' comment= '''' scope=spfile

Oracle instance shut down

connected to auxiliary database (not started)
Oracle instance started

Total System Global Area    9965665616 bytes

Fixed Size                    12684624 bytes
Variable Size               1543503872 bytes
Database Buffers            8388608000 bytes
Redo Buffers                  20869120 bytes
duplicating Online logs to Oracle Managed File (OMF) location
duplicating Datafiles to Oracle Managed File (OMF) location

contents of Memory Script:
{
   sql clone "alter system set  control_files =
  ''+DATA/DGTEST2/CONTROLFILE/current.264.1067949389'', ''+FRA/DGTEST2/CONTROLFILE/current.418.1067949389'' comment=
  ...
The data guard configuration consists of two databases.
DGMGRL>Configuration - test_dg

  Protection Mode: MaxAvailability
  Members:
  dgtest  - Primary database
    dgtest2 - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS
With these settings in place, the following was seen in the alert log of the old primary (new standby) after a switchover.
2021-03-23T12:55:20.314959+00:00
TT02 (PID:8514): LOG_FILE_NAME_CONVERT is not defined, stop clearing ORLs
Validating the old primary (new standby) showed that the online redo logs were not cleared.
DGMGRL> validate database dgtest

  Database Role:     Physical standby database
  Primary Database:  dgtest2

  Ready for Switchover:  Yes
  Ready for Failover:    Yes (Primary Running)

  Managed by Clusterware:
    dgtest2:  YES
    dgtest :  YES

  Log Files Cleared:
    dgtest2 Standby Redo Log Files:  Cleared
    dgtest Online Redo Log Files:    Not Cleared
    dgtest Standby Redo Log Files:   Available
It was the same case when the new primary (old standby) went through a switchover as well.
2021-03-23T13:02:25.800263+00:00
TT02 (PID:9657): LOG_FILE_NAME_CONVERT is not defined, stop clearing ORLs

DGMGRL> validate database dgtest2

  Database Role:     Physical standby database
  Primary Database:  dgtest

  Ready for Switchover:  Yes
  Ready for Failover:    Yes (Primary Running)

  Managed by Clusterware:
    dgtest :  YES
    dgtest2:  YES

  Log Files Cleared:
    dgtest Standby Redo Log Files:   Cleared
    dgtest2 Online Redo Log Files:   Not Cleared
    dgtest2 Standby Redo Log Files:  Available


This behaviour was observed even after applying release updates (RU) 19.3, 19.4, 19.5, 19.6 and 19.7. If the databases are on one of those RUs, then the only way to achieve automatic ORL clearing is to explicitly set the log file name convert parameter. For example, in this case for the primary DB dgtest
alter system set log_file_name_convert='/dgtest2/','/dgtest/' scope=spfile;
Once log_file_name_convert is set, automatic ORL clearance was observed, even though the MOS doc states that log_file_name_convert cannot be used with OMF.

The issue is fixed in RU 19.8 and up (tested on 19.9 and 19.10). Once these RUs are applied, automatic ORL clearing happens as expected. If log_file_name_convert was previously set (on one of the affected RUs), it could be removed once RU 19.8 or above is applied.
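Removing the workaround could be done with something like the following (log_file_name_convert is a static parameter, so the reset goes to the spfile and takes effect at the next restart):
alter system reset log_file_name_convert scope=spfile;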
When automatic ORL clearing happens, the following can be seen in the alert log of the old primary
2021-03-23T13:20:23.135283+00:00
TT02 (PID:10641): Waiting for all non-current ORLs to be archived
2021-03-23T13:20:23.135414+00:00
TT02 (PID:10641): All non-current ORLs have been archived
2021-03-23T13:20:23.273380+00:00
TT02 (PID:10641): Clearing ORL LNO:1 +DATA/DGTEST/ONLINELOG/group_1.264.1067878075
2021-03-23T13:20:24.512745+00:00
TT02 (PID:10641): Clearing ORL LNO:1 complete
TT02 (PID:10641): Clearing ORL LNO:2 +DATA/DGTEST/ONLINELOG/group_2.262.1067878077
Clearing online log 2 of thread 1 sequence number 34
2021-03-23T13:20:26.525504+00:00
TT02 (PID:10641): Clearing ORL LNO:2 complete
TT02 (PID:10641): Clearing ORL LNO:3 +DATA/DGTEST/ONLINELOG/group_3.261.1067878079
Clearing online log 3 of thread 1 sequence number 35
2021-03-23T13:20:28.163683+00:00
TT02 (PID:10641): Clearing ORL LNO:3 complete
2021-03-23T13:20:33.174192+00:00
TT02 (PID:10641): Waiting for all non-current ORLs to be archived
2021-03-23T13:20:33.174386+00:00
TT02 (PID:10641): All non-current ORLs have been archived
Validating the standby shows no log files to be cleared.
DGMGRL> validate database dgtest2

  Database Role:     Physical standby database
  Primary Database:  dgtest

  Ready for Switchover:  Yes
  Ready for Failover:    Yes (Primary Running)

  Managed by Clusterware:
    dgtest :  YES
    dgtest2:  YES

Useful Metalink notes
Usage and Limitation of db_file_name_convert and log_file_name_convert [ID 1367014.1]