Friday, October 15, 2021

ORA-16856: transport lag could not be determined on Far Sync After Archive Logs Delete With DBRU 19.12 Applied

The behaviour mentioned in this post only started to appear after applying the DBRU 19.12. Prior to applying the DBRU there was no issue of ORA-16856 with the combination of conditions mentioned in the post. System has been in operation for sometime.
After applying DBRU 19.12 noticed increase number of alerts related to ORA-16856 coming from a data guard configuration that has far sync instance in it. Behaviour was observed not one but several such environments. On the data guard broker log of the primary DB entries similar to below could be seen.
2021-08-19T14:08:17.271+00:00
Data Guard Broker Status Summary:
Type Name Severity Status
Configuration tst_dg Warning ORA-16608: one or more members have warnings
Primary Database tstdb1 Success ORA-0: normal, successful completion
Physical Standby Database tstdb2 Success ORA-0: normal, successful completion
Far Sync Instance tstfs1 Warning ORA-16856: transport lag could not be determined
Physical Standby Database tstdb6 Success ORA-0: normal, successful completion
Physical Standby Database tstdb5 Success ORA-0: normal, successful completion
No ORA-16856 related entreis were shown on the far sync instance logs. The data guard broker goes to a warning state for several minutes and rectify itself aftwards.
Analysis of the data guard broker logs showed that ORA-16856 was starting at the same minute of the hour and every hour. Further investigation showed that that first occurence of ORA-16856 corresponds to archive log deletion on the far sync instance. Far sync had archive deletion policy set to applied on all standby (later it was found that archive log deletion policy has no relavence to this issue) and deleting all the archive logs.
There were no overlapping rman activities between far sync and primary or other standby instances. Neverthless the archive log deletion on far sync was shifted to a different time. After that the ORA-16856 error started to appear at the newly scheduled time for the archive log deletion on far sync. This confirmed that issue is caused by the archive log deletion on far sync.
At this stage, looked at when did the new pattern of ORA-16856 started. It was confirmed current pattern of (every hour at the same minute) ORA-16856 startd after the DBRU 19.12 was applied. As the system has been operational for number of years there was enough historic log entries to compare against to see if pattern was there before as well. It was not.
To further isolate the possible root cause, archive log deletion was stopped on far sync in one of the non critical environments that exhibit the behaviour. Once the archive log deletion was stopped there was no recurring ORA-16856 as before. This indicated the log archive deletion could be the possible root cause.
As the next step the log archive deletion command was changed from
delete noprompt archivelog all;
to
delete noprompt archivelog all completed before 'sysdate - 1/24';
With this not all the archive logs were getting deleted. Some were omitted from the deletion as they fell out side the completed before time period. After the rman comamnd change the occurence of periodic ORA-16856 also stopped.



After this it was possible to manually recreate the issue, at will, by running a full archive log deletion on the far sync. Issue doesn't happen when all archive logs are deleted on standby databases. It only happens when all archive logs are deleted on the far sync instance.
As a workaround to the problem, in production systems, on far sync instances the archive log deletion command was changed so that some archive logs would always remain.
Hypothesis was that to calculate the transport lag far sync instnace uses archive log entries. Removing all of them makes it unable to calculate the transport lag hence the ORA-16856.
Raised a SR and Oracle support asked to recreate the issue with 'TraceLevel' = 'SUPPORT' in the DG broker. With this trace level when the issue was recreated far sync DG broker log also output the line related to ORA-16856 as shown below.
2021-10-01T11:10:21.900+00:00 DMON: Checking critical status of this database.
2021-10-01T11:10:21.900+00:00 00000000 1128079456 DMON: Broker configuration file successfully loaded for EDIT CONFIGURATION RE/SET PROPERTY
2021-10-01T11:10:21.900+00:00 INSV: Reply received for message with
2021-10-01T11:10:21.901+00:00 req ID 1.1.1128079456, opcode EDIT_DRC_PROP, phase RESYNCH
2021-10-01T11:10:21.901+00:00 00000000 1128079456 DMON: rfm_release_chief_lock() called for EDIT_DRC_PROP, called from rfmspd
2021-10-01T11:11:50.183+00:00 RSM0: HEALTH CHECK WARNING: ORA-16856: transport lag could not be determined
2021-10-01T11:11:50.234+00:00 03001000 1447213588 Operation HEALTH_CHECK continuing with warning, status = ORA-16856
2021-10-01T11:11:50.235+00:00 03001000 1447213588 Operation HEALTH_CHECK continuing with warning, status = ORA-16856
SR has resulted in an internal bug 33427973. No permanent solution yet. SR continues.