Showing posts with label FCF. Show all posts
Showing posts with label FCF. Show all posts

Sunday, July 1, 2018

JDBC Client Failover in Data Guard Configuration with PDBs

This post gives the highlights of setting up JDBC client failover in a data guard configuration with PDBs. For comprehensive set of steps refer the following white papers for 12c and for 11g.
The post shows how JDBC could be setup in an application that connects to single instance database in an Oracle restart such that JDBC connection failover to standby when a switchover or failover happens. The data guard setup used for this case is the same setup mentioned in the earlier post oracle Data Guard on 12.2 CDB with Oracle Restart.
1. By default ONS is disabled and offline in Oracle restart. In order to send FAN events ONS must be enabled and started in Oracle restart. This should be done in both primary and standby nodes.
srvctl enable ons
srvctl start ons
Once done check ONS status on both primary and standby.
crsctl stat res ora.ons
NAME=ora.ons
TYPE=ora.ons.type
TARGET=ONLINE
STATE=ONLINE on city7

crsctl stat res ora.ons
NAME=ora.ons
TYPE=ora.ons.type
TARGET=ONLINE
STATE=ONLINE on city7s
When ONS is enabled, stopping HAS throws up following error
crsctl stop has
...
CRS-2673: Attempting to stop 'ora.ons' on 'city7s'
CRS-5014: Agent "ORAAGENT" timed out starting process "/opt/app/oracle/product/12.2.0/grid/opmn/bin/onsctli" for action "stop": details at "(:CLSN00009:)" in "/opt/app/oracle/diag/crs/city7s/crs/trace/ohasd_oraagent_grid.trc"
CRS-2675: Stop of 'ora.ons' on 'city7s' failed
CRS-2679: Attempting to clean 'ora.ons' on 'city7s'
CRS-2681: Clean of 'ora.ons' on 'city7s' succeeded
This only happens during stopping of HAS and no such issue during start up of HAS and ONS service gets started along with other services. This appear to be a known issue in other version relating to RAC but nothing could be found on MOS with regard to 12.2 Oracle restart. SR was raised and this is being investigated under bug 28134413. In spite of this issue the failover works as expected.
Update: 2020-01-28 - As a result of the SR raised Oracle has created MOS doc 2631403.1 which now states this is expected behavior on SIHA.

2. Create a service and associate it with a PDB for application to connect. It's important that application connect to the database using the service for failover to work in the event of role transition. When a service is created for PDB, the PDB could be brought up by starting the service. However, stopping the service doesn't bring down the PDB but only the service is stopped. Following service was created on primary PDB.
srvctl add service -db prodcdb -pdb pdbapp1 -service devsrv -role PRIMARY -notification TRUE -failovertype NONE -failovermethod NONE -failoverdelay 0 -failoverretry 0 

srvctl config service -d prodcdb -s devsrv

Service name: devsrv
Cardinality: SINGLETON
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: true
Global: false
Commit Outcome: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Failover restore: NONE
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Pluggable database name: pdbapp1
Maximum lag time: ANY
SQL Translation Profile:
Retention: 86400 seconds
Replay Initiation Time: 300 seconds
Drain timeout:
Stop option:
Session State Consistency: DYNAMIC
GSM Flags: 0
Service is enabled
Following service was created on standby PDB to be active when standby becomes primary. The service name must be same as the service created in the primary.
srvctl add service -db stbycdb -pdb pdbapp1 -service devsrv -role PRIMARY -notification TRUE -failovertype NONE -failovermethod NONE -failoverdelay 0 -failoverretry 0

srvctl config service -d stbycdb -s devsrv
Service name: devsrv
Cardinality: SINGLETON
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: true
Global: false
Commit Outcome: false
Failover type: NONE
Failover method: NONE
TAF failover retries: 0
TAF failover delay: 0
Failover restore: NONE
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Pluggable database name: pdbapp1
Maximum lag time: ANY
SQL Translation Profile:
Retention: 86400 seconds
Replay Initiation Time: 300 seconds
Drain timeout:
Stop option:
Session State Consistency: DYNAMIC
GSM Flags: 0
Service is enabled
Once service is created make sure patch for bug 26439462 (Doc ID 26439462.8) is applied. This bug prevents the bringing up of the PDB service automatically after role transition. Applying the latest RU for 12.2 (at the time of the testing it was 12.2.0.1.180417) resolved this issue. If the PDB service doesn't automatically starts then the JDBC failover will fail. This could be tested by carrying out a switchover and checking if the PDB service automatically comes up after role transition.

3. Configure the JDBC client to use UCP and enable FCF by setting ONS configuration settings. Details of this could be found on the 12c1 white paper and high availability best practice guide.



4. Create a TNS entry containing both primary and standby hosts and the service name created earlier. Use this TNS entry to connect to the database. To avoid ORA-12514 set (RETRY_COUNT x RETRY_DELAY) such that it is slightly higher than the total time for the switchover and to start the service.
DGTNS =
  (DESCRIPTION =
    (FAILOVER = on)(CONNECT_TIMEOUT=60)(RETRY_COUNT=40)(RETRY_DELAY=2)(TRANSPORT_CONNECT_TIMEOUT=1)
    (ADDRESS_LIST =
      (LOAD_BALANCE = yes)
      (ADDRESS = (PROTOCOL = TCP)(HOST = city7.domain.net)(PORT = 1581))
      (ADDRESS = (PROTOCOL = TCP)(HOST = city7s.domain.net)(PORT = 1581))
    )
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = devsrv)
    )
  )
5. Start the application and verify application server IP is listed in the ONS subscription list in each database server.

6. Do a switchover and check the application connectivity. For testing purpose a java application was created to output the connected database. The output below shows that initially it was connected to the PDB in the stbycdb CDB which was primary at that time. During the switchover time period while the standby DB is made primary and PDB and associated service is started connection could error. Once the service is up the JDBC connections succeeds.
Connected to stbycdb DB Server hpc1.domain.net Application Server on Mon May 21 09:55:50 BST 2018
Connected to stbycdb DB Server hpc1.domain.net Application Server on Mon May 21 09:55:51 BST 2018
Connected to stbycdb DB Server hpc1.domain.net Application Server on Mon May 21 09:55:52 BST 2018
Connected to prodcdb DB Server hpc1.domain.net Application Server on Mon May 21 09:56:10 BST 2018
Connected to prodcdb DB Server hpc1.domain.net Application Server on Mon May 21 09:56:11 BST 2018
Connected to prodcdb DB Server hpc1.domain.net Application Server on Mon May 21 09:56:13 BST 2018

Related Post
23ai JDBC Driver Does Not Consider RETRY_COUNT and RETRY_DELAY

Wednesday, March 2, 2011

Fast Connection Failover Behavior Change in 11.2

Fast Connection Failover (FCF) allows jdbc connections to failover to surviving nodes in a RAC when one or more nodes are unavailable. This is very useful to have with middleware application server connection pools so when rolling upgrades or patch application is done on the database the connections in the middleware server would failover to surviving instances and client connections would not receive any errors.

It seem with 11.2.0.2 and 11.2.0.1 this doesn't work same as in 11gR1 and 10gR2.

There are two metalink notes available to test fcf on 10gR2 (How to Verify and Test Fast Connection Failover (FCF) Setup from a JDBC Thin Client Against a 10.2.x RAC Cluster [ID 433827.1]) and 11gR1 (Fast Connection Failover (FCF) Test Client Using 11g JDBC Driver and 11g RAC Cluster [ID 566573.1]).

In this case the java code provided with 566573.1 was used to test the failover of 10gR2 (10.2.0.4) , 11gR1 (11.1.0.7) and 11gR2 (11.2.0.1 and 11.2.0.2)

There are some differences between 433827.1 and 566573.1 note. In 10gR2 note the ons configuration uses vip hostname whereas in the 11gR1 note machine hostname is used. But for the testing purpose vip hostnames were used throughout (and 11gR1 has been deployed on production system using the vip hostname for ons configuration and failover worked without any problem), for 11gR2 it was tested with both vip hostname and hostname but there was no change in the result.

Furthermore the ons remote port on 11gR2 is changed to 6200 (same as 10gR2) but on 11gR1 it was 6251. The onsctl ping command on 10gR2 and 11gR1 provided the remote port this is no longer the case on 11gR2 but remote port could be observed with onsctl debug.
$GI_HOME/bin/onsctl ping
ons is running ...

$GI_HOME/bin/onsctl debug
HTTP/1.1 200 OK
Content-Length: 2488
Content-Type: text/html
Response:

== rac4.domain.net:6200 3640 11/03/02 15:36:32 ==
Home: /opt/app/11.2.0/grid

======== ONS ========
IP ADDRESS                   PORT    TIME   SEQUENCE  FLAGS
--------------------------------------- ----- -------- -------- --------
192.168.0.86  6200 4d6e6210 00000002 00000008

Listener:

TYPE                BIND ADDRESS               PORT  SOCKET
-------- --------------------------------------- ----- ------
Local                                  127.0.0.1  6100      5
Remote                                       any  6200      7
Remote                                       any  6200      
ojdbc6.jar and ons.jar from a 11gR2 home was used for all the test.

The test code gets 5 jdbc connections from the connection pool and prints to which instance these connections point to and close them and waits 30 seconds and re-tries the same procedure. During the 30 second wait one of the instnaces is shutdown with
srvctl stop instance -d dbname -i instancename -o abort
If the FCF is working properly then connections to the aborted instance would have been cleaned up and all the connections in the pool will be valid connections, connected to surviving node(s).

Output from running against the 10gR2 RAC
java support.au.fcf11g.FCFDemo
>> PROGRAM using JDBC thin driver, no oracle client required
>> ojdbc5.jar (for JDK 1.5) or ojdbc6.jar (for JDK 1.6) and ons.jar from $ORACLE_HOME/opmn/lib directory must be in the CLASSPATH
>> Press CNTRL C to exit running program

** Retrieving 5 connections from pool **
5 connections have been retrieved..

=============
Database Product Name is ... Oracle
Database Product Version is  Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
=============
JDBC Driver Name is ........ Oracle JDBC driver
JDBC Driver Version is ..... 11.2.0.2.0
JDBC Driver Major Version is 11
JDBC Driver Minor Version is 2
=============

-- Connection number #1
Instance -> rac10g21
Host -> rac1
Service -> rac10g2.domain.net
-- Connection number #2
Instance -> rac10g21
Host -> rac1
Service -> rac10g2.domain.net
-- Connection number #3
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #4
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #5
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
Woke up, will continue now..

** Retrieving 5 connections from pool **
5 connections have been retrieved..

-- Connection number #1
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #2
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #3
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #4
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #5
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
The two connections that were connected to instance rac10g21 has been cleaned up and all the connections now point to rac10g22.

Output from running against 11gR1 RAC
java support.au.fcf11g.FCFDemo
>> PROGRAM using JDBC thin driver, no oracle client required
>> ojdbc5.jar (for JDK 1.5) or ojdbc6.jar (for JDK 1.6) and ons.jar from $ORACLE_HOME/opmn/lib directory must be in the CLASSPATH
>> Press CNTRL C to exit running program

** Retrieving 5 connections from pool **
5 connections have been retrieved..

=============
Database Product Name is ... Oracle
Database Product Version is  Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
=============
JDBC Driver Name is ........ Oracle JDBC driver
JDBC Driver Version is ..... 11.2.0.2.0
JDBC Driver Major Version is 11
JDBC Driver Minor Version is 2
=============

-- Connection number #1
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #2
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #3
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #4
Instance -> rac11g12
Host -> rac2
Service -> rac11g1.domain.net
-- Connection number #5
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
Woke up, will continue now..

** Retrieving 5 connections from pool **
5 connections have been retrieved..

-- Connection number #1
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #2
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #3
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #4
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #5
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
There was one connection to rac11g12 instnace and this has been cleaned up and all the connections now point to rac11g11.

Now the 11.2 RACs first against 11.2.0.1
java support.au.fcf11g.FCFDemo
>> PROGRAM using JDBC thin driver, no oracle client required
>> ojdbc5.jar (for JDK 1.5) or ojdbc6.jar (for JDK 1.6) and ons.jar from $ORACLE_HOME/opmn/lib directory must be in the CLASSPATH
>> Press CNTRL C to exit running program

** Retrieving 5 connections from pool **
5 connections have been retrieved..

=============
Database Product Name is ... Oracle
Database Product Version is  Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
=============
JDBC Driver Name is ........ Oracle JDBC driver
JDBC Driver Version is ..... 11.2.0.2.0
JDBC Driver Major Version is 11
JDBC Driver Minor Version is 2
=============

-- Connection number #1
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #2
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #3
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #4
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #5
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
Woke up, will continue now..

** Retrieving 5 connections from pool **
5 connections have been retrieved..

-- Connection number #1
Error running query: No more data to read from socket
-- Connection number #2
Error running query: No more data to read from socket
-- Connection number #3
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #4
Error running query: No more data to read from socket
-- Connection number #5
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 3
-----------

Sleeping for 30 seconds..
As seen from the above output the FCF hasn't happened and an error is thrown on the java program.

Running against 11.2.0.2
java support.au.fcf11g.FCFDemo
>> PROGRAM using JDBC thin driver, no oracle client required
>> ojdbc5.jar (for JDK 1.5) or ojdbc6.jar (for JDK 1.6) and ons.jar from $ORACLE_HOME/opmn/lib directory must be in the CLASSPATH
>> Press CNTRL C to exit running program

** Retrieving 5 connections from pool **
5 connections have been retrieved..

=============
Database Product Name is ... Oracle
Database Product Version is  Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
=============
JDBC Driver Name is ........ Oracle JDBC driver
JDBC Driver Version is ..... 11.2.0.2.0
JDBC Driver Major Version is 11
JDBC Driver Minor Version is 2
=============

-- Connection number #1
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #2
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #3
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #4
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #5
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
Woke up, will continue now..

** Retrieving 5 connections from pool **
5 connections have been retrieved..

-- Connection number #1
Error running query: No more data to read from socket
-- Connection number #2
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #3
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #4
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #5
Error running query: No more data to read from socket

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 2
-----------

Sleeping for 30 seconds..
Since the FCF is not working when the application receives connections that are connected to the instnace that was shutdown error is thrown.

SR has been raised blog will be updated with the outcome.





Update 03 March 2011

Oracle has pointed out several metalink note regarding this. First it was stated this is expected behavior and mentioned in the metalink note
How To Use JDBC FCF Feature To Detect That a RAC Node Instance Has Been Shutdown ? [ID 364005.1]This error would have thrown if a node went down after a connection is taken from the pool and before doing some work with it. But the behavior change does not lie there but how connection pool cleans up invalid connections.

Two more metalink notes were pointed out Oracle Universal Connection Pool (UCP) and Deprecation of JDBC Implicit Connection Cache [ID 743726.1] and How to Verify Universal Connection Pool (UCP) / Fast Connection Failover (FCF) Setup [ID 1064652.1]

It seem with 11gR2 implicit connectoin caching is deprecated. Tests wer done using the code provided with 1064652.1 on 11gR2 and 11gR1 and still (with this whatever implicit caching mechanism present or not present on the jdbc side using UCP was common on both test cases) the difference on connection pool clean up is there. First test was on 11gR2 RAC.
----------- UCP Details ---------
NumberOfAvailableConnections: 4 <== 4 connection in the pool
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #13  -> instance[rac11g22], host[rac5], service[rac11g2.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 4
BorrowedConnectionsCount: 1
---------------------------------

====================== instance shutdown =============================

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Connection retry necessary : No more data to read from socket
** FCFTest :
** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Connection retry necessary : No more data to read from socket <== expected behavior on 11gR2
** FCFTest :
** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #16  -> instance[rac11g22], host[rac5], service[rac11g2.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 2 <== connection pool cleaned up of invalid connections
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #17  -> instance[rac11g22], host[rac5], service[rac11g2.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 2
BorrowedConnectionsCount: 1
---------------------------------
Now with 11gR1
java au.support.jdbc.scan.fcf.FCFTest
Started UCP FCF Test at Thu Mar 03 11:21:03 GMT 2011
** UCPPool : retrieveConnection
** FCFTest : Query #1  -> instance[rac11g12], host[rac2], service[rac11g1.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 4 <== 4 connections in the pool
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #2  -> instance[rac11g11], host[rac1], service[rac11g1.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 4
BorrowedConnectionsCount: 1
---------------------------------
================== instance shutdown and connection pool gets cleaned up immediately.

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #3  -> instance[rac11g12], host[rac2], service[rac11g1.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 1 <== pool is cleaned up immediately no connection retry
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #4  -> instance[rac11g12], host[rac2], service[rac11g1.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 1
BorrowedConnectionsCount: 1
---------------------------------
SR is ongoing will be updated.

Update 11 March 2011
Running the ucp test case with logging enabled showed the failover events in 11gR2 was missing the domain part of the service name whereas in 11gR1 the full service name including domain was sent. Logs below from the test for 11gR1
** UCPPool : connection returned to pool
2011-03-07T12:15:57.904+0000 UCP FINEST seq-43,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setEventType eventType: database/event/service
2011-03-07T12:15:57.904+0000 UCP FINEST seq-44,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.validateEventType eventType: database/event/service
2011-03-07T12:15:57.904+0000 UCP FINEST seq-45,thread-11 oracle.ucp.jdbc.oracle.ONSDatabaseFailoverEvent. eventType: database/event/service, eventBody: VERSION=1.0 service=rac11g1.domain.net instance=rac11g11 database=rac11g1 host=rac1 status=down reason=user
2011-03-07T12:15:57.905+0000 UCP FINEST seq-46,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setServiceName serviceName: rac11g1.domain.net
2011-03-07T12:15:57.905+0000 UCP FINEST seq-47,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setInstanceName instanceName: rac11g11
2011-03-07T12:15:57.905+0000 UCP FINEST seq-48,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setDbUniqueName dbUniqueName: rac11g1
2011-03-07T12:15:57.906+0000 UCP FINEST seq-49,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setHostName hostName: rac1
2011-03-07T12:15:57.906+0000 UCP FINEST seq-50,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setStatus status: down
2011-03-07T12:15:57.906+0000 UCP FINEST seq-51,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setReason reason: user
On 11gR2 the domain name is missing on the service name
** UCPPool : connection returned to pool
2011-03-07T12:41:47.951+0000 UCP FINEST seq-49,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setEventType eventType: database/event/service
2011-03-07T12:41:47.951+0000 UCP FINEST seq-50,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.validateEventType eventType: database/event/service
2011-03-07T12:41:47.952+0000 UCP FINEST seq-51,thread-11 oracle.ucp.jdbc.oracle.ONSDatabaseFailoverEvent. eventType: database/event/service, eventBody: VERSION=1.0 service=rac11g2 instance=rac11g21 database=rac11g2 host=rac4 status=down reason=USER
2011-03-07T12:41:47.952+0000 UCP FINEST seq-52,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setServiceName serviceName: rac11g2
2011-03-07T12:41:47.952+0000 UCP FINEST seq-53,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setInstanceName instanceName: rac11g21
2011-03-07T12:41:47.953+0000 UCP FINEST seq-54,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setDbUniqueName dbUniqueName: rac11g2
2011-03-07T12:41:47.953+0000 UCP FINEST seq-55,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setHostName hostName: rac4
2011-03-07T12:41:47.953+0000 UCP FINEST seq-56,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setStatus status: down
2011-03-07T12:41:47.954+0000 UCP FINEST seq-57,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setReason reason: user
According to Oracle Documentation provided throught the SRthis service name should match the service name in dba_services. Looking at this dba view
SQL> select service_id,name from dba_services;

SERVICE_ID NAME
---------- ------------
1 SYS$BACKGROUND
2 SYS$USERS
3 rac11g2sXDB
4 rac11g2s.domain.net
5 rac11g2XDB
6 rac11g2.domain.net

6 rows selected.

SQL> show parameter service

NAME TYPE VALUE
------------- ------ -------------------
service_names string rac11g2.domain.net
Pointed this out to oracle and was told this is related to "RDBMS Bug 9363352: SERVICE NAME HAS NO DOMAIN FOR UP/DOWN EVENTS, closed as not a bug" and at this point Oracle couldn't reproduce the error seen above (No more data to read from socket). Looking at the service name oracle is using noticed that it is a service name without a domain.

So created service name without a domain and tested.
srvctl add service -d rac11g2 -s nodomain -r "rac11g21,rac11g22"

srvctl start service -d rac11g2 -s nodomain

srvctl status service -d rac11g2 -s nodomain
Service nodomain is running on instance(s) rac11g21,rac11g22

SQL> show parameter service

NAME TYPE VALUE
------------- ------ ----------------------------
service_names string rac11g2.domain.net,nodomain

SQL> select name from dba_services;

NAME
---------------------
SYS$BACKGROUND
SYS$USERS
rac11g2sXDB
rac11g2s.domain.net
rac11g2XDB
rac11g2.domain.net
nodomain

$ lsnrctl status

LSNRCTL for Linux: Version 11.2.0.2.0 - Production on 11-MAR-2011 10:34:57

Copyright (c) 1991, 2010, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 11.2.0.2.0 - Production
Start Date                07-MAR-2011 12:48:29
Uptime                    3 days 21 hr. 46 min. 27 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /opt/app/11.2.0/grid/network/admin/listener.ora
Listener Log File         /opt/app/oracle/diag/tnslsnr/rac4/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.86)(PORT=1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.90)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
Instance "+ASM1", status READY, has 1 handler(s) for this service...
Service "nodomain.domain.net" has 1 instance(s).
Instance "rac11g21", status READY, has 1 handler(s) for this service...
Service "rac11g2.domain.net" has 2 instance(s).
Instance "rac11g21", status UNKNOWN, has 1 handler(s) for this service...
Instance "rac11g21", status READY, has 1 handler(s) for this service...
Ran the test with logging enabled. This time the service name has the domain name even though it was created without a one and connection pool get refresh and validated quickly just as it was with 11gR1
** UCPPool : connection returned to pool
2011-03-11T10:25:00.022+0000 UCP FINEST seq-75,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setEventType eventType: database/event/service
2011-03-11T10:25:00.022+0000 UCP FINEST seq-76,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.validateEventType eventType: database/event/service
2011-03-11T10:25:00.022+0000 UCP FINEST seq-77,thread-11 oracle.ucp.jdbc.oracle.ONSDatabaseFailoverEvent. eventType: database/event/service, eventBody: VERSION=1.0 service=nodomain.domain.net instance=rac11g22 database=rac11g2 host=rac5 status=down reason=USER
2011-03-11T10:25:00.023+0000 UCP FINEST seq-78,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setServiceName serviceName: nodomain.domain.net
2011-03-11T10:25:00.023+0000 UCP FINEST seq-79,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setInstanceName instanceName: rac11g22
2011-03-11T10:25:00.023+0000 UCP FINEST seq-80,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setDbUniqueName dbUniqueName: rac11g2
2011-03-11T10:25:00.024+0000 UCP FINEST seq-81,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setHostName hostName: rac5
2011-03-11T10:25:00.024+0000 UCP FINEST seq-82,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setStatus status: down
2011-03-11T10:25:00.024+0000 UCP FINEST seq-83,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setReason reason: user
Waiting for Oracle reply on this.

Update 06 April 2011
Oracle came back with this is similar to "unpublished bug 9740127 fixed in version 12" also mentioned couple of bugs 8779597 and 9740127 and pointed out that if a application service was created with domain name (instead of using the default database service) failover works as expected.

So created an application service (additional service as before)
srvctl add service -d rac11g2 -s withdomain.domain.net -r "rac11g21,rac11g22"
Ran the test again and the connection failover was as before.
** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #9  -> instance[rac11g21], host[rac4], service[withdomain.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 4
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #10  -> instance[rac11g21], host[rac4], service[withdomain.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 0
BorrowedConnectionsCount: 1
---------------------------------
In other versions (10gR2 and 11gR1) default service with or without domain would work fine in a failover situation.
Created another database without a domain name and ran the test again and saw that failover wasn't working as expected.
** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #181  -> instance[testdb1], host[rac4], service[testdb]

----------- UCP Details ---------
NumberOfAvailableConnections: 1
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Connection retry necessary : No more data to read from socket
** FCFTest : Apr 6, 2011 2:14 PM SUCCESS     Connections:(Available=1 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=1 upEventNewConns=1
Apr 6, 2011 2:14 PM SUCCESS      Connections:(Available=2 Affected=1 FailedToProcess=0 MarkedDown=1 Closed=1)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:13 PM SUCCESS     Connections:(Available=1 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=1 upEventNewConns=1
Apr 6, 2011 2:12 PM SUCCESS      Connections:(Available=1 Affected=1 FailedToProcess=0 MarkedDown=1 Closed=1)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:12 PM SUCCESS     Connections:(Available=1 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=1 upEventNewConns=1
Apr 6, 2011 2:11 PM SUCCESS      Connections:(Available=2 Affected=1 FailedToProcess=0 MarkedDown=1 Closed=1)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:11 PM SUCCESS     Connections:(Available=1 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=1 upEventNewConns=1
Apr 6, 2011 2:10 PM SUCCESS      Connections:(Available=3 Affected=2 FailedToProcess=0 MarkedDown=2 Closed=2)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:10 PM SUCCESS     Connections:(Available=2 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=2 upEventNewConns=2
Apr 6, 2011 2:09 PM SUCCESS      Connections:(Available=6 Affected=4 FailedToProcess=0 MarkedDown=4 Closed=4)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:09 PM SUCCESS     Connections:(Available=3 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=3 upEventNewConns=3
Apr 6, 2011 2:08 PM SUCCESS      Connections:(Available=5 Affected=2 FailedToProcess=0 MarkedDown=2 Closed=2)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #183  -> instance[testdb2], host[rac5], service[testdb]

----------- UCP Details ---------
NumberOfAvailableConnections: 0
BorrowedConnectionsCount: 1
---------------------------------
Oracle confirmed this is related to using default database service and suggested to create and use application services instead of the default database service.

Update 12 March 2012
UCP delay in processing ONS events during data guard or RAC failover using FCF [ID 1380527.1]
Apply 12596492 patch (available for 11.2.0.2 and 11.2.0.3)

Friday, July 16, 2010

ONS configuration doesn't get the full hostname

On Oracle 11gR1 clusterware after a fresh installation (11.1.0.6) and even after upgrading to 11.1.0.7 clusterware, the ONS configuration doesn't have the full hostname given by the hostname command on linux. Even if the full hostname is given during the clusterware installation. Only the 'short' hostname will appear without the domain values.
$CRS_HOME/bin/onsctl ping
Number of onsconfiguration retrieved, numcfg = 2
onscfg[0]
{node = db01, port = 6251}
Adding remote host db01:6251
onscfg[1]
{node = db02, port = 6251}
Adding remote host db02:6251
ons is running ...

$ hostname
db02.edsa.local
ONS config information could also be seen in the ocrdump as well. This mismatch is noted in metalink notes 744849.1,754619.1 and 759895.1. It is important that hostname in the ons config and hostname given by the linux hostname command match. It is vital for Fast Connection Failover (FCF metalink note 566573.1) used in java applications with jdbc.
If the hostname is a mismatch the FCF will not work.
To fix it simply remove the invalid hostname entry and add the full hostname
# racgons remove_config short_hostname
# racgons add_config full_hostname:6251