Thursday, March 24, 2011

Stuck Archiver Processes and FAL Gap Resolution Not Working

On a RAC to RAC data guard configuration (11.1.0.7) network wait class and waits on LGWR LNS seem to appear out of the blue. System has been running for a quite a number of years and there has not been any changes to network or any other hardware (NIC, cables and etc). Output from the emconsole
Drilling down to "other" wait class could see following LGWR LNS wait
The wait histogram showed a constant value of 16ms
Following was observed in the network wait class drill down
Metalink note Data Guard Wait Events (233491.1) describes these wait event as "ARCH wait on ATTACH - This wait event monitors the amount of time spent by all archive processes to spawn an RFS connection. The LGWR-LNS wait on channel wait event is for standby destinations configured with either the LGWR ASYNC or LGWR SYNC=PARALLEL attributes.
LGWR-LNS wait on channel - This wait event monitors the amount of time spent by the log writer (LGWR) process or the network server processes waiting to receive messages on KSR channels.
"

During this time there was a considerable archive gap between primary and standby and FAL gap resolution seems unable to resolve it. (FAL use to work fine).

The thought of making changes to transport related values in Oracle net (send and receive buffer) was suppressed since there has not been any hardware changes.
Metalink didn't give any more than definition for these waits.
Googling yield a forum post with a mention of metalink note Bug 5576816 - FAL gap resolution does not work with max_connection set in some scenario. This was applicable for 10.2.0.3 not 11.1.0.7 but the recommendation on the posting (found through googling) was to kill all the archive processes in primary(these will get restated as soon as they get killed). Reason given was that these archive processes were "stuck" and need restart. Looking at the emconsole it was also visible waits were happening on the archive processes.

Tried to kill the archive processors "proper way" by changing the log_archive_max_processes to 1 but this didn't kill any of the processes. Even after setting it to 1 all the archive processes were running. Then did a rolling shutdown and start up of the primary which resolved the issue.

Unfortunately the issue was back after few days on one of the nodes. This time killing the Oracle database session of the archive processes waiting for these wait events resolved it. It seem the archive processes being "stuck" is the symptom and cause could be something else.
Blog post will be updated ...

Friday, March 18, 2011

Remap Column Data on the Fly using EXPDP

Remap data is a new feature introduced in 11g releases which allows data on a column to be changed on the fly when exporting (expdp).

Few things to note. The help option gives the usage of remap_data as
REMAP_DATA
Specify a data conversion function.
For example, REMAP_DATA=EMP.EMPNO:REMAPPKG.EMPNO.
It shows only the table name and column name is required but this result in following error
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORA-39001: invalid argument value
ORA-39146: schema "" does not exist
Schema name is also required.

Also the function used to remap the data must accept the column type as argument otherwise following error will be thrown
ORA-31693: Table data object "ASANGA"."X" failed to load/unload and is being skipped due to error:
ORA-06553: PLS-306: wrong number or types of arguments in call to function_name
Also the function used must be a function within a package. Schema level function will not be used and following error will be shown
ORA-39001: invalid argument value
ORA-39232: invalid remap function: DUMP_OBFUSCATE
Below is an example package function that could be used
create or replace package obfuscate as

function dump_obfuscate(ssn in number) return number;

end;
/

create or replace package body obfuscate as

function dump_obfuscate(ssn in number) return number
is
i number;
begin
i := dbms_random.value(0,100);
return round(i);
end;

end;
/
Use the expdp syntax as
expdp asanga/asa directory=dumpdir dumpfile=asa.dmp logfile=asa.log SCHEMAS=asanga
REMAP_DATA=asanga.x.a:obfuscate.dump_obfuscate
It is also possible to remap data in multiple tables using multiple remap functions.
expdp asanga/*** directory=dumpdir dumpfile=asa.dmp logfile=asa.log SCHEMAS=asanga
REMAP_DATA=asanga.x.a:obfuscate.dump_obfuscate,asanga.y.a:obfuscate1.dump_obfuscate1


Friday, March 11, 2011

Creating and Removing ACFS

Oracle Automatic Storage Management Cluster File System (Oracle ACFS) is a multi-platform, scalable file system, and storage management technology that extends Oracle Automatic Storage Management (Oracle ASM) functionality to support customer files maintained outside of Oracle Database.

Creating ACFS


1. Set the diskgroup attribute parameters. Compatible.asm and compatible.advm should be >= 11.2
$ sqlplus  / as sysasm
SQL> alter diskgroup flash set attribute 'compatible.asm'='11.2';
SQL> alter diskgroup flash set attribute 'compatible.advm'='11.2';
2. Create a volume in the appropriate diskgroup
SQL> alter diskgroup flash add volume volume1 size 1g;
This could also be done using asmcmd. Volume device could be identified with
select volume_name,volume_device from v$asm_volume;
VOLUME_NAME VOLUME_DEVICE
-------------- ---------------------
VOLUMNE1 /dev/asm/volumne1-398
This could also confirmed with ls /dev/asm/*

3. Create a file system on the created volume
# mkfs -t acfs /dev/asm/volumne1-398
mkfs.acfs: version = 11.2.0.2.0
mkfs.acfs: on-disk version = 39.0
mkfs.acfs: volume = /dev/asm/volumne1-398
mkfs.acfs: volume size = 1073741824
mkfs.acfs: Format complete.
(Update 2013/10/30)
mkfs could also be run as grid user provided that user has the permission on the device
[grid@rhel6m1 ~]$ ls -l /dev/asm/volume1-149
brwxrwx---. 1 root asmadmin 252, 76289 Oct 30 15:31 /dev/asm/volume1-149
[grid@rhel6m1 ~]$  mkfs -t acfs /dev/asm/volume1-149
mkfs.acfs: version                   = 11.2.0.3.0
mkfs.acfs: on-disk version           = 39.0
mkfs.acfs: volume                    = /dev/asm/volume1-149
mkfs.acfs: volume size               = 1073741824
mkfs.acfs: Format complete.
4. Register the file system with
acfsutil registry -a /dev/asm/volumne1-398 /opt/acfsvol
/opt/acfsvol will be the mountpoint for this file system.

5. Mount the file system
# mount -t acfs /dev/asm/volumne1-398 /opt/acfsvol

ASMCMD> volinfo -G flash VOLUMNE1
Diskgroup Name: FLASH

Volume Name: VOLUMNE1
Volume Device: /dev/asm/volumne1-398
State: ENABLED
Size (MB): 1024
Resize Unit (MB): 256
Redundancy: UNPROT
Stripe Columns: 4
Stripe Width (K): 128
Usage: ACFS
Mountpath: /opt/acfsvol
6. Change ownership to Oracle user to allow oracle user processes to use the file system
# chown oracle:oinstall /opt/acfsvol
6. As oracle user create a file in the mount point
cd /opt/acfsvol
touch x
In a RAC system the volume will be mounted on all nodes after server reboots and requires selinux to be permissive.




Removing ACFS

1. Unmount the file system
# umount /opt/acfsvol
2. De-register the file system
# acfsutil registry -d /opt/acfsvol
acfsutil registry: successfully removed ACFS mount point /opt/acfsvol from Oracle Registry
3. Drop the volumne from the diskgroup
SQL> alter diskgroup flash drop volume volume1;

Diskgroup altered.

SQL> select volume_name,volume_device from v$asm_volume;

no rows selected

Thursday, March 10, 2011

11gR2 RAC to RAC Data Guard - 2

The first 11gR2 RAC to RAC Data Guard described the process of setting up a Data Guard between two node RAC. This blog is about another way to achieve the duplication of primary database, and was inspired by the second method of duplication given in RAC to single instance active duplication.

It is assumed all the preparatory work has been carried as described in 11gR2 RAC to RAC Data Guard. The steps below would replace step 13 in the above blog.

1. Instead of creating a pfile with just db_name create a complete pfile that will suite the standby database when it is fully operational. This could be done by creating a pfile on the primary (once data gurad related initialization parameters are set) and modifying it to suite the standby. Below is the one used in this case. The key difference is for control_files only the diskgroup locations are set
cat initrac11g2s1.ora
rac11g2s1.__db_cache_size=373293056
rac11g2s1.__java_pool_size=4194304
rac11g2s1.__large_pool_size=4194304
rac11g2s1.__oracle_base='/opt/app/oracle'#ORACLE_BASE set from environment
rac11g2s1.__pga_aggregate_target=209715200
rac11g2s1.__sga_target=633339904
rac11g2s1.__shared_io_pool_size=0
rac11g2s1.__shared_pool_size=239075328
rac11g2s1.__streams_pool_size=0
rac11g2s2.__db_cache_size=373293056
rac11g2s2.__java_pool_size=4194304
rac11g2s2.__large_pool_size=4194304
rac11g2s2.__oracle_base='/opt/app/oracle'#ORACLE_BASE set from environment
rac11g2s2.__pga_aggregate_target=209715200
rac11g2s2.__sga_target=633339904
rac11g2s2.__shared_io_pool_size=0
rac11g2s2.__shared_pool_size=239075328
rac11g2s2.__streams_pool_size=0
*.audit_file_dest='/opt/app/oracle/admin/rac11g2s/adump'
*.audit_trail='db'
*.cluster_database=true
*.compatible='11.2.0.0.0'
*.control_files='+DATA','+FLASH'
*.db_block_size=8192
*.db_create_file_dest='+DATA'
*.db_domain='domain.net'
*.db_file_name_convert='+DATA/rac11g2','+DATA/rac11g2s'
*.db_name='rac11g2'
*.db_recovery_file_dest='+FLASH'
*.db_recovery_file_dest_size=9470738432
*.db_unique_name='rac11g2s'
*.diagnostic_dest='/opt/app/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=rac11g2sXDB)'
rac11g2s1.fal_client='RAC11G2S1'
rac1g2s22.fal_client='RAC11G2S2'
*.fal_server='RAC11G21','RAC11G22'
rac11g2s1.instance_number=1
rac11g2s2.instance_number=2
*.log_archive_config='dg_config=(rac11g2,rac11g2s)'
*.log_archive_dest_2='service=RAC11G2 LGWR ASYNC NOAFFIRM max_failure=10 max_connections=5 reopen=180 valid_for=(online_logfiles,primary_role) db_uniq
ue_name=rac11g2'
*.log_archive_dest_1='location=use_db_recovery_file_dest valid_for=(all_logfiles,all_roles) db_unique_name=rac11g2s'
*.log_archive_dest_state_1='enable'
*.log_archive_dest_state_2='enable'
*.log_archive_format='%t_%s_%r.dbf'
*.log_archive_max_processes=5
*.log_file_name_convert='+FLASH/rac11g2','+FLASH/rac11g2s','+DATA/rac11g2','+DATA/rac11g2s'
*.open_cursors=300
*.pga_aggregate_target=209715200
*.processes=150
*.remote_listener='racb-scan:1521'
*.remote_login_passwordfile='exclusive'
*.sga_target=631242752
*.standby_file_management='AUTO'
rac11g2s1.undo_tablespace='UNDOTBS1'
rac11g2s2.undo_tablespace='UNDOTBS2'
2. The duplication command is now very simple as all the parameters have been set. The duplication command on step 17 in earlier blog could be replaced with below command. Also multiple rman channels could be enabled to take advantage of parallel file duplication.
run {
allocate channel ch1 device type disk connect 'sys/rac11g2db@rac11g21';
allocate channel ch2 device type disk connect 'sys/rac11g2db@rac11g22';

allocate auxiliary channel ch3 device type disk connect 'sys/rac11g2db@rac11g2s1';
allocate auxiliary channel ch4 device type disk connect 'sys/rac11g2db@rac11g2s1';

duplicate target database for standby from active database;

release channel ch1;
release channel ch2;
release channel ch3;
release channel ch4;

}
Output of executing the above command is given below.
rman target / auxiliary sys/rac11g2db@rac11g2s1

Recovery Manager: Release 11.2.0.2.0 - Production on Thu Mar 10 11:39:52 2011
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
connected to target database: RAC11G2 (DBID=379668842)
connected to auxiliary database: RAC11G2 (not mounted)

RMAN> run {
2> allocate channel ch1 device type disk connect 'sys/rac11g2db@rac11g21';
3> allocate channel ch2 device type disk connect 'sys/rac11g2db@rac11g22';
4>
5> allocate auxiliary channel ch3 device type disk connect 'sys/rac11g2db@rac11g2s1';
6> allocate auxiliary channel ch4 device type disk connect 'sys/rac11g2db@rac11g2s1';
7>
8> duplicate target database for standby from active database;
9>
10> release channel ch1;
11> release channel ch2;
12> release channel ch3;
13> release channel ch4;
14>
15> }

using target database control file instead of recovery catalog
allocated channel: ch1
channel ch1: SID=156 instance=rac11g21 device type=DISK

allocated channel: ch2
channel ch2: SID=26 instance=rac11g22 device type=DISK

allocated channel: ch3
channel ch3: SID=141 instance=rac11g2s1 device type=DISK

allocated channel: ch4
channel ch4: SID=19 instance=rac11g2s1 device type=DISK

Starting Duplicate Db at 10-MAR-11

contents of Memory Script:
{
backup as copy reuse
targetfile '/opt/app/oracle/product/11.2.0/db_1/dbs/orapwrac11g21' auxiliary format
'/opt/app/oracle/product/11.2.0/db_1/dbs/orapwrac11g2s1' ;
}
executing Memory Script

Starting backup at 10-MAR-11
Finished backup at 10-MAR-11

contents of Memory Script:
{
backup as copy current controlfile for standby auxiliary format '+DATA/rac11g2s/controlfile/current.257.745418483';
restore clone controlfile to '+FLASH/rac11g2s/controlfile/current.257.745418483' from
'+DATA/rac11g2s/controlfile/current.257.745418483';
sql clone "create spfile from memory";
shutdown clone immediate;
startup clone nomount;
sql clone "alter system set control_files =
''+DATA/rac11g2s/controlfile/current.257.745418483'', ''+FLASH/rac11g2s/controlfile/current.257.745418483'' comment=
''Set by RMAN'' scope=spfile";
shutdown clone immediate;
startup clone nomount;
}
executing Memory Script

Starting backup at 10-MAR-11
channel ch1: starting datafile copy
copying standby control file
output file name=/opt/app/oracle/product/11.2.0/db_1/dbs/snapcf_rac11g21.f tag=TAG20110310T114022 RECID=2 STAMP=745414824
channel ch1: datafile copy complete, elapsed time: 00:00:07
Finished backup at 10-MAR-11

Starting restore at 10-MAR-11

channel ch4: skipped, AUTOBACKUP already found
channel ch3: copied control file copy
Finished restore at 10-MAR-11

sql statement: create spfile from memory

Oracle instance shut down

connected to auxiliary database (not started)
Oracle instance started

Total System Global Area 630501376 bytes

Fixed Size 2229120 bytes
Variable Size 251661440 bytes
Database Buffers 369098752 bytes
Redo Buffers 7512064 bytes
allocated channel: ch3
channel ch3: SID=18 instance=rac11g2s1 device type=DISK
allocated channel: ch4
channel ch4: SID=141 instance=rac11g2s1 device type=DISK

sql statement: alter system set control_files = ''+DATA/rac11g2s/controlfile/current.257.745418483'', ''+FLASH/rac11g2s/controlfile/current.257.745418483'' comment= ''Set by RMAN'' scope=spfile

Oracle instance shut down

connected to auxiliary database (not started)
Oracle instance started

Total System Global Area 630501376 bytes

Fixed Size 2229120 bytes
Variable Size 251661440 bytes
Database Buffers 369098752 bytes
Redo Buffers 7512064 bytes
allocated channel: ch3
channel ch3: SID=16 instance=rac11g2s1 device type=DISK
allocated channel: ch4
channel ch4: SID=143 instance=rac11g2s1 device type=DISK

contents of Memory Script:
{
sql clone 'alter database mount standby database';
}
executing Memory Script

sql statement: alter database mount standby database
RMAN-05529: WARNING: DB_FILE_NAME_CONVERT resulted in invalid ASM names; names changed to disk group only.

contents of Memory Script:
{
set newname for tempfile 1 to
"+data";
switch clone tempfile all;
set newname for datafile 1 to
"+data";
set newname for datafile 2 to
"+data";
set newname for datafile 3 to
"+data";
set newname for datafile 4 to
"+data";
set newname for datafile 5 to
"+data";
backup as copy reuse
datafile 1 auxiliary format
"+data" datafile
2 auxiliary format
"+data" datafile
3 auxiliary format
"+data" datafile
4 auxiliary format
"+data" datafile
5 auxiliary format
"+data" ;
sql 'alter system archive log current';
}
executing Memory Script

executing command: SET NEWNAME

renamed tempfile 1 to +data in control file

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

Starting backup at 10-MAR-11
channel ch1: starting datafile copy
input datafile file number=00002 name=+DATA/rac11g2/datafile/sysaux.257.740770047
channel ch2: starting datafile copy
input datafile file number=00001 name=+DATA/rac11g2/datafile/system.256.740770045
output file name=+DATA/rac11g2s/datafile/system.264.745418537 tag=TAG20110310T114116
channel ch2: datafile copy complete, elapsed time: 00:02:27
channel ch2: starting datafile copy
input datafile file number=00003 name=+DATA/rac11g2/datafile/undotbs1.258.740770049
output file name=+DATA/rac11g2s/datafile/sysaux.275.745418537 tag=TAG20110310T114116
channel ch1: datafile copy complete, elapsed time: 00:02:34
channel ch1: starting datafile copy
input datafile file number=00005 name=+DATA/rac11g2/datafile/undotbs2.264.740770355
output file name=+DATA/rac11g2s/datafile/undotbs1.274.745418685 tag=TAG20110310T114116
channel ch2: datafile copy complete, elapsed time: 00:00:35
channel ch2: starting datafile copy
input datafile file number=00004 name=+DATA/rac11g2/datafile/users.259.740770049
output file name=+DATA/rac11g2s/datafile/undotbs2.273.745418693 tag=TAG20110310T114116
channel ch1: datafile copy complete, elapsed time: 00:00:31
output file name=+DATA/rac11g2s/datafile/users.272.745418721 tag=TAG20110310T114116
channel ch2: datafile copy complete, elapsed time: 00:00:03
Finished backup at 10-MAR-11

sql statement: alter system archive log current

contents of Memory Script:
{
switch clone datafile all;
}
executing Memory Script

datafile 1 switched to datafile copy
input datafile copy RECID=2 STAMP=745418726 file name=+DATA/rac11g2s/datafile/system.264.745418537
datafile 2 switched to datafile copy
input datafile copy RECID=3 STAMP=745418726 file name=+DATA/rac11g2s/datafile/sysaux.275.745418537
datafile 3 switched to datafile copy
input datafile copy RECID=4 STAMP=745418726 file name=+DATA/rac11g2s/datafile/undotbs1.274.745418685
datafile 4 switched to datafile copy
input datafile copy RECID=5 STAMP=745418726 file name=+DATA/rac11g2s/datafile/users.272.745418721
datafile 5 switched to datafile copy
input datafile copy RECID=6 STAMP=745418726 file name=+DATA/rac11g2s/datafile/undotbs2.273.745418693
Finished Duplicate Db at 10-MAR-11

released channel: ch1

released channel: ch2

released channel: ch3

released channel: ch4
3. Since spfile is not copied from the primary the spfile generated during the duplication won't have any primary db related initializaton parametrs but it will have all the hidden parameters listed same as in rac to single instance duplication. Create a pfile from this genearted spfile and copy the control_files entry added by rman during duplication to initial pfile (initrac11g2s1.ora). During the duplication the control_files entry would have the full OMF path to the controlfiles.
*.cluster_database=TRUE
*.compatible='11.2.0.0.0'
*.control_files='+DATA/rac11g2s/controlfile/current.257.745423373','+FLASH/rac11g2s/controlfile/current.257.745423373'#Set by RMAN
*.core_dump_dest='/opt/app/oracle/diag/rdbms/rac11g2s/rac11g2s1/cdump'
*.db_block_size=8192
4. Once the above control_files line is copied to the initial pfile (initrac11g2s1.ora) shutdown the database and start the database in mount mode using this pfile.
shutdown immediate;
startup mount pfile='?/dbs/initrac11g2s1.ora';
5. Create the spfile for all instances with
SQL> create spfile='+DATA/rac11g2s/spfilerac11g2s.ora' from pfile='/opt/app/oracle/product/11.2.0/db_1/dbs/initrac11g2s1.ora';
For making the standby database cluster aware continue from step 22 onwards in the earlier blog.

Wednesday, March 9, 2011

Upgrading from 11.2.0.1 to 11.2.0.2

Updated 14th October 2011 : More metalink notes

Unlike patchsets of previous releases 11.2 patchsets are full releases, which means they could be used for a fresh installation (no need to go from 11.2.0.1 to 11.2.0.2).
Following text is from various Oracle documentations.

"Starting with the 11.2.0.2 patch set, Oracle Database patch sets are full installations of the Oracle Database software. This means that you do not need to install Oracle Database 11g Release 2 (11.2.0.1) before installing Oracle Database 11g Release 2 (11.2.0.2).

Note the following changes with the new patch set packaging:

New installations consist of installing the most recent patch set, rather than installing a base release and then upgrading to a patch release.

Direct upgrades from previous releases to the most recent patch set are supported.

Out-of-place patch set upgrades recommended, in which you install the patch set into a new, separate Oracle home. In-place upgrades are supported, but not recommended.

Oracle Clusterware and Oracle ASM upgrades are always out-of-place upgrades. With 11g release 2 (11.2), you cannot perform an in-place upgrade of Oracle Clusterware and Oracle ASM to existing homes.

Please note that 11.2 Patch Sets 11.2.0.2 and higher are supplied as full releases. See Note:1189783.1 for details. Also note that 11.2.0.2 software was reissued on 17th Nov 2010 as described in Note:1266978.1 (1179474.1)
"

Patchset comes in serveral bundles and don't have to download all of them unless required.

Oracle Database (includes Oracle Database and Oracle RAC) Note: you must download both zip files to install Oracle Database.
p10098816_112020_platform_1of7.zip
p10098816_112020_platform_2of7.zip

Oracle Grid Infrastructure (includes Oracle ASM, Oracle Clusterware, and Oracle Restart)
p10098816_112020_platform_3of7.zip

Oracle Database Client
p10098816_112020_platform_4of7.zip

Oracle Gateways
p10098816_112020_platform_5of7.zip

Oracle Examples
p10098816_112020_platform_6of7.zip

Deinstall
p10098816_112020_platform_7of7.zip


Another change is that patchsets don't have a readme.html or similar document with how to apply the patchset (which was avaibale with previous releases). For upgrade information Oracle (GI / Database) Administrator Guides must be refered. Admin Guide states "To upgrade existing 11.2.0.1 Oracle Grid Infrastructure installations to Oracle Grid Infrastructure 11.2.0.2 using a rolling upgrade, you must first do one of the following, depending on your platform:

Patch the release 11.2.0.1 Oracle Grid Infrastructure home with the 9413827 patch, and install Oracle Grid Infrastructure Patchset Update 1 (GI PSU1). When you apply patch 9413827, it shows up in the inventory as GIPSU2 bug 9655006.

Install Oracle Grid Infrastructure Patchset Update 2 (GI PSU2), which includes the 9413827 patch.
"

Failure to apply these patches could result in number of issues and these have been listed in following metalink notes.

Bug 9413827 - CRS rolling upgrade from 11.2.0.1 to 11.2.0.2 fails with OCR on ASM [ID 9413827.8]

Bug 10036834 - Linux Platforms: Patches not found upgrading Grid Infrastructure from 11.2.0.1 to 11.2.0.2 [ID 10036834.8]

Pre-requsite for 11.2.0.1 to 11.2.0.2 ASM Rolling Upgrade [ID 1274629.1]



If "Patch 9655006 - 11.2.0.1.2 for Grid Infrastructure (GI) Patch Set Update" is applied on the GI home then applying 9413827 or 9706490 (as per ID 1274629.1) will always try to rollback the 9655006.With 9413827
opatch napply -local -oh $CRS_HOME -id 9413827
..
..
Conflicts/Supersets for each patch are:

Patch : 9413827

Bug Superset of 9655006
Super set bugs are:
9655006,  9778840,  9343627,  9783609,  9262748,  9262722

Patches [   9655006 ] will be rolled back.
This is a conflict with the information given on 10036834.8. With 9706490
opatch napply -local -oh  $CRS_HOME -id 9706490
..
..
Conflicts/Supersets for each patch are:

Patch : 9706490

Bug Superset of 9655006
Super set bugs are:
9655006,  9778840,  9343627,  9783609,  9262748,  9262722

Patches [   9655006 ] will be rolled back.
For this upgrade the base installation of 11.2.0.1 was applied with the GI PSU 2 (Patch 9655006 - 11.2.0.1.2 for Grid Infrastructure (GI) Patch Set Update) and then 9706490 and finally 11.2.0.1.4 Patch Set Update was applied to the Oracle Home (Not GI Home). Steps below are from this point onwards.

Unzip the grid infrastructure patchset and execute runInstaller. Since this is an out of place upgrade the current cluster could be up and running.

On the second step select upgrade GI and ASM.
Select the nodes for upgrade, for this test it's a single node cluster

Select a new location to install the patchset.

Fix any pre-req warning and start the installation

After the installation is complete rootupgrade script needs to be run on the nodes involved. Up until this the cluster was running and during rootupgrade it will be brought down (via the rootupgrade script, no manual shutdown is required) and backup again once the upgrade is complete.

/flash/11.2.0/grid11.2.0.2/rootupgrade.sh
Running Oracle 11g root script...

The following environment variables are set as:
ORACLE_OWNER= oracle
ORACLE_HOME=  /flash/11.2.0/grid11.2.0.2

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The file "oraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]: y
Copying oraenv to /usr/local/bin ...
The file "coraenv" already exists in /usr/local/bin.  Overwrite it? (y/n)
[n]: y
Copying coraenv to /usr/local/bin ...

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /flash/11.2.0/grid11.2.0.2/crs/install/crsconfig_params
Creating trace directory
Failed to add (property/value):('OLD_OCR_ID/'-1') for checkpoint:ROOTCRS_OLDHOMEINFO.Error code is 256

ASM upgrade has started on first node.

CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc3'
CRS-2673: Attempting to stop 'ora.crsd' on 'hpc3'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'hpc3'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'hpc3'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'hpc3'
CRS-2673: Attempting to stop 'ora.CLUSTERDG.dg' on 'hpc3'
CRS-2673: Attempting to stop 'ora.clusdb.db' on 'hpc3'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'hpc3'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.hpc3.vip' on 'hpc3'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'hpc3'
CRS-2677: Stop of 'ora.hpc3.vip' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.scan1.vip' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.registry.acfs' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.CLUSTERDG.dg' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.clusdb.db' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.DATA.dg' on 'hpc3'
CRS-2673: Attempting to stop 'ora.FLASH.dg' on 'hpc3'
CRS-2677: Stop of 'ora.DATA.dg' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.FLASH.dg' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'hpc3'
CRS-2677: Stop of 'ora.asm' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'hpc3'
CRS-2673: Attempting to stop 'ora.eons' on 'hpc3'
CRS-2677: Stop of 'ora.ons' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'hpc3'
CRS-2677: Stop of 'ora.net1.network' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.eons' on 'hpc3' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'hpc3' has completed
CRS-2677: Stop of 'ora.crsd' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc3'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc3'
CRS-2673: Attempting to stop 'ora.evmd' on 'hpc3'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc3'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc3'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc3'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.evmd' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc3'
CRS-2677: Stop of 'ora.cssd' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc3'
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc3'
CRS-2677: Stop of 'ora.gpnpd' on 'hpc3' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc3'
CRS-2677: Stop of 'ora.gipcd' on 'hpc3' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc3' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc3' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully deleted 1 keys from OCR.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
OLR initialization - successful
Adding daemon to inittab
ACFS-9200: Supported
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9312: Existing ADVM/ACFS installation detected.
ACFS-9314: Removing previous ADVM/ACFS installation.
ACFS-9315: Previous ADVM/ACFS components successfully removed.
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9309: ADVM/ACFS installation correctness verified.
clscfg: EXISTING configuration version 5 detected.
clscfg: version 5 is 11g Release 2.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Started to upgrade the Oracle Clusterware. This operation may take a few minutes.
Started to upgrade the CSS.
Started to upgrade the CRS.
The CRS was successfully upgraded.
Oracle Clusterware operating version was successfully set to 11.2.0.2.0
ASM upgrade has finished on last node.

Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded
The error line
Failed to add (property/value):('OLD_OCR_ID/'-1') for checkpoint:ROOTCRS_OLDHOMEINFO.Error code is 256
is similar to part of the error message give on 10036834.8 but after a SR was raised Oracle informed this is due to "Bug 10056593: FAIL TO ADD OLD_OCR_ID PROPERTY FOR ROOTCRS_OLDHOMEINFO. As per this bug the message can be ignored. You can continue with the installation."

Following commands shows that GI has been upgraded to 11.2.0.2
crsctl query crs softwareversion
Oracle Clusterware version on node [hpc3] is [11.2.0.2.0]

crsctl query crs releaseversion
Oracle High Availability Services release version on the local node is [11.2.0.2.0]

crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [11.2.0.2.0]
It is important that if GI Home has been set on environment variables and PATH variable then to change it to point to new home, if not commands would be running out of the "old" GI Home.

Check the inventory.xml to see if the new GI Home has been set properly. Before upgrade
<HOME NAME="Ora11g_gridinfrahome1" LOC="/flash/11.2.0/grid11.2.0.1" TYPE="O" IDX="1" CRS="true">
<NODE_LIST>
<NODE NAME="hpc3"/>
</NODE_LIST>
</HOME>
After upgrade
<HOME NAME="Ora11g_gridinfrahome1" LOC="/flash/11.2.0/grid11.2.0.2" TYPE="O" IDX="1" CRS="true">
<NODE_LIST>
<NODE NAME="hpc3"/>
</NODE_LIST>
</HOME>



Next upgrade the Oracle Home and the database. With 11.2 patchset this could be done in one step where software is installed and at the end dbua will get run from the same session (no need to manually run dbua). Same as GI Home it is better to have a out of place upgrade for Oralce Home, trying to install on the same location would result in the following.

After running runInstaller from database patchset home, select upgrade existing database option

Select the nodes involved and start the install.

After the software has been installed (roughly around 86% mark on the progress bar) dbua would get run prompting to upgrade the database. This will have in addition to the usual option, an option to upgrade the time zone data, which will upgrade the time zone from 11 to 14.
Below is the summary before and after the upgrade.



Once the upgrade is complete the runInstaller will complete and concludes the upgrade process. As with GI Home set any environment variables to new Oracle Home.

Now it is possible to deinstall the "old" GI Home and Oracle Home. Either the deinstall within those Homes or an external deinstall installation (unzipping p10098816_112020_platform_7of7.zip) could be used for this.
cd /opt/app/oracle/product/11.2.0/clusdb_1/deinstall
unset ORACLE_HOME
./deinstall -local
..
Checking for existence of the Oracle home location /opt/app/oracle/product/11.2.0/clusdb_1
Oracle Home type selected for de-install is: RACDB
Oracle Base selected for de-install is: /opt/app/oracle
Checking for existence of central inventory location /opt/app/oraInventory
Checking for existence of the Oracle Grid Infrastructure home /opt/app/oracle/product/11.2.0/clusdb_1
The following nodes are part of this cluster: hpc3
..
Specify the list of database names that are configured in this Oracle home []:
Database Check Configuration END
..
Oracle Grid Infrastructure Home is: /opt/app/oracle/product/11.2.0/clusdb_1
The cluster node(s) on which the Oracle home exists are: (Please input nodes seperated by ",", eg: node1,node2,...)hpc3
Since -local option has been specified, the Oracle home will be de-installed only on the local node, 'hpc3', and the global configuration will be removed.
Oracle Home selected for de-install is: /opt/app/oracle/product/11.2.0/clusdb_1
It is imporant to note that database should not be running out of this Oracle Home. Similarly GI Home could also be deinstalled. In some cases during deinstall the directories inside GI Home may not get deleted and error message "directory in use" could be seen. But all the files inside the directories would be deleted (except in bin directory) and GI Home would be listed in inventory.xml as removed. In this case the "old" GI Home could be deleted with OS command.

Wednesday, March 2, 2011

Fast Connection Failover Behavior Change in 11.2

Fast Connection Failover (FCF) allows jdbc connections to failover to surviving nodes in a RAC when one or more nodes are unavailable. This is very useful to have with middleware application server connection pools so when rolling upgrades or patch application is done on the database the connections in the middleware server would failover to surviving instances and client connections would not receive any errors.

It seem with 11.2.0.2 and 11.2.0.1 this doesn't work same as in 11gR1 and 10gR2.

There are two metalink notes available to test fcf on 10gR2 (How to Verify and Test Fast Connection Failover (FCF) Setup from a JDBC Thin Client Against a 10.2.x RAC Cluster [ID 433827.1]) and 11gR1 (Fast Connection Failover (FCF) Test Client Using 11g JDBC Driver and 11g RAC Cluster [ID 566573.1]).

In this case the java code provided with 566573.1 was used to test the failover of 10gR2 (10.2.0.4) , 11gR1 (11.1.0.7) and 11gR2 (11.2.0.1 and 11.2.0.2)

There are some differences between 433827.1 and 566573.1 note. In 10gR2 note the ons configuration uses vip hostname whereas in the 11gR1 note machine hostname is used. But for the testing purpose vip hostnames were used throughout (and 11gR1 has been deployed on production system using the vip hostname for ons configuration and failover worked without any problem), for 11gR2 it was tested with both vip hostname and hostname but there was no change in the result.

Furthermore the ons remote port on 11gR2 is changed to 6200 (same as 10gR2) but on 11gR1 it was 6251. The onsctl ping command on 10gR2 and 11gR1 provided the remote port this is no longer the case on 11gR2 but remote port could be observed with onsctl debug.
$GI_HOME/bin/onsctl ping
ons is running ...

$GI_HOME/bin/onsctl debug
HTTP/1.1 200 OK
Content-Length: 2488
Content-Type: text/html
Response:

== rac4.domain.net:6200 3640 11/03/02 15:36:32 ==
Home: /opt/app/11.2.0/grid

======== ONS ========
IP ADDRESS                   PORT    TIME   SEQUENCE  FLAGS
--------------------------------------- ----- -------- -------- --------
192.168.0.86  6200 4d6e6210 00000002 00000008

Listener:

TYPE                BIND ADDRESS               PORT  SOCKET
-------- --------------------------------------- ----- ------
Local                                  127.0.0.1  6100      5
Remote                                       any  6200      7
Remote                                       any  6200      
ojdbc6.jar and ons.jar from a 11gR2 home was used for all the test.

The test code gets 5 jdbc connections from the connection pool and prints to which instance these connections point to and close them and waits 30 seconds and re-tries the same procedure. During the 30 second wait one of the instnaces is shutdown with
srvctl stop instance -d dbname -i instancename -o abort
If the FCF is working properly then connections to the aborted instance would have been cleaned up and all the connections in the pool will be valid connections, connected to surviving node(s).

Output from running against the 10gR2 RAC
java support.au.fcf11g.FCFDemo
>> PROGRAM using JDBC thin driver, no oracle client required
>> ojdbc5.jar (for JDK 1.5) or ojdbc6.jar (for JDK 1.6) and ons.jar from $ORACLE_HOME/opmn/lib directory must be in the CLASSPATH
>> Press CNTRL C to exit running program

** Retrieving 5 connections from pool **
5 connections have been retrieved..

=============
Database Product Name is ... Oracle
Database Product Version is  Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
=============
JDBC Driver Name is ........ Oracle JDBC driver
JDBC Driver Version is ..... 11.2.0.2.0
JDBC Driver Major Version is 11
JDBC Driver Minor Version is 2
=============

-- Connection number #1
Instance -> rac10g21
Host -> rac1
Service -> rac10g2.domain.net
-- Connection number #2
Instance -> rac10g21
Host -> rac1
Service -> rac10g2.domain.net
-- Connection number #3
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #4
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #5
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
Woke up, will continue now..

** Retrieving 5 connections from pool **
5 connections have been retrieved..

-- Connection number #1
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #2
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #3
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #4
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net
-- Connection number #5
Instance -> rac10g22
Host -> rac2
Service -> rac10g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
The two connections that were connected to instance rac10g21 has been cleaned up and all the connections now point to rac10g22.

Output from running against 11gR1 RAC
java support.au.fcf11g.FCFDemo
>> PROGRAM using JDBC thin driver, no oracle client required
>> ojdbc5.jar (for JDK 1.5) or ojdbc6.jar (for JDK 1.6) and ons.jar from $ORACLE_HOME/opmn/lib directory must be in the CLASSPATH
>> Press CNTRL C to exit running program

** Retrieving 5 connections from pool **
5 connections have been retrieved..

=============
Database Product Name is ... Oracle
Database Product Version is  Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
=============
JDBC Driver Name is ........ Oracle JDBC driver
JDBC Driver Version is ..... 11.2.0.2.0
JDBC Driver Major Version is 11
JDBC Driver Minor Version is 2
=============

-- Connection number #1
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #2
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #3
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #4
Instance -> rac11g12
Host -> rac2
Service -> rac11g1.domain.net
-- Connection number #5
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
Woke up, will continue now..

** Retrieving 5 connections from pool **
5 connections have been retrieved..

-- Connection number #1
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #2
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #3
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #4
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net
-- Connection number #5
Instance -> rac11g11
Host -> rac1
Service -> rac11g1.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
There was one connection to rac11g12 instnace and this has been cleaned up and all the connections now point to rac11g11.

Now the 11.2 RACs first against 11.2.0.1
java support.au.fcf11g.FCFDemo
>> PROGRAM using JDBC thin driver, no oracle client required
>> ojdbc5.jar (for JDK 1.5) or ojdbc6.jar (for JDK 1.6) and ons.jar from $ORACLE_HOME/opmn/lib directory must be in the CLASSPATH
>> Press CNTRL C to exit running program

** Retrieving 5 connections from pool **
5 connections have been retrieved..

=============
Database Product Name is ... Oracle
Database Product Version is  Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
=============
JDBC Driver Name is ........ Oracle JDBC driver
JDBC Driver Version is ..... 11.2.0.2.0
JDBC Driver Major Version is 11
JDBC Driver Minor Version is 2
=============

-- Connection number #1
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #2
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #3
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #4
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #5
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
Woke up, will continue now..

** Retrieving 5 connections from pool **
5 connections have been retrieved..

-- Connection number #1
Error running query: No more data to read from socket
-- Connection number #2
Error running query: No more data to read from socket
-- Connection number #3
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #4
Error running query: No more data to read from socket
-- Connection number #5
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 3
-----------

Sleeping for 30 seconds..
As seen from the above output the FCF hasn't happened and an error is thrown on the java program.

Running against 11.2.0.2
java support.au.fcf11g.FCFDemo
>> PROGRAM using JDBC thin driver, no oracle client required
>> ojdbc5.jar (for JDK 1.5) or ojdbc6.jar (for JDK 1.6) and ons.jar from $ORACLE_HOME/opmn/lib directory must be in the CLASSPATH
>> Press CNTRL C to exit running program

** Retrieving 5 connections from pool **
5 connections have been retrieved..

=============
Database Product Name is ... Oracle
Database Product Version is  Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
=============
JDBC Driver Name is ........ Oracle JDBC driver
JDBC Driver Version is ..... 11.2.0.2.0
JDBC Driver Major Version is 11
JDBC Driver Minor Version is 2
=============

-- Connection number #1
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #2
Instance -> rac11g22
Host -> rac5
Service -> rac11g2.domain.net
-- Connection number #3
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #4
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #5
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 0
-----------

Sleeping for 30 seconds..
Woke up, will continue now..

** Retrieving 5 connections from pool **
5 connections have been retrieved..

-- Connection number #1
Error running query: No more data to read from socket
-- Connection number #2
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #3
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #4
Instance -> rac11g21
Host -> rac4
Service -> rac11g2.domain.net
-- Connection number #5
Error running query: No more data to read from socket

-- Displaying Cache Details --
NumberOfAvailableConnections: 0
NumberOfActiveConnections: 5
Number of query failures: 2
-----------

Sleeping for 30 seconds..
Since the FCF is not working when the application receives connections that are connected to the instnace that was shutdown error is thrown.

SR has been raised blog will be updated with the outcome.





Update 03 March 2011

Oracle has pointed out several metalink note regarding this. First it was stated this is expected behavior and mentioned in the metalink note
How To Use JDBC FCF Feature To Detect That a RAC Node Instance Has Been Shutdown ? [ID 364005.1]This error would have thrown if a node went down after a connection is taken from the pool and before doing some work with it. But the behavior change does not lie there but how connection pool cleans up invalid connections.

Two more metalink notes were pointed out Oracle Universal Connection Pool (UCP) and Deprecation of JDBC Implicit Connection Cache [ID 743726.1] and How to Verify Universal Connection Pool (UCP) / Fast Connection Failover (FCF) Setup [ID 1064652.1]

It seem with 11gR2 implicit connectoin caching is deprecated. Tests wer done using the code provided with 1064652.1 on 11gR2 and 11gR1 and still (with this whatever implicit caching mechanism present or not present on the jdbc side using UCP was common on both test cases) the difference on connection pool clean up is there. First test was on 11gR2 RAC.
----------- UCP Details ---------
NumberOfAvailableConnections: 4 <== 4 connection in the pool
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #13  -> instance[rac11g22], host[rac5], service[rac11g2.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 4
BorrowedConnectionsCount: 1
---------------------------------

====================== instance shutdown =============================

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Connection retry necessary : No more data to read from socket
** FCFTest :
** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Connection retry necessary : No more data to read from socket <== expected behavior on 11gR2
** FCFTest :
** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #16  -> instance[rac11g22], host[rac5], service[rac11g2.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 2 <== connection pool cleaned up of invalid connections
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #17  -> instance[rac11g22], host[rac5], service[rac11g2.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 2
BorrowedConnectionsCount: 1
---------------------------------
Now with 11gR1
java au.support.jdbc.scan.fcf.FCFTest
Started UCP FCF Test at Thu Mar 03 11:21:03 GMT 2011
** UCPPool : retrieveConnection
** FCFTest : Query #1  -> instance[rac11g12], host[rac2], service[rac11g1.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 4 <== 4 connections in the pool
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #2  -> instance[rac11g11], host[rac1], service[rac11g1.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 4
BorrowedConnectionsCount: 1
---------------------------------
================== instance shutdown and connection pool gets cleaned up immediately.

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #3  -> instance[rac11g12], host[rac2], service[rac11g1.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 1 <== pool is cleaned up immediately no connection retry
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #4  -> instance[rac11g12], host[rac2], service[rac11g1.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 1
BorrowedConnectionsCount: 1
---------------------------------
SR is ongoing will be updated.

Update 11 March 2011
Running the ucp test case with logging enabled showed the failover events in 11gR2 was missing the domain part of the service name whereas in 11gR1 the full service name including domain was sent. Logs below from the test for 11gR1
** UCPPool : connection returned to pool
2011-03-07T12:15:57.904+0000 UCP FINEST seq-43,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setEventType eventType: database/event/service
2011-03-07T12:15:57.904+0000 UCP FINEST seq-44,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.validateEventType eventType: database/event/service
2011-03-07T12:15:57.904+0000 UCP FINEST seq-45,thread-11 oracle.ucp.jdbc.oracle.ONSDatabaseFailoverEvent. eventType: database/event/service, eventBody: VERSION=1.0 service=rac11g1.domain.net instance=rac11g11 database=rac11g1 host=rac1 status=down reason=user
2011-03-07T12:15:57.905+0000 UCP FINEST seq-46,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setServiceName serviceName: rac11g1.domain.net
2011-03-07T12:15:57.905+0000 UCP FINEST seq-47,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setInstanceName instanceName: rac11g11
2011-03-07T12:15:57.905+0000 UCP FINEST seq-48,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setDbUniqueName dbUniqueName: rac11g1
2011-03-07T12:15:57.906+0000 UCP FINEST seq-49,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setHostName hostName: rac1
2011-03-07T12:15:57.906+0000 UCP FINEST seq-50,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setStatus status: down
2011-03-07T12:15:57.906+0000 UCP FINEST seq-51,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setReason reason: user
On 11gR2 the domain name is missing on the service name
** UCPPool : connection returned to pool
2011-03-07T12:41:47.951+0000 UCP FINEST seq-49,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setEventType eventType: database/event/service
2011-03-07T12:41:47.951+0000 UCP FINEST seq-50,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.validateEventType eventType: database/event/service
2011-03-07T12:41:47.952+0000 UCP FINEST seq-51,thread-11 oracle.ucp.jdbc.oracle.ONSDatabaseFailoverEvent. eventType: database/event/service, eventBody: VERSION=1.0 service=rac11g2 instance=rac11g21 database=rac11g2 host=rac4 status=down reason=USER
2011-03-07T12:41:47.952+0000 UCP FINEST seq-52,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setServiceName serviceName: rac11g2
2011-03-07T12:41:47.952+0000 UCP FINEST seq-53,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setInstanceName instanceName: rac11g21
2011-03-07T12:41:47.953+0000 UCP FINEST seq-54,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setDbUniqueName dbUniqueName: rac11g2
2011-03-07T12:41:47.953+0000 UCP FINEST seq-55,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setHostName hostName: rac4
2011-03-07T12:41:47.953+0000 UCP FINEST seq-56,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setStatus status: down
2011-03-07T12:41:47.954+0000 UCP FINEST seq-57,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setReason reason: user
According to Oracle Documentation provided throught the SRthis service name should match the service name in dba_services. Looking at this dba view
SQL> select service_id,name from dba_services;

SERVICE_ID NAME
---------- ------------
1 SYS$BACKGROUND
2 SYS$USERS
3 rac11g2sXDB
4 rac11g2s.domain.net
5 rac11g2XDB
6 rac11g2.domain.net

6 rows selected.

SQL> show parameter service

NAME TYPE VALUE
------------- ------ -------------------
service_names string rac11g2.domain.net
Pointed this out to oracle and was told this is related to "RDBMS Bug 9363352: SERVICE NAME HAS NO DOMAIN FOR UP/DOWN EVENTS, closed as not a bug" and at this point Oracle couldn't reproduce the error seen above (No more data to read from socket). Looking at the service name oracle is using noticed that it is a service name without a domain.

So created service name without a domain and tested.
srvctl add service -d rac11g2 -s nodomain -r "rac11g21,rac11g22"

srvctl start service -d rac11g2 -s nodomain

srvctl status service -d rac11g2 -s nodomain
Service nodomain is running on instance(s) rac11g21,rac11g22

SQL> show parameter service

NAME TYPE VALUE
------------- ------ ----------------------------
service_names string rac11g2.domain.net,nodomain

SQL> select name from dba_services;

NAME
---------------------
SYS$BACKGROUND
SYS$USERS
rac11g2sXDB
rac11g2s.domain.net
rac11g2XDB
rac11g2.domain.net
nodomain

$ lsnrctl status

LSNRCTL for Linux: Version 11.2.0.2.0 - Production on 11-MAR-2011 10:34:57

Copyright (c) 1991, 2010, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 11.2.0.2.0 - Production
Start Date                07-MAR-2011 12:48:29
Uptime                    3 days 21 hr. 46 min. 27 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /opt/app/11.2.0/grid/network/admin/listener.ora
Listener Log File         /opt/app/oracle/diag/tnslsnr/rac4/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.86)(PORT=1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.90)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
Instance "+ASM1", status READY, has 1 handler(s) for this service...
Service "nodomain.domain.net" has 1 instance(s).
Instance "rac11g21", status READY, has 1 handler(s) for this service...
Service "rac11g2.domain.net" has 2 instance(s).
Instance "rac11g21", status UNKNOWN, has 1 handler(s) for this service...
Instance "rac11g21", status READY, has 1 handler(s) for this service...
Ran the test with logging enabled. This time the service name has the domain name even though it was created without a one and connection pool get refresh and validated quickly just as it was with 11gR1
** UCPPool : connection returned to pool
2011-03-11T10:25:00.022+0000 UCP FINEST seq-75,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setEventType eventType: database/event/service
2011-03-11T10:25:00.022+0000 UCP FINEST seq-76,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.validateEventType eventType: database/event/service
2011-03-11T10:25:00.022+0000 UCP FINEST seq-77,thread-11 oracle.ucp.jdbc.oracle.ONSDatabaseFailoverEvent. eventType: database/event/service, eventBody: VERSION=1.0 service=nodomain.domain.net instance=rac11g22 database=rac11g2 host=rac5 status=down reason=USER
2011-03-11T10:25:00.023+0000 UCP FINEST seq-78,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setServiceName serviceName: nodomain.domain.net
2011-03-11T10:25:00.023+0000 UCP FINEST seq-79,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setInstanceName instanceName: rac11g22
2011-03-11T10:25:00.023+0000 UCP FINEST seq-80,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setDbUniqueName dbUniqueName: rac11g2
2011-03-11T10:25:00.024+0000 UCP FINEST seq-81,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setHostName hostName: rac5
2011-03-11T10:25:00.024+0000 UCP FINEST seq-82,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setStatus status: down
2011-03-11T10:25:00.024+0000 UCP FINEST seq-83,thread-11 oracle.ucp.jdbc.oracle.OracleFailoverEventImpl.setReason reason: user
Waiting for Oracle reply on this.

Update 06 April 2011
Oracle came back with this is similar to "unpublished bug 9740127 fixed in version 12" also mentioned couple of bugs 8779597 and 9740127 and pointed out that if a application service was created with domain name (instead of using the default database service) failover works as expected.

So created an application service (additional service as before)
srvctl add service -d rac11g2 -s withdomain.domain.net -r "rac11g21,rac11g22"
Ran the test again and the connection failover was as before.
** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #9  -> instance[rac11g21], host[rac4], service[withdomain.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 4
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #10  -> instance[rac11g21], host[rac4], service[withdomain.domain.net]

----------- UCP Details ---------
NumberOfAvailableConnections: 0
BorrowedConnectionsCount: 1
---------------------------------
In other versions (10gR2 and 11gR1) default service with or without domain would work fine in a failover situation.
Created another database without a domain name and ran the test again and saw that failover wasn't working as expected.
** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #181  -> instance[testdb1], host[rac4], service[testdb]

----------- UCP Details ---------
NumberOfAvailableConnections: 1
BorrowedConnectionsCount: 1
---------------------------------

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Connection retry necessary : No more data to read from socket
** FCFTest : Apr 6, 2011 2:14 PM SUCCESS     Connections:(Available=1 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=1 upEventNewConns=1
Apr 6, 2011 2:14 PM SUCCESS      Connections:(Available=2 Affected=1 FailedToProcess=0 MarkedDown=1 Closed=1)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:13 PM SUCCESS     Connections:(Available=1 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=1 upEventNewConns=1
Apr 6, 2011 2:12 PM SUCCESS      Connections:(Available=1 Affected=1 FailedToProcess=0 MarkedDown=1 Closed=1)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:12 PM SUCCESS     Connections:(Available=1 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=1 upEventNewConns=1
Apr 6, 2011 2:11 PM SUCCESS      Connections:(Available=2 Affected=1 FailedToProcess=0 MarkedDown=1 Closed=1)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:11 PM SUCCESS     Connections:(Available=1 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=1 upEventNewConns=1
Apr 6, 2011 2:10 PM SUCCESS      Connections:(Available=3 Affected=2 FailedToProcess=0 MarkedDown=2 Closed=2)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:10 PM SUCCESS     Connections:(Available=2 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=2 upEventNewConns=2
Apr 6, 2011 2:09 PM SUCCESS      Connections:(Available=6 Affected=4 FailedToProcess=0 MarkedDown=4 Closed=4)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)
Apr 6, 2011 2:09 PM SUCCESS     Connections:(Available=3 Affected=0 FailedToProcess=0 MarkedDown=0 Closed=0)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0) cardinality=2 targetedToTear=0 tornDown=0 markedToClose=0 targetUpEventNewConns=3 upEventNewConns=3
Apr 6, 2011 2:08 PM SUCCESS      Connections:(Available=5 Affected=2 FailedToProcess=0 MarkedDown=2 Closed=2)(Borrowed=0 Affected=0 FailedToProcess=0 MarkedDown=0 MarkedDeferredClose=0 Closed=0)

** UCPPool : connection returned to pool
** UCPPool : retrieveConnection
** FCFTest : Query #183  -> instance[testdb2], host[rac5], service[testdb]

----------- UCP Details ---------
NumberOfAvailableConnections: 0
BorrowedConnectionsCount: 1
---------------------------------
Oracle confirmed this is related to using default database service and suggested to create and use application services instead of the default database service.

Update 12 March 2012
UCP delay in processing ONS events during data guard or RAC failover using FCF [ID 1380527.1]
Apply 12596492 patch (available for 11.2.0.2 and 11.2.0.3)