Friday, August 13, 2010

Restoring OCR due to ASM disk failures - 2

This is scenario 2, where all the disks in the disk group where the OCR is stored fail.

Scenario 2.
1. Only the OCR is stored in the ASM disk group
2. The ASM disk group has normal redundancy with only two failure groups
3. Both failure groups are affected
4. Vote disk resides in a separate location (as in scenario 1)


This case requires an OCR restoration.

1. If testing, create a manual backup of the OCR with
ocrconfig -manualbackup
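Run as root; the output should be along the lines of the following (the backup file shown is the one restored later in step 9, under the default $GRID_HOME/cdata/<cluster_name> location):

# ocrconfig -manualbackup

hpc1     2010/08/13 16:24:34     /opt/app/11.2.0/grid/cdata/crs_10g_cluster/backup_20100813_162434.ocr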
2. Identify the disks in the disk group where the OCR resides and corrupt them to simulate a disk failure
SQL> select d.name,g.name from v$asm_disk d, v$asm_diskgroup g 
where d.group_number=g.group_number and g.name='CLUSTERDG';

NAME       NAME
---------- ----------
CLUS1      CLUSTERDG
CLUS3      CLUSTERDG


# /etc/init.d/oracleasm querydisk -p clus1
Disk "CLUS1" is a valid ASM disk
/dev/sdc2: LABEL="CLUS1" TYPE="oracleasm"
# /etc/init.d/oracleasm querydisk -p clus3
Disk "CLUS3" is a valid ASM disk
/dev/sdc10: LABEL="CLUS3" TYPE="oracleasm"

# dd if=/dev/zero of=/dev/sdc2 count=20480 bs=8192
20480+0 records in
20480+0 records out
167772160 bytes (168 MB) copied, 0.155909 seconds, 1.1 GB/s
# dd if=/dev/zero of=/dev/sdc10 count=20480 bs=8192
20480+0 records in
20480+0 records out
167772160 bytes (168 MB) copied, 0.157382 seconds, 1.1 GB/s
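The wiped headers can be double checked with kfed, which ships under $GRID_HOME/bin; on a zeroed disk the header block type should come back as KFBTYP_INVALID. A minimal sketch:

# $GRID_HOME/bin/kfed read /dev/sdc2 | grep kfbh.type
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID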
3. Try checking the OCR integrity with
ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
4. To hasten the detection of the OCR failure, issue the manual OCR backup command (mentioned in step 1). This will log the OCR failure in the cluster alert log
2010-08-13 16:29:24.591
[crsd(15464)]CRS-1006:The OCR location +CLUSTERDG is inaccessible. Details in /opt/app/11.2.0/grid/log/hpc1/crsd/crsd.log.
2010-08-13 16:29:34.618
[/opt/app/11.2.0/grid/bin/orarootagent.bin(15599)]CRS-5822:Agent '/opt/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) in /opt/app/11.2.0/grid/log/hpc1/agent/crsd/orarootagent_root/orarootagent_root.log.
2010-08-13 16:29:35.363
[ohasd(13310)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:29:37.385
[ohasd(13310)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:29:39.408
[ohasd(13310)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:29:41.429
[ohasd(13310)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:29:55.583
[ohasd(13310)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
and crsd.log
2010-08-13 16:29:54.798: [GIPCXCPT][1291424336] gipcShutdownF: skipping shutdown, count 1, from [ clsgpnp0.c : 1021], ret gipcretSuccess (0)
2010-08-13 16:29:54.841: [ OCRRAW][1291424336]propriogid:1_2: INVALID FORMAT
2010-08-13 16:29:54.841: [ OCRRAW][1291424336]proprioini: all disks are not OCR/OLR formatted
2010-08-13 16:29:54.841: [ OCRRAW][1291424336]proprinit: Could not open raw device
2010-08-13 16:29:54.844: [ OCRAPI][1291424336]a_init:16!: Backend init unsuccessful : [26]
2010-08-13 16:29:54.844: [ CRSOCR][1291424336] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2010-08-13 16:29:54.844: [ CRSD][1291424336][PANIC] CRSD exiting: Could not init OCR, code: 26
2010-08-13 16:29:54.844: [ CRSD][1291424336] Done.
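With ora.crsd gone but the lower stack still up, crsctl check crs should report something along these lines (a sketch):

# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online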
However, the cluster database instance on that node is still open for access.
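This can be quickly confirmed from the database home (a minimal sketch; the instance name racdb1 is an assumption, the post does not name the database):

$ sqlplus / as sysdba
SQL> select instance_name, status from v$instance;

INSTANCE_NAME    STATUS
---------------- ------------
racdb1           OPEN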

5. Stop the CRS on all the nodes with the force option, since the normal CRS stop command will fail due to the OCR corruption
crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with error(s).
CRS-4000: Command Stop failed, or completed with errors.

crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.evmd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc1'
CRS-2677: Stop of 'ora.cssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc1'
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc1'
CRS-2677: Stop of 'ora.gipcd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
If the command seems stuck at
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
or any other point, check if there are any sessions (sqlplus or otherwise) on the ASM instance. This would be visible in the cluster alert log as
[client(23327)]CRS-10001:Waiting for ASM to shutdown.
[client(23333)]CRS-10001:Waiting for ASM to shutdown.
[client(23340)]CRS-10001:Waiting for ASM to shutdown.
[client(23346)]CRS-10001:Waiting for ASM to shutdown.
Stop the DB instance and ASM instance manually, as shown below, and the crs stop command will run to completion.
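A minimal sketch of the manual shutdown, run locally on the node (the SIDs racdb1 and +ASM1 are assumptions for this setup):

$ export ORACLE_SID=racdb1    # hypothetical database instance name
$ sqlplus / as sysdba
SQL> shutdown abort
$ export ORACLE_SID=+ASM1     # assumed ASM SID on node 1; use the grid home environment
$ sqlplus / as sysasm
SQL> shutdown abort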

6. Repair the corrupted disks and make them candidate disks for ASM
# /etc/init.d/oracleasm deletedisk clus1
Removing ASM disk "clus1": [ OK ]
# /etc/init.d/oracleasm deletedisk clus3
Removing ASM disk "clus3": [ OK ]
# /etc/init.d/oracleasm createdisk clus1 /dev/sdc2
Marking disk "clus1" as an ASM disk: [ OK ]
# /etc/init.d/oracleasm createdisk clus3 /dev/sdc10
Marking disk "clus3" as an ASM disk: [ OK ]
7. Start the cluster stack in exclusive mode on one node only, as root, with
crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.gipcd' on 'hpc1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'hpc1'
CRS-2676: Start of 'ora.gipcd' on 'hpc1' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'hpc1'
CRS-2676: Start of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'hpc1'
CRS-2676: Start of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'hpc1'
CRS-2679: Attempting to clean 'ora.diskmon' on 'hpc1'
CRS-2681: Clean of 'ora.diskmon' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.diskmon' on 'hpc1'
CRS-2676: Start of 'ora.diskmon' on 'hpc1' succeeded
CRS-2676: Start of 'ora.cssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'hpc1'
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'hpc1'
CRS-2676: Start of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'hpc1'
CRS-2676: Start of 'ora.asm' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'hpc1'
CRS-2676: Start of 'ora.crsd' on 'hpc1' succeeded
Observe the output in crsd.log
2010-08-13 16:46:29.224: [  OCRRAW][3701277264]proprioo: No OCR/OLR devices are usable
2010-08-13 16:46:29.224: [ OCRASM][3701277264]proprasmcl: asmhandle is NULL
2010-08-13 16:46:29.224: [ OCRRAW][3701277264]proprinit: Could not open raw device
2010-08-13 16:46:29.224: [ OCRASM][3701277264]proprasmcl: asmhandle is NULL
2010-08-13 16:46:29.224: [ OCRAPI][3701277264]a_init:16!: Backend init unsuccessful : [26]
2010-08-13 16:46:29.225: [ CRSOCR][3701277264] OCR context init failure. Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +CLUSTERDG.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +CLUSTERDG.255.4294967295
ORA-15001: disk
] [8]
2010-08-13 16:46:29.225: [ CRSD][3701277264][PANIC] CRSD exiting: Could not init OCR, code: 26
2010-08-13 16:46:29.225: [ CRSD][3701277264] Done.
and the cluster alert log
2010-08-13 16:46:28.101
[ohasd(23658)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:46:29.224
[crsd(24464)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /opt/app/11.2.0/grid/log/hpc1/crsd/crsd.log.
2010-08-13 16:46:30.121
[ohasd(23658)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:46:30.121
[ohasd(23658)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
If the crsd starts, then stop it with
crsctl stop resource ora.crsd -init
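In exclusive mode only the low-level init resources (ora.asm, ora.cssd, ora.crsd and so on) are present; their state can be inspected with (a sketch):

# crsctl stat res -t -init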
8. Recreate the failed disk group
SQL> create diskgroup clusterdg normal redundancy failgroup failgp1 disk 'ORCL:CLUS1'
failgroup failgp2 disk 'ORCL:CLUS3' ATTRIBUTE 'compatible.asm' = '11.2';
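The create diskgroup statement is run on the ASM instance as sysasm. To confirm the disk group is back and mounted, something like the following (a sketch):

SQL> select name, state, type from v$asm_diskgroup where name='CLUSTERDG';

NAME       STATE       TYPE
---------- ----------- ------
CLUSTERDG  MOUNTED     NORMAL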
9. Restore the OCR from backup using ocrconfig as root
# ocrconfig -restore /opt/app/11.2.0/grid/cdata/crs_10g_cluster/backup_20100813_162434.ocr
On the cluster alert log
2010-08-13 16:54:03.655
[client(24568)]CRS-1002:The OCR was restored from /opt/app/11.2.0/grid/cdata/crs_10g_cluster/backup_20100813_162434.ocr.
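If it is unclear which backup file to restore, ocrconfig -showbackup lists the known backups, one per line with the node name, timestamp and path (a sketch; the manual keyword limits the listing to manual backups):

# ocrconfig -showbackup manual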
10. Run ocrcheck to verify the cluster registry integrity is intact
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     148348
         Used space (kbytes)      :       4580
         Available space (kbytes) :     143768
         ID                       :  552644455
         Device/File Name         : +CLUSTERDG
                                    Device/File integrity check succeeded

                                    Device/File not configured

         Cluster registry integrity check succeeded

         Logical corruption check succeeded
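Since the vote disk lives outside this disk group (point 4 of the scenario), it should have survived untouched. This can be confirmed with crsctl, which lists each configured voting disk with its state and path (a sketch):

# crsctl query css votedisk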
11. Stop the cluster with the force option
crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc1'
CRS-2677: Stop of 'ora.cssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc1'
CRS-2677: Stop of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc1'
CRS-2677: Stop of 'ora.gipcd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
and start the CRS in normal mode on all the nodes
crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
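As a quick sanity check that the full stack came up everywhere, crsctl can query CSS, CRS and EVM cluster-wide; every node is reported in the same format as hpc1 below (a sketch):

# crsctl check cluster -all
**************************************************************
hpc1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************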
Now the cluster is up and running. Verify with cluvfy
cluvfy comp ocr -n all -verbose

Verifying OCR integrity

Checking OCR integrity...

Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations

ASM Running check passed. ASM is running on all cluster nodes

Checking OCR config file "/etc/oracle/ocr.loc"...

OCR config file "/etc/oracle/ocr.loc" check successful

Disk group for ocr location "+CLUSTERDG" available on all the nodes

Checking size of the OCR location "+CLUSTERDG" ...

Size check for OCR location "+CLUSTERDG" successful...

WARNING:
This check does not verify the integrity of the OCR contents. Execute 'ocrcheck' as a privileged user to verify the contents of OCR.

OCR integrity check passed

Verification of OCR integrity was successful.
Useful Metalink note:
How to restore ASM based OCR after complete loss of the CRS diskgroup on Linux/Unix systems [ID 1062983.1]