Scenario 2.
1. Only the OCR is in the ASM diskgroup
2. The ASM diskgroup has normal redundancy with only two failure groups
3. Both failure groups are affected
4. Vote disk is located in a separate location (as in scenario 1)
In this case an OCR restoration is required.
1. If testing, create a manual backup of the OCR with
ocrconfig -manualbackup
2. Identify the disks in the disk group where the OCR resides and corrupt them to simulate disk failure
SQL> select d.name,g.name from v$asm_disk d, v$asm_diskgroup g
where d.group_number=g.group_number and g.name='CLUSTERDG';
NAME NAME
------ ----------
CLUS1 CLUSTERDG
CLUS3 CLUSTERDG
# /etc/init.d/oracleasm querydisk -p clus1
Disk "CLUS1" is a valid ASM disk
/dev/sdc2: LABEL="CLUS1" TYPE="oracleasm"
# /etc/init.d/oracleasm querydisk -p clus3
Disk "CLUS3" is a valid ASM disk
/dev/sdc10: LABEL="CLUS3" TYPE="oracleasm"
# dd if=/dev/zero of=/dev/sdc2 count=20480 bs=8192
20480+0 records in
20480+0 records out
167772160 bytes (168 MB) copied, 0.155909 seconds, 1.1 GB/s
# dd if=/dev/zero of=/dev/sdc10 count=20480 bs=8192
20480+0 records in
20480+0 records out
167772160 bytes (168 MB) copied, 0.157382 seconds, 1.1 GB/s
3. Try checking the OCR integrity
ocrcheck
4. To hasten the detection of the OCR failure, issue the manual OCR backup command (as in step 1). This will log the OCR failure in the cluster alert log
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
2010-08-13 16:29:24.591
[crsd(15464)]CRS-1006:The OCR location +CLUSTERDG is inaccessible. Details in /opt/app/11.2.0/grid/log/hpc1/crsd/crsd.log.
2010-08-13 16:29:34.618
[/opt/app/11.2.0/grid/bin/orarootagent.bin(15599)]CRS-5822:Agent '/opt/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) in /opt/app/11.2.0/grid/log/hpc1/agent/crsd/orarootagent_root/orarootagent_root.log.
2010-08-13 16:29:35.363
[ohasd(13310)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:29:37.385
[ohasd(13310)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:29:39.408
[ohasd(13310)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:29:41.429
[ohasd(13310)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:29:55.583
[ohasd(13310)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
and in crsd.log
2010-08-13 16:29:54.798: [GIPCXCPT][1291424336] gipcShutdownF: skipping shutdown, count 1, from [ clsgpnp0.c : 1021], ret gipcretSuccess (0)
2010-08-13 16:29:54.841: [ OCRRAW][1291424336]propriogid:1_2: INVALID FORMAT
2010-08-13 16:29:54.841: [ OCRRAW][1291424336]proprioini: all disks are not OCR/OLR formatted
2010-08-13 16:29:54.841: [ OCRRAW][1291424336]proprinit: Could not open raw device
2010-08-13 16:29:54.844: [ OCRAPI][1291424336]a_init:16!: Backend init unsuccessful : [26]
2010-08-13 16:29:54.844: [ CRSOCR][1291424336] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2010-08-13 16:29:54.844: [ CRSD][1291424336][PANIC] CRSD exiting: Could not init OCR, code: 26
2010-08-13 16:29:54.844: [ CRSD][1291424336] Done.
But the cluster database instance on that node remains open for access.
5. Stop the CRS on all the nodes with the force option, since the normal CRS stop command will fail due to the OCR corruption
crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with error(s).
CRS-4000: Command Stop failed, or completed with errors.
crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.evmd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc1'
CRS-2677: Stop of 'ora.cssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc1'
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc1'
CRS-2677: Stop of 'ora.gipcd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
If the command seems stuck at CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded or at any other point, check if there are any sessions (sqlplus or otherwise) on the ASM instance. This would be visible in the cluster alert log with
[client(23327)]CRS-10001:Waiting for ASM to shutdown.
[client(23333)]CRS-10001:Waiting for ASM to shutdown.
[client(23340)]CRS-10001:Waiting for ASM to shutdown.
[client(23346)]CRS-10001:Waiting for ASM to shutdown.
Stop the database instance and the ASM instance manually, and the crs stop command will then run to completion.
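To spot such lingering sessions, here is a minimal sketch; the `find_asm_clients` helper and its patterns are illustrative, not an Oracle-supplied tool, so tune the patterns to your environment:

```shell
#!/bin/sh
# Hypothetical helper: filter a "pid args" process listing for processes
# that commonly hold ASM connections (sqlplus, asmcmd, +ASM clients).
find_asm_clients() {
  grep -E 'sqlplus|asmcmd|\+ASM' | grep -v grep
}

# Live usage (output depends on the machine):
#   ps -eo pid,args | find_asm_clients

# Demonstration on a captured sample listing; only the first two
# lines match, /bin/bash is filtered out:
printf '%s\n' \
  '23327 sqlplus / as sysasm' \
  '23333 asmcmd lsdg' \
  '24001 /bin/bash' | find_asm_clients
```

Kill or exit any processes it reports before retrying the stop.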
6. Repair the corrupted disks and make them candidate disks for ASM
# /etc/init.d/oracleasm deletedisk clus1
Removing ASM disk "clus1": [ OK ]
# /etc/init.d/oracleasm deletedisk clus3
Removing ASM disk "clus3": [ OK ]
# /etc/init.d/oracleasm createdisk clus1 /dev/sdc2
Marking disk "clus1" as an ASM disk: [ OK ]
# /etc/init.d/oracleasm createdisk clus3 /dev/sdc10
Marking disk "clus3" as an ASM disk: [ OK ]
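The delete/create pairs above can be scripted. A sketch that echoes the commands as a dry run for review before executing them as root; the disk/device pairs are the ones from this scenario:

```shell
#!/bin/sh
# Dry run: print the oracleasm commands for each repaired disk/device pair.
# Review the output, then re-run piped to "sh" as root to execute for real.
for pair in clus1:/dev/sdc2 clus3:/dev/sdc10; do
  disk=${pair%%:*}   # label before the colon
  dev=${pair#*:}     # device path after the colon
  echo "/etc/init.d/oracleasm deletedisk $disk"
  echo "/etc/init.d/oracleasm createdisk $disk $dev"
done
```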
7. Start the cluster in exclusive mode on one node only, as root, with
crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.gipcd' on 'hpc1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'hpc1'
CRS-2676: Start of 'ora.gipcd' on 'hpc1' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'hpc1'
CRS-2676: Start of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'hpc1'
CRS-2676: Start of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'hpc1'
CRS-2679: Attempting to clean 'ora.diskmon' on 'hpc1'
CRS-2681: Clean of 'ora.diskmon' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.diskmon' on 'hpc1'
CRS-2676: Start of 'ora.diskmon' on 'hpc1' succeeded
CRS-2676: Start of 'ora.cssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'hpc1'
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'hpc1'
CRS-2676: Start of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'hpc1'
CRS-2676: Start of 'ora.asm' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'hpc1'
CRS-2676: Start of 'ora.crsd' on 'hpc1' succeeded
Observe the output in crsd.log
2010-08-13 16:46:29.224: [ OCRRAW][3701277264]proprioo: No OCR/OLR devices are usable
2010-08-13 16:46:29.224: [ OCRASM][3701277264]proprasmcl: asmhandle is NULL
2010-08-13 16:46:29.224: [ OCRRAW][3701277264]proprinit: Could not open raw device
2010-08-13 16:46:29.224: [ OCRASM][3701277264]proprasmcl: asmhandle is NULL
2010-08-13 16:46:29.224: [ OCRAPI][3701277264]a_init:16!: Backend init unsuccessful : [26]
2010-08-13 16:46:29.225: [ CRSOCR][3701277264] OCR context init failure. Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +CLUSTERDG.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +CLUSTERDG.255.4294967295
ORA-15001: disk
] [8]
2010-08-13 16:46:29.225: [ CRSD][3701277264][PANIC] CRSD exiting: Could not init OCR, code: 26
2010-08-13 16:46:29.225: [ CRSD][3701277264] Done.
and in the cluster alert log
2010-08-13 16:46:28.101
[ohasd(23658)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:46:29.224
[crsd(24464)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /opt/app/11.2.0/grid/log/hpc1/crsd/crsd.log.
2010-08-13 16:46:30.121
[ohasd(23658)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.
2010-08-13 16:46:30.121
[ohasd(23658)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
If the crsd starts, then stop it with
crsctl stop resource ora.crsd -init
8. Recreate the failed diskgroup
create diskgroup clusterdg normal redundancy failgroup failgp1 disk 'ORCL:CLUS1'
failgroup failgp2 disk 'ORCL:CLUS3' ATTRIBUTE 'compatible.asm' = '11.2';
9. Restore the OCR from backup using ocrconfig, as root
# ocrconfig -restore /opt/app/11.2.0/grid/cdata/crs_10g_cluster/backup_20100813_162434.ocr
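If you are unsure which backup file is the most recent, a sketch that picks the newest *.ocr file by modification time; the BACKUP_DIR path mirrors this scenario and should be adjusted to your own $GRID_HOME/cdata/<cluster_name>:

```shell
#!/bin/sh
# Pick the newest OCR backup by modification time. BACKUP_DIR here is
# this scenario's path; substitute your own Grid home and cluster name.
BACKUP_DIR=${BACKUP_DIR:-/opt/app/11.2.0/grid/cdata/crs_10g_cluster}

# ls -t sorts by modification time, newest first
latest=$(ls -t "$BACKUP_DIR"/*.ocr 2>/dev/null | head -1)
echo "most recent OCR backup: $latest"
# then, as root:  ocrconfig -restore "$latest"
```

`ocrconfig -showbackup` (run as root) also lists the known automatic and manual backups with their timestamps.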
In the cluster alert log
2010-08-13 16:54:03.655
[client(24568)]CRS-1002:The OCR was restored from /opt/app/11.2.0/grid/cdata/crs_10g_cluster/backup_20100813_162434.ocr.
10. Run ocrcheck to verify the cluster registry integrity is intact
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 148348
Used space (kbytes) : 4580
Available space (kbytes) : 143768
ID : 552644455
Device/File Name : +CLUSTERDG
Device/File integrity check succeeded
Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check succeeded
11. Stop the cluster with the force option
crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc1'
CRS-2677: Stop of 'ora.cssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc1'
CRS-2677: Stop of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc1'
CRS-2677: Stop of 'ora.gipcd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
and start the CRS in normal mode on all the nodes
crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
Now the cluster is up and running. Verify with cluvfy
cluvfy comp ocr -n all -verbose
Verifying OCR integrity
Checking OCR integrity...
Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations
ASM Running check passed. ASM is running on all cluster nodes
Checking OCR config file "/etc/oracle/ocr.loc"...
OCR config file "/etc/oracle/ocr.loc" check successful
Disk group for ocr location "+CLUSTERDG" available on all the nodes
Checking size of the OCR location "+CLUSTERDG" ...
Size check for OCR location "+CLUSTERDG" successful...
WARNING:
This check does not verify the integrity of the OCR contents. Execute 'ocrcheck' as a privileged user to verify the contents of OCR.
OCR integrity check passed
Verification of OCR integrity was successful.
Useful Metalink note
How to restore ASM based OCR after complete loss of the CRS diskgroup on Linux/Unix systems [ID 1062983.1]