Friday, September 17, 2010

Restoring OCR and Vote disk due to ASM disk failures - 2

An earlier post covered restoring the OCR and vote disk when only one of the disks in the ASM disk group containing them fails.

This post covers the case where two disks in the ASM diskgroup fail. The procedure differs from that for losing two disks when only one of the clusterware files (OCR or vote disk) is in the disk group.

Because the diskgroup (normal redundancy in this case) loses its quorum, it cannot be mounted. The ASM diskgroup has to be recreated, and then the OCR restored first, followed by the vote disks.

Scenario 2.
1. Both OCR and Vote disks are in ASM diskgroup
2. ASM diskgroup has normal redundancy with only three failure groups
3. Only two failure groups are affected
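
The quorum problem in this scenario comes down to simple arithmetic: with three voting files, one per failure group, more than half of them must stay accessible. A minimal sketch of that rule follows; has_majority is a made-up helper name for illustration and is not part of any Oracle tool.

```shell
#!/bin/sh
# has_majority ONLINE TOTAL
# Succeeds (exit 0) when ONLINE is a strict majority of TOTAL.
# Hypothetical helper, only illustrating the more-than-half rule.
has_majority() {
    online=$1
    total=$2
    [ $((online * 2)) -gt "$total" ]
}

# One failed disk out of three: majority kept.
has_majority 2 3 && echo "2 of 3: majority kept"
# Two failed disks out of three (this scenario): majority lost.
has_majority 1 3 || echo "1 of 3: majority lost"
```

With one disk lost the surviving two still form a majority, which is why the earlier single-disk case was simpler to recover from.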

1. Current OCR and vote disk configuration.
ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 2616
Available space (kbytes) : 259504
ID : 2002737697
Device/File Name : +CLUSTERDG
Device/File integrity check succeeded
Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check succeeded

crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 34845aa0b29d4f36bf8743e3506eba12 (ORCL:CLUS2) [CLUSTERDG]
2. ONLINE 1200c8daed494fbabf6f64ee6e07fde8 (ORCL:CLUS3) [CLUSTERDG]
3. ONLINE 98557cefaca24fcdbf8807f3dd1fbd29 (ORCL:CLUS1) [CLUSTERDG]
Located 3 voting disk(s).
2. Corrupt the disks to simulate disk failure.
# /etc/init.d/oracleasm querydisk -p clus1
Disk "CLUS1" is a valid ASM disk
/dev/sdc2: LABEL="CLUS1" TYPE="oracleasm"

# /etc/init.d/oracleasm querydisk -p clus2
Disk "CLUS2" is a valid ASM disk
/dev/sdc3: LABEL="CLUS2" TYPE="oracleasm"

# dd if=/dev/zero of=/dev/sdc2 count=204800 bs=8192
204800+0 records in
204800+0 records out
1677721600 bytes (1.7 GB) copied, 1.6441 seconds, 1.0 GB/s

# dd if=/dev/zero of=/dev/sdc3 count=204800 bs=8192
204800+0 records in
204800+0 records out
1677721600 bytes (1.7 GB) copied, 1.59438 seconds, 1.1 GB/s
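
The byte count dd reports is just count multiplied by bs, which is easy to check before pointing dd at a device. The sketch below only does the multiplication and never touches a disk; bytes_written is a hypothetical helper name.

```shell
#!/bin/sh
# bytes_written COUNT BS
# Prints how many bytes dd would write for the given count and block size.
# Pure arithmetic; it does not run dd.
bytes_written() {
    echo $(( $1 * $2 ))
}

bytes_written 204800 8192   # prints 1677721600
```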
3. ocssd.log will show the vote disk corruption being detected.
2010-09-15 15:26:04.272: [    CSSD][1332816192]clssnmvDiskKillCheck: voting disk corrupted (0x00000000,0x00000000) (ORCL:CLUS1)
2010-09-15 15:26:04.272: [ CSSD][1332816192]clssnmvDiskAvailabilityChange: voting file ORCL:CLUS1 now offline
2010-09-15 15:26:04.364: [ CLSF][1322326336]Closing handle:0x126d1f50
2010-09-15 15:26:04.364: [ SKGFD][1322326336]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x12927d50 for
disk :ORCL:CLUS1:
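
To pull the affected voting files out of a long ocssd.log, something like the following could be used. It is a sketch that assumes the message format shown above; offline_voting_files is a hypothetical helper name.

```shell
#!/bin/sh
# offline_voting_files
# Reads ocssd.log text on stdin and prints the name of each voting file
# reported as "now offline", one per line, without duplicates.
offline_voting_files() {
    sed -n 's/.*clssnmvDiskAvailabilityChange: voting file \(.*\) now offline.*/\1/p' | sort -u
}

# Usage (the log location depends on the installation):
#   offline_voting_files < ocssd.log
```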
4. Querying the vote disks is still possible, whereas ocrcheck fails
crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 34845aa0b29d4f36bf8743e3506eba12 (ORCL:CLUS2) [CLUSTERDG]
2. ONLINE 1200c8daed494fbabf6f64ee6e07fde8 (ORCL:CLUS3) [CLUSTERDG]
3. ONLINE 98557cefaca24fcdbf8807f3dd1fbd29 (ORCL:CLUS1) [CLUSTERDG]
Located 3 voting disk(s).

# ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
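
When this check is scripted, the error codes can be picked out of the ocrcheck output. A small sketch, assuming the codes appear at the start of a line as they do above; ocr_error_codes is a hypothetical helper name.

```shell
#!/bin/sh
# ocr_error_codes
# Reads ocrcheck output on stdin and prints any PROT-/PROC- error codes
# found at the start of a line, one per line.
ocr_error_codes() {
    sed -n 's/^\(PRO[TC]-[0-9][0-9]*\):.*/\1/p'
}

# Usage:
#   ocrcheck 2>&1 | ocr_error_codes
```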
5. Stop the cluster and crs services
# crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.crsd' on 'hpc1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'hpc1'
CRS-2673: Attempting to stop 'ora.CLUSTERDG.dg' on 'hpc1'
CRS-2673: Attempting to stop 'ora.clusdb.db' on 'hpc1'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'hpc1'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'hpc1'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'hpc1'
CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.hpc1.vip' on 'hpc1'
CRS-2677: Stop of 'ora.scan1.vip' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.hpc1.vip' on 'hpc1' succeeded
CRS-4549: Unexpected disconnect while executing shutdown request.
CRS-2675: Stop of 'ora.crsd' on 'hpc1' failed
CRS-2679: Attempting to clean 'ora.crsd' on 'hpc1'
CRS-4548: Unable to connect to CRSD
CRS-2678: 'ora.crsd' on 'hpc1' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.

CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has failed
CRS-4687: Shutdown command has completed with error(s).
CRS-4000: Command Stop failed, or completed with errors.

# crsctl stop cluster
CRS-2673: Attempting to stop 'ora.crsd' on 'hpc1'
CRS-4548: Unable to connect to CRSD
CRS-2675: Stop of 'ora.crsd' on 'hpc1' failed
CRS-2679: Attempting to clean 'ora.crsd' on 'hpc1'
CRS-4548: Unable to connect to CRSD
CRS-2678: 'ora.crsd' on 'hpc1' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
CRS-4000: Command Stop failed, or completed with errors.
Some of the components fail to shut down. Use the force option to stop the crs; even then the ASM instance and the database instance will remain open and must be shut down manually by logging in as sys (as sysdba for the database, as sysasm for ASM).
# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.evmd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc1'
CRS-2677: Stop of 'ora.cssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc1'
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc1'
CRS-2677: Stop of 'ora.gipcd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
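
If an ASM or database instance is still running after the forced stop, the manual shutdown could be scripted roughly as below. This is only a sketch: the shutdown mode (abort) is an assumption, since the post does not say which mode was used, and shutdown_instance, DRY_RUN and the instance names in the usage lines are hypothetical.

```shell
#!/bin/sh
# shutdown_instance SID ROLE
# Shuts down the local instance named SID, connecting with ROLE
# (sysdba for a database instance, sysasm for an ASM instance).
# "shutdown abort" is an assumed mode, not taken from the post.
# With DRY_RUN=1 the sqlplus input is printed instead of executed.
shutdown_instance() {
    sid=$1
    role=$2
    script=$(printf 'connect / as %s\nshutdown abort\nexit\n' "$role")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ORACLE_SID=$sid sqlplus /nolog"
        echo "$script"
    else
        echo "$script" | ORACLE_SID=$sid sqlplus /nolog
    fi
}

# Usage (instance names are examples; set ORACLE_HOME and PATH for each home first):
#   shutdown_instance clusdb1 sysdba
#   shutdown_instance +ASM1 sysasm
```

Running it with DRY_RUN=1 first shows exactly what would be sent to sqlplus.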
6. Start the crs on one node in exclusive mode.
# crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.gipcd' on 'hpc1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'hpc1'
CRS-2676: Start of 'ora.gipcd' on 'hpc1' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'hpc1'
CRS-2676: Start of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'hpc1'
CRS-2676: Start of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'hpc1'
CRS-2679: Attempting to clean 'ora.diskmon' on 'hpc1'
CRS-2681: Clean of 'ora.diskmon' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.diskmon' on 'hpc1'
CRS-2676: Start of 'ora.diskmon' on 'hpc1' succeeded
CRS-2676: Start of 'ora.cssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'hpc1'
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'hpc1'
CRS-2676: Start of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'hpc1'
CRS-2676: Start of 'ora.asm' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'hpc1'
CRS-2676: Start of 'ora.crsd' on 'hpc1' succeeded
Stop crsd if it has started
crsctl stop resource ora.crsd -init
ocssd.log will show that only one vote disk is detected
2010-09-15 15:34:19.853: [    CLSF][1148365120]Opened hdl:0x13512f20 for dev:ORCL:CLUS3:
2010-09-15 15:34:19.865: [ CSSD][1148365120]clssnmvDiskVerify: Successful discovery for disk ORCL:CLUS3, UID 1200c8da-ed494fba-bf6f64ee-6e07fde8, Pending CIN 0:1284559165:0, Committed CIN 0:1284559165:0
2010-09-15 15:34:19.865: [ CLSF][1148365120]Closing handle:0x13512f20
2010-09-15 15:34:19.865: [ SKGFD][1148365120]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x1350a590 for
disk :ORCL:CLUS3:

2010-09-15 15:34:19.865: [ CSSD][1148365120]clssnmvDiskVerify: file is not a voting file, cannot recognize on-disk signature for a voting
2010-09-15 15:34:19.865: [ CSSD][1148365120]clssnmvDiskVerify: file is not a voting file, cannot recognize on-disk signature for a voting
2010-09-15 15:34:19.865: [ CSSD][1148365120]clssnmvDiskVerify: file is not a voting file, cannot recognize on-disk signature for a voting
2010-09-15 15:34:19.865: [ CSSD][1148365120]clssnmvDiskVerify: file is not a voting file, cannot recognize on-disk signature for a voting
2010-09-15 15:34:19.865: [ CSSD][1148365120]clssnmvDiskVerify: file is not a voting file, cannot recognize on-disk signature for a voting
2010-09-15 15:34:19.865: [ CSSD][1148365120]clssnmvDiskVerify: file is not a voting file, cannot recognize on-disk signature for a voting
2010-09-15 15:34:19.865: [ CSSD][1148365120]clssnmvDiskVerify: Successful discovery of 1 disks
2010-09-15 15:34:19.865: [ CSSD][1148365120]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
7. At this point all the diskgroups will be in the dismounted state, and trying to mount them or drop them with the force option will result in an error.
SQL> select name,state from v$asm_diskgroup;
NAME STATE
--------------- ---------------
CLUSTERDG DISMOUNTED
DATA DISMOUNTED
FLASH DISMOUNTED

SQL> alter diskgroup clusterdg mount force;
alter diskgroup clusterdg mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15017: diskgroup "CLUSTERDG" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup
"CLUSTERDG"
11g has a new ASM command option that allows a diskgroup to be dropped while it is not mounted. But this too fails because the disk group contains vote disks.
SQL> drop diskgroup clusterdg force including contents;
drop diskgroup clusterdg force including contents
*
ERROR at line 1:
ORA-15039: diskgroup not dropped
ORA-15276: ASM diskgroup CLUSTERDG has cluster voting files
Looking up the ORA-15276 error code shows
oerr ora 15276
15276, 00000, "ASM diskgroup %s has cluster voting files"
// *Cause: An attempt was made to drop a diskgroup that contained cluster
// voting files.
// *Action: Move the cluster voting files out of the diskgroup and retry the
// operation.
But trying to move the vote disks will also fail because the crs is unavailable
crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. OFFLINE 6c23b17c25c34fb2bf02b3d6979a3e29 () []
2. OFFLINE 34845aa0b29d4f36bf8743e3506eba12 () []
3. ONLINE 1200c8daed494fbabf6f64ee6e07fde8 (ORCL:CLUS3) [CLUSTERDG]
4. OFFLINE 98557cefaca24fcdbf8807f3dd1fbd29 () []
Located 4 voting disk(s).
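
A quick way to summarise a listing like this is to count the voting files per state. The sketch below assumes the column layout shown above (row number, state, id, ...); votedisk_states is a hypothetical helper name.

```shell
#!/bin/sh
# votedisk_states
# Reads `crsctl query css votedisk` output on stdin and prints
# "<STATE> <count>" for each state seen, sorted by state name.
votedisk_states() {
    awk '$1 ~ /^[0-9]+\.$/ { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage:
#   crsctl query css votedisk | votedisk_states
```

For the listing above this would report three OFFLINE entries and one ONLINE entry.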

crsctl replace votedisk /dev/sdc6
Oracle Cluster Registry initialization failed accessing Oracle Cluster Registry device: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=6, opn=kgfo, dep=0, loc=kgfoCkMt03
diskgroup CLUSTERDG not mounted ()
] [6]
CRS-4000: Command Replace failed, or completed with errors.
The solution is to restore the OCR first and have the crs up and running before dealing with the vote disks.

8. Repair the failed disks and create a new disk group. Add the surviving disk to the new disk group with the force option.
# /etc/init.d/oracleasm deletedisk clus1
Removing ASM disk "clus1": [ OK ]
# /etc/init.d/oracleasm deletedisk clus2
Removing ASM disk "clus2": [ OK ]

# /etc/init.d/oracleasm createdisk clus1 /dev/sdc2
Marking disk "clus1" as an ASM disk: [ OK ]
# /etc/init.d/oracleasm createdisk clus2 /dev/sdc3
Marking disk "clus2" as an ASM disk: [ OK ]

SQL> create diskgroup clusterdgbk disk 'ORCL:CLUS1' DISK 'ORCL:CLUS2' DISK 'ORCL:CLUS3' force attribute 'compatible.asm'='11.2';

Diskgroup created.

SQL> select name,state from v$asm_diskgroup;

NAME STATE
--------------- --------
CLUSTERDG MOUNTED
9. Trying to replace the vote disks still fails because crs is not available
crsctl replace votedisk +clusterdgbk
Oracle Cluster Registry initialization failed accessing Oracle Cluster Registry device: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=6, opn=kgfo, dep=0, loc=kgfoCkMt03
diskgroup CLUSTERDG not mounted ()
] [6]
CRS-4000: Command Replace failed, or completed with errors.
Restoring the OCR would also fail because it is still looking for the old diskgroup name.
/ocrconfig -restore /opt/app/11.2.0/grid/cdata/hpc-cluster/backup_20100915_151240.ocr
PROT-16: Internal Error
Recreate the diskgroup with the same name and restore the OCR from a backup file
SQL> alter diskgroup clusterdgbk dismount;

Diskgroup altered.

SQL> create diskgroup clusterdg disk 'ORCL:CLUS1' force disk 'ORCL:CLUS2' force DISK 'ORCL:CLUS3' force attribute 'compatible.asm'='11.2';

Diskgroup created.

./ocrconfig -restore /opt/app/11.2.0/grid/cdata/hpc-cluster/backup_20100915_151240.ocr
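
Picking the backup file can be scripted if the backup directory is known. The sketch below simply returns the most recently modified .ocr file in a directory; latest_ocr_backup is a hypothetical helper name, and the newest file is not necessarily the right one, so check that it predates the failure.

```shell
#!/bin/sh
# latest_ocr_backup DIR
# Prints the most recently modified *.ocr file in DIR.
# Returns non-zero if DIR contains no *.ocr file.
latest_ocr_backup() {
    newest=$(ls -t "$1"/*.ocr 2>/dev/null | head -n 1)
    [ -n "$newest" ] && echo "$newest"
}

# Usage, with the backup directory used in this post:
#   ./ocrconfig -restore "$(latest_ocr_backup /opt/app/11.2.0/grid/cdata/hpc-cluster)"
```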
10. Restoring the vote disks still fails because only the OCR has been restored so far. Stop the crs, start it again in exclusive mode and then restore the vote disks.
crsctl replace votedisk +clusterdg
Failed to create voting files on disk group clusterdg.
Change to configuration failed, but was successfully rolled back.
CRS-4000: Command Replace failed, or completed with errors.

crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc1'
CRS-2677: Stop of 'ora.cssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc1'
CRS-2677: Stop of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc1'
CRS-2677: Stop of 'ora.gipcd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has completed
CRS-4133: Oracle High Availability Services has been stopped.

crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.gipcd' on 'hpc1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'hpc1'
CRS-2676: Start of 'ora.gipcd' on 'hpc1' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'hpc1'
CRS-2676: Start of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'hpc1'
CRS-2676: Start of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'hpc1'
CRS-2679: Attempting to clean 'ora.diskmon' on 'hpc1'
CRS-2681: Clean of 'ora.diskmon' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.diskmon' on 'hpc1'
CRS-2676: Start of 'ora.diskmon' on 'hpc1' succeeded
CRS-2676: Start of 'ora.cssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'hpc1'
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'hpc1'
CRS-2676: Start of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'hpc1'
CRS-2676: Start of 'ora.asm' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'hpc1'
CRS-2676: Start of 'ora.crsd' on 'hpc1' succeeded

crsctl replace votedisk +clusterdg
Successful addition of voting disk 53dd1707604f4fc5bf910fc59bd857f8
Successful addition of voting disk 7496834c116f4f53bf72b8aa726a8ede
Successful addition of voting disk 9c6c24ea6cd74f66bf616d7624b856af
Successfully replaced voting disk group with +clusterdg.
CRS-4266: Voting file(s) successfully replaced

crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 53dd1707604f4fc5bf910fc59bd857f8 (ORCL:CLUS1) [CLUSTERDG]
2. ONLINE 7496834c116f4f53bf72b8aa726a8ede (ORCL:CLUS2) [CLUSTERDG]
3. ONLINE 9c6c24ea6cd74f66bf616d7624b856af (ORCL:CLUS3) [CLUSTERDG]
Located 3 voting disk(s).

ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 2588
Available space (kbytes) : 259532
ID : 2002737697
Device/File Name : +CLUSTERDG
Device/File integrity check succeeded
Device/File not configured

Cluster registry integrity check succeeded
Logical corruption check succeeded
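
The final ocrcheck can be turned into a pass/fail test. A minimal sketch, assuming ocrcheck is run with enough privilege for the logical corruption check to be performed; ocr_is_healthy is a hypothetical helper name.

```shell
#!/bin/sh
# ocr_is_healthy
# Reads ocrcheck output on stdin; succeeds only if both the registry
# integrity check and the logical corruption check report success.
ocr_is_healthy() {
    out=$(cat)
    echo "$out" | grep -q 'Cluster registry integrity check succeeded' &&
        echo "$out" | grep -q 'Logical corruption check succeeded'
}

# Usage:
#   ocrcheck | ocr_is_healthy && echo "OCR looks fine"
```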
11. Stop the crs and start crs in normal mode on all nodes.
crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.crsd' on 'hpc1'
CRS-2677: Stop of 'ora.crsd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc1'
CRS-2677: Stop of 'ora.cssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc1'
CRS-2677: Stop of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc1'
CRS-2677: Stop of 'ora.gipcd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has completed
CRS-4133: Oracle High Availability Services has been stopped.

crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
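
The order of operations matters more than any single command here, so a dry-run recap may help. The sketch below only prints the sequence followed in this post and executes nothing; recovery_plan is a hypothetical helper name, and the diskgroup recreation is left as a reminder line because its exact form depends on the state of the disks.

```shell
#!/bin/sh
# recovery_plan BACKUP_FILE
# Prints, in order, the commands used in this walkthrough.
# Nothing is executed; this is a checklist, not an automation script.
recovery_plan() {
    backup=$1
    cat <<EOF
crsctl stop crs -f
crsctl start crs -excl
crsctl stop resource ora.crsd -init
(sqlplus as sysasm) recreate diskgroup clusterdg, see step 9
./ocrconfig -restore $backup
crsctl stop crs -f
crsctl start crs -excl
crsctl replace votedisk +clusterdg
crsctl stop crs -f
crsctl start crs
EOF
}

# Usage:
#   recovery_plan /opt/app/11.2.0/grid/cdata/hpc-cluster/backup_20100915_151240.ocr
```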
Useful Metalink note
How to restore ASM based OCR after complete loss of the CRS diskgroup on Linux/Unix systems [ID 1062983.1]