Scenario 1.
1. Both OCR and Vote disks are in ASM diskgroup
2. ASM diskgroup has normal redundancy with only three failure groups
3. Only one failure group is affected
This scenario is pretty much similar to OCR and Vote disk failure due to one failure group posted earlier. There's no actual restoring of anything, simply repairing the failed diskgroup would suffice. Even dropping the failed disk would allow the cluster to run normally even restart without any problem.
Due to a bug (Bug 7225720) it may not be possible to delete the ASM disk using oracleasm (metalink notes 402526.1 and 1126996.1). Only way to do delete it using oracleasm is to unmount the diskgroup which is not possible since cluster depend on this diskgroup. Therefore a clusterwide shutdown and restart would be required. Unless the disk was dropped from the diskgroup first the diskgroup will not mount during the restart and since diskgroup doesn't mount cluster will not start.
1. Current clusterware file configuration
ocrcheck2. Disk failure is simulated with a corruption
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 148348
Used space (kbytes) : 4580
Available space (kbytes) : 143768
ID : 552644455
Device/File Name : +clusterdg
Device/File integrity check succeeded
Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check succeeded
crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 1e1e904b27c74f6abf4256e67c274824 (ORCL:CLUS1) [CLUSTERDG]
2. ONLINE 30e33d88487a4f4cbf68242bd8a06726 (ORCL:CLUS2) [CLUSTERDG]
3. ONLINE 3fa62d01b5c04fbcbf7160d3014e762c (ORCL:CLUS3) [CLUSTERDG]
dd if=/dev/zero of=/dev/sdc2 count=204800 bs=81923. ocssd.log will list detection of the vote disk corruption
204800+0 records in
204800+0 records out
1677721600 bytes (1.7 GB) copied, 1.68371 seconds, 996 MB/s
2010-08-25 12:39:48.949: [ CSSD][1231718720]clssnmvDiskKillCheck: voting disk corrupted (0x00000000,0x00000000) (ORCL:CLUS1)ocrcheck will show the following
2010-08-25 12:39:48.949: [ CSSD][1231718720]clssnmvDiskAvailabilityChange: voting file ORCL:CLUS1 now offline
2010-08-25 12:39:48.971: [ CLSF][1242208576]Closing handle:0x2aaaac4b0550
2010-08-25 12:39:48.971: [ SKGFD][1242208576]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x2aaaac4a7f40
for disk :ORCL:CLUS1:
2010-08-25 12:39:49.003: [ CSSD][1294657856]clssnmSendingThread: sending status msg to all nodes
2010-08-25 12:39:49.003: [ CSSD][1294657856]clssnmSendingThread: sent 5 status msgs to all nodes
2010-08-25 12:39:49.681: [ CLSF][1221228864]Closing handle:0x78b3360
2010-08-25 12:39:49.681: [ SKGFD][1221228864]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x78aae00 for
disk :ORCL:CLUS1:
2010-08-25 12:39:49.950: [ CLSF][1231718720]Closing handle:0x2aaab0049050
2010-08-25 12:39:49.950: [ SKGFD][1231718720]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x2aaab00d3c40
for disk :ORCL:CLUS1:
2010-08-25 12:39:51.017: [ CSSD][1137310016]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:CLUS1 sched delay 2340 > margin 1500 cur_ms 1034186224 lastalive 1034183884
2010-08-25 12:39:53.007: [ CSSD][1137310016]clssnmvSchedDiskThreads: DiskPingThread for voting file ORCL:CLUS1 sched delay 4330 > margin 1500 cur_ms 1034188214 lastalive 1034183884
2010-08-25 12:39:53.679: [ CSSD][1221228864]clssnmvDiskOpen: Opening ORCL:CLUS1
2010-08-25 12:39:53.679: [ SKGFD][1221228864]Handle 0x2aaab00d3c40 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:CLUS1:
2010-08-25 12:39:53.679: [ CLSF][1221228864]Opened hdl:0x2aaab0192050 for dev:ORCL:CLUS1:
2010-08-25 12:39:53.685: [ CSSD][1221228864]clssnmvStatusBlkInit: myinfo nodename hpc1, uniqueness 1282729710
2010-08-25 12:39:53.685: [ CSSD][1221228864]clssnmvDiskAvailabilityChange: voting file ORCL:CLUS1 now online
2010-08-25 12:39:53.685: [ SKGFD][1242208576]Handle 0x2aaab020e470 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:CLUS1:
2010-08-25 12:39:53.685: [ CLSF][1242208576]Opened hdl:0x2aaab00e13f0 for dev:ORCL:CLUS1:
2010-08-25 12:39:53.897: [ SKGFD][1231718720]Handle 0x78aae00 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:CLUS1:
2010-08-25 12:39:53.897: [ CLSF][1231718720]Opened hdl:0x78b3360 for dev:ORCL:CLUS1:
2010-08-25 12:39:53.906: [ CSSD][1231718720]clssnmvDiskKillCheck: voting disk corrupted (0x00000000,0x00000000) (ORCL:CLUS1)
2010-08-25 12:39:53.906: [ CSSD][1231718720]clssnmvDiskAvailabilityChange: voting file ORCL:CLUS1 now offline
2010-08-25 12:39:54.003: [ CSSD][1294657856]clssnmSendingThread: sending status msg to all nodes
2010-08-25 12:39:54.003: [ CSSD][1294657856]clssnmSendingThread: sent 5 status msgs to all nodes
2010-08-25 12:39:54.687: [ CLSF][1242208576]Closing handle:0x2aaab00e13f0
2010-08-25 12:39:54.687: [ SKGFD][1242208576]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x2aaab020e470
for disk :ORCL:CLUS1:
2010-08-25 12:39:54.783: [ CSSD][1221228864]clssnmvDiskAvailabilityChange: voting file ORCL:CLUS1 now online
2010-08-25 12:39:54.783: [ SKGFD][1242208576]Handle 0x2aaab020e470 from lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: for disk :ORCL:CLUS1:
2010-08-25 12:39:54.783: [ CLSF][1242208576]Opened hdl:0x2aaab0219cf0 for dev:ORCL:CLUS1:
ocrcheckvote disk query shows
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 148348
Used space (kbytes) : 4580
Available space (kbytes) : 143768
ID : 552644455
Device/File Name : +clusterdg
Device/File integrity check failed
Device/File not configured
Cluster registry integrity check failed
Logical corruption check bypassed due to insufficient quorum
crsctl query css votedisk4. Stop the cluster and try to start it on one node and start will fail. Following from the crsd.log
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 1e1e904b27c74f6abf4256e67c274824 (ORCL:CLUS1) [CLUSTERDG]
2. ONLINE 30e33d88487a4f4cbf68242bd8a06726 (ORCL:CLUS2) [CLUSTERDG]
3. ONLINE 3fa62d01b5c04fbcbf7160d3014e762c (ORCL:CLUS3) [CLUSTERDG]
Located 3 voting disk(s).
crsctl stop crsFrom the cluster alert log
crsctl start crs
2010-08-25 12:49:48.385: [ OCRASM][862130768]proprasmo: kgfoCheckMount returned [6]
2010-08-25 12:49:48.385: [ OCRASM][862130768]proprasmo: The ASM disk group clusterdg is not found or not mounted
2010-08-25 12:49:48.385: [ OCRRAW][862130768]proprioo: Failed to open [+clusterdg]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-08-25 12:49:48.385: [ OCRRAW][862130768]proprioo: No OCR/OLR devices are usable
2010-08-25 12:49:48.385: [ OCRASM][862130768]proprasmcl: asmhandle is NULL
2010-08-25 12:49:48.385: [ OCRRAW][862130768]proprinit: Could not open raw device
2010-08-25 12:49:48.385: [ OCRASM][862130768]proprasmcl: asmhandle is NULL
2010-08-25 12:49:48.385: [ OCRAPI][862130768]a_init:16!: Backend init unsuccessful : [26]
2010-08-25 12:49:48.386: [ CRSOCR][862130768] OCR context init failure. Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +CLUSTERDG.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +CLUSTERDG.255.4294967295
ORA-15001: disk
] [8]
2010-08-25 12:49:48.386: [ CRSD][862130768][PANIC] CRSD exiting: Could not init OCR, code: 26
2010-08-25 12:49:48.386: [ CRSD][862130768] Done.
[ohasd(25408)]CRS-2765:Resource 'ora.crsd' has failed on server 'hpc1'.5. As said earlier reason for cluster not starting is diskgroup not being mounted. Querying the ASM instance for diskgroup status gives
2010-08-25 12:49:48.385
[crsd(26171)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /opt/app/11.2.0/grid/log/hpc1/crsd/crsd.log.
SQL> select name,state from v$asm_diskgroup;From the ASM alert log
NAME STATE
---------- ---------------
CLUSTERDG DISMOUNTED
DATA MOUNTED
FLASH MOUNTED
NOTE: Diskgroup used for Voting files is:6. At this stage diskgroup could be force mounted and failed disk could be force dropped and cluster could be restarted. But this would reduce the redundancy in the diskgroup. Better option would be to reapir the disk and add it to the diskgroup.
CLUSTERDG
Diskgroup used for OCR is:CLUSTERDG
..
OTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Assigning number (1,1) to disk (ORCL:CLUS2)
NOTE: Assigning number (1,2) to disk (ORCL:CLUS3)
..
NOTE: group CLUSTERDG: updated PST location: disk 0001 (PST copy 0)
NOTE: group CLUSTERDG: updated PST location: disk 0002 (PST copy 1)
NOTE: start heartbeating (grp 1)
kfdp_query(CLUSTERDG): 5
kfdp_queryBg(): 5
NOTE: group CLUSTERDG: updated PST location: disk 0001 (PST copy 0)
NOTE: group CLUSTERDG: updated PST location: disk 0002 (PST copy 1)
NOTE: Assigning number (1,0) to disk ()
kfdp_query(CLUSTERDG): 6
kfdp_queryBg(): 6
NOTE: group CLUSTERDG: updated PST location: disk 0001 (PST copy 0)
NOTE: group CLUSTERDG: updated PST location: disk 0002 (PST copy 1)
NOTE: cache dismounting (clean) group 1/0x29A4804A (CLUSTERDG)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 1/0x29A4804A (CLUSTERDG)
NOTE: cache ending mount (fail) of group CLUSTERDG number=1 incarn=0x29a4804a
kfdp_dismount(): 7
kfdp_dismountBg(): 7
NOTE: De-assigning number (1,0) from disk ()
NOTE: De-assigning number (1,1) from disk (ORCL:CLUS2)
NOTE: De-assigning number (1,2) from disk (ORCL:CLUS3)
ERROR: diskgroup CLUSTERDG was not mounted
..
WARNING: Disk Group CLUSTERDG containing configured OCR is not mounted
WARNING: Disk Group CLUSTERDG containing voting files is not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "0" is missing from group number "1"
ERROR: ALTER DISKGROUP ALL MOUNT /* asm agent */
SQL> ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent */
SUCCESS: ALTER DISKGROUP ALL ENABLE VOLUME ALL /* asm agent */
Wed Aug 25 12:49:26 2010
ASM Health Checker found 1 new failures
Wed Aug 25 12:49:28 2010
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.FLASH.dg (unable to communicate with CRSD/OHASD)
# /etc/init.d/oracleasm deletedisk clus1If the ASM has detected the failed disk and initiated dropping of the disk then following could be observed
Removing ASM disk "clus1": [ OK ]
# /etc/init.d/oracleasm createdisk clus1 /dev/sdc2
Marking disk "clus1" as an ASM disk: [ OK ]
SQL> select name from v$asm_disk;If not drop the disk using force option and add the repaired disk
NAME
----------
_DROPPED_0000_CLUSTERDG
CLUS2
CLUS3
SQL> alter diskgroup clusterdg drop disk clus1;ASM alert log will show vote disk being added to the new disk
alter diskgroup clusterdg drop disk clus1
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15084: ASM disk "CLUS1" is offline and cannot be dropped.
SQL> alter diskgroup clusterdg drop disk clus1 force;
SQL> alter diskgroup clusterdg add failgroup fail1 disk 'ORCL:CLUS1';
Wed Aug 25 12:56:17 20107. Query the vote disks
NOTE: Voting file relocation is required in diskgroup CLUSTERDG
NOTE: Attempting voting file relocation on diskgroup CLUSTERDG
NOTE: voting file allocation on grp 1 disk CLUS1
crsctl query css votedisk8. Check the ocr
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. OFFLINE 1e1e904b27c74f6abf4256e67c274824 () []
2. ONLINE 30e33d88487a4f4cbf68242bd8a06726 (ORCL:CLUS2) [CLUSTERDG]
3. ONLINE 3fa62d01b5c04fbcbf7160d3014e762c (ORCL:CLUS3) [CLUSTERDG]
4. ONLINE 91bc749112ac4fc1bf93cfaa74cffbc7 (ORCL:CLUS1) [CLUSTERDG]
ocrcheck9. Stop and start the cluster
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 148348
Used space (kbytes) : 4580
Available space (kbytes) : 143768
ID : 552644455
Device/File Name : +clusterdg
Device/File integrity check succeeded
Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check succeeded
crsctl stop crs (if needed using -f option)
crsctl start crs
During the testing once on disk was restored subsequent disks were also corrupted one at a time and restored. In some cases when ocrcheck was run right after the corruption following output could be observed
PROT-602: Failed to retrieve data from the cluster registryinstead of Logical corruption check bypassed due to insufficient quorum and cluster would terminate itself. Same restore steps could applied to bring up the cluster.
PROC-26: Error while accessing the physical storage