It's a new installation of 11gR2 grid infrastructure (only GI was installed; no RAC database was created, as installing GI alone was sufficient to test this scenario), and therefore the ASM spfile is also inside the same diskgroup as the clusterware files. Metalink note 1082943.1 explains how to move it to another diskgroup. If not, a pfile should be created from the spfile so the spfile can be re-created after the OCR and vote disk restore.
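A minimal sketch of taking that precaution before any failure, creating a pfile backup of the ASM spfile (the same path is used in the restore at the end of this post):
$ sqlplus / as sysasm
SQL> create pfile='/home/oracle/asmpfile.ora' from spfile;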
Scenario 3.
1. Both OCR and Vote disks are in ASM diskgroup
2. ASM diskgroup has normal redundancy with only three failure groups
3. All failure groups are affected
4. ASM Spfile is also located in the same diskgroup where clusterware files are located.
1. Current OCR, vote disk and ASM Spfile configuration
# ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 2272
Available space (kbytes) : 259848
ID : 1242190491
Device/File Name : +clusterdg
Device/File integrity check succeeded
Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check succeeded
$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 7d49533611734f3bbf404d32f1759ed5 (ORCL:CLUS1) [CLUSTERDG]
2. ONLINE 4a8c288d1ade4f8cbf6588c145b27489 (ORCL:CLUS2) [CLUSTERDG]
3. ONLINE ad241f9823cd4fb9bf3412ca67e591df (ORCL:CLUS3) [CLUSTERDG]
Located 3 voting disk(s).
SQL> show parameter spfile
NAME TYPE VALUE
----------- ---------- -----
spfile string +CLUSTERDG/hpc-cluster/asmparameterfile/registry.253.730565167
srvctl config asm -a
ASM home: /opt/app/11.2.0/grid
ASM listener: LISTENER
ASM is enabled.
2. Identify the disks belonging to the ASM diskgroup using oracleasm querydisk -p and corrupt them to simulate disk failure
# dd if=/dev/zero of=/dev/sdc10 count=204800 bs=8192
204800+0 records in
204800+0 records out
1677721600 bytes (1.7 GB) copied, 1.61087 seconds, 1.0 GB/s
# dd if=/dev/zero of=/dev/sdc3 count=204800 bs=8192
204800+0 records in
204800+0 records out
1677721600 bytes (1.7 GB) copied, 1.6883 seconds, 994 MB/s
# dd if=/dev/zero of=/dev/sdc2 count=204800 bs=8192
204800+0 records in
204800+0 records out
1677721600 bytes (1.7 GB) copied, 1.74684 seconds, 960 MB/s
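The ASM-disk-to-device mapping used for the dd commands above can be listed with oracleasm querydisk -p; a sketch for one disk (the exact output format may vary with the ASMLib version):
# oracleasm querydisk -p CLUS1
Disk "CLUS1" is a valid ASM disk
/dev/sdc10: LABEL="CLUS1" TYPE="oracleasm"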
3. ocssd.log will show the detection of the vote disk corruption
2010-09-24 15:46:02.893: [ CSSD][1136630080]clssgmDestroyProc: cleaning up proc(0x2aaab02131b0) con(0x8db0) skgpid ospid 12671 with 0 clients, refcount 0
2010-09-24 15:46:02.893: [ CSSD][1136630080]clssgmDiscEndpcl: gipcDestroy 0x8db0
2010-09-24 15:46:03.009: [ CSSD][1231038784]clssnmvDiskKillCheck: voting disk corrupted (0x00000000,0x00000000) (ORCL:CLUS1)
2010-09-24 15:46:03.009: [ CSSD][1231038784]clssnmvDiskAvailabilityChange: voting file ORCL:CLUS1 now offline
2010-09-24 15:46:03.584: [ CLSF][1241528640]Closing handle:0x2aaab008cbe0
...
2010-09-24 15:46:03.584: [ SKGFD][1241528640]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x2aaab0197570 for disk :ORCL:CLUS1:
2010-09-24 15:46:13.269: [ CSSD][1168099648]clssnmvDiskKillCheck: voting disk corrupted (0x00000000,0x00000000) (ORCL:CLUS3)
2010-09-24 15:46:13.269: [ CSSD][1168099648]clssnmvDiskAvailabilityChange: voting file ORCL:CLUS3 now offline
2010-09-24 15:46:13.431: [ CLSF][1157609792]Closing handle:0x2aaab006ac20
2010-09-24 15:46:13.431: [ SKGFD][1157609792]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x2aaab0062830 for disk :ORCL:CLUS3:
Querying the vote disks frequently makes it possible to spot the state change.
$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. PENDOFFL 7d49533611734f3bbf404d32f1759ed5 (ORCL:CLUS1) [CLUSTERDG]
2. ONLINE 4a8c288d1ade4f8cbf6588c145b27489 (ORCL:CLUS2) [CLUSTERDG]
3. ONLINE ad241f9823cd4fb9bf3412ca67e591df (ORCL:CLUS3) [CLUSTERDG]
Located 3 voting disk(s).
$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 7d49533611734f3bbf404d32f1759ed5 (ORCL:CLUS1) [CLUSTERDG]
2. ONLINE 4a8c288d1ade4f8cbf6588c145b27489 (ORCL:CLUS2) [CLUSTERDG]
3. ONLINE ad241f9823cd4fb9bf3412ca67e591df (ORCL:CLUS3) [CLUSTERDG]
Located 3 voting disk(s).
crs_stat (deprecated in 11gR2) shows the cluster applications are still running, but ocrcheck fails.
crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....ERDG.dg ora....up.type ONLINE ONLINE hpc1
ora....ER.lsnr ora....er.type ONLINE ONLINE hpc1
ora....N1.lsnr ora....er.type OFFLINE OFFLINE
ora.asm ora.asm.type ONLINE ONLINE hpc1
ora.eons ora.eons.type ONLINE ONLINE hpc1
ora.gsd ora.gsd.type OFFLINE OFFLINE
ora....SM1.asm application ONLINE ONLINE hpc1
ora....C1.lsnr application ONLINE ONLINE hpc1
ora.hpc1.gsd application OFFLINE OFFLINE
ora.hpc1.ons application ONLINE OFFLINE
ora.hpc1.vip ora....t1.type ONLINE ONLINE hpc1
ora....network ora....rk.type ONLINE ONLINE hpc1
ora.oc4j ora.oc4j.type OFFLINE OFFLINE
ora.ons ora.ons.type ONLINE OFFLINE
ora....ry.acfs ora....fs.type ONLINE ONLINE hpc1
ora.scan1.vip ora....ip.type OFFLINE OFFLINE
ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage
4. Stop the crs on all nodes and start crs in exclusive mode on one node. A manual shutdown of the ASM instance and database instance might be required if the stop command is unable to complete these operations.
crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.crsd' on 'hpc1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.CLUSTERDG.dg' on 'hpc1'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'hpc1'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'hpc1'
CRS-2677: Stop of 'ora.registry.acfs' on 'hpc1' succeeded
CRS-4549: Unexpected disconnect while executing shutdown request.
CRS-2677: Stop of 'ora.crsd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.evmd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc1'
CRS-2677: Stop of 'ora.cssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc1'
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc1'
CRS-2677: Stop of 'ora.gipcd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
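If the stop hangs on ora.asm or a database resource, those instances can be shut down manually before retrying; a sketch for the ASM instance:
$ sqlplus / as sysasm
SQL> shutdown abort;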
crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.gipcd' on 'hpc1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'hpc1'
CRS-2676: Start of 'ora.gipcd' on 'hpc1' succeeded
CRS-2676: Start of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'hpc1'
CRS-2676: Start of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'hpc1'
CRS-2676: Start of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'hpc1'
CRS-2679: Attempting to clean 'ora.diskmon' on 'hpc1'
CRS-2681: Clean of 'ora.diskmon' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.diskmon' on 'hpc1'
CRS-2676: Start of 'ora.diskmon' on 'hpc1' succeeded
CRS-2676: Start of 'ora.cssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.ctssd' on 'hpc1'
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'hpc1'
CRS-2676: Start of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'hpc1'
CRS-2676: Start of 'ora.asm' on 'hpc1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'hpc1'
CRS-2676: Start of 'ora.crsd' on 'hpc1' succeeded
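While running in exclusive mode, the state of the lower stack can be confirmed through the -init resources; a sketch (output omitted here):
$ crsctl status resource -t -init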
Various log files will show the status of the vote disks and OCR disks. ocssd.log:
2010-09-24 15:53:43.643: [ CSSD][1147734336]clssnmvDiskVerify: Successful discovery of 0 disks
2010-09-24 15:53:43.643: [ CSSD][1147734336]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-09-24 15:53:43.643: [ CSSD][1147734336]clssnmvFindInitialConfigs: No voting files found
2010-09-24 15:53:43.644: [ CSSD][1147734336]clssnmCompleteVFDiscovery: Completing voting file discovery
2010-09-24 15:53:43.644: [ CSSD][1147734336]clssnmvVerifyCommittedConfigVFs: Insufficient voting files found, found 0 of 0 configured, needed 1 voting files
crsd.log:
2010-09-24 15:54:30.723: [ OCRASM][21660240]proprasmo: kgfoCheckMount returned [6]
2010-09-24 15:54:30.723: [ OCRASM][21660240]proprasmo: The ASM disk group clusterdg is not found or not mounted
2010-09-24 15:54:30.724: [ OCRRAW][21660240]proprioo: Failed to open [+clusterdg]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-09-24 15:54:30.724: [ OCRRAW][21660240]proprioo: No OCR/OLR devices are usable
2010-09-24 15:54:30.724: [ OCRASM][21660240]proprasmcl: asmhandle is NULL
2010-09-24 15:54:30.724: [ OCRRAW][21660240]proprinit: Could not open raw device
2010-09-24 15:54:30.724: [ OCRASM][21660240]proprasmcl: asmhandle is NULL
2010-09-24 15:54:30.724: [ OCRAPI][21660240]a_init:16!: Backend init unsuccessful : [26]
2010-09-24 15:54:30.724: [ CRSOCR][21660240] OCR context init failure. Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +CLUSTERDG.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +CLUSTERDG.255.4294967295
ORA-15001: disk] [8]
2010-09-24 15:54:30.724: [ CRSD][21660240][PANIC] CRSD exiting: Could not init OCR, code: 26
2010-09-24 15:54:30.724: [ CRSD][21660240] Done.
ASM alert log shows why the diskgroup containing the vote disks and OCR wasn't mounted.
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent */
Diskgroup used for OCR is:CLUSTERDG
NOTE: cache registered group CLUSTERDG number=1 incarn=0x6ed824a0
NOTE: cache began mount (first) of group CLUSTERDG number=1 incarn=0x6ed824a0
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
ERROR: no PST quorum in group: required 2, found 0
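The wiped disk headers can be cross-checked with kfed, which ships in the grid home; a sketch against one of the corrupted devices (a zeroed header reads back as KFBTYP_INVALID):
# kfed read /dev/sdc10 | grep kfbh.type
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID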
5. There's no spfile for ASM, but the instance will be up.
SQL> show parameter spfile
NAME TYPE VALUE
------- --------- ------
spfile string
6. To test various replace and repair scenarios, a diskgroup was created with a different name. The Oracle Clusterware Admin Guide states: "If the original OCR location does not exist, then you must create an empty (0 byte) OCR location before you run the ocrconfig -add or ocrconfig -replace commands. The OCR location that you are replacing can be either online or offline."
The replace and repair options could be used to replace the current OCR location, or to add to and replace existing OCR locations. The Clusterware Admin Guide also states: "You cannot repair the OCR on a node on which Oracle Clusterware is running. If you run the ocrconfig -add | -repair | -replace command, then the device, file, or Oracle ASM disk group that you are adding must be accessible. This means that a device must exist. You must create an empty (0 byte) OCR location, or the Oracle ASM disk group must exist and be mounted."
There's also the question of mounting the ASM diskgroup while CRS is down. If you try to manually start the ASM instance that is part of a cluster, you'd get
sqlplus / as sysasm
SQL> startup
ORA-01078: failure in processing system parameters
ORA-29701: unable to connect to Cluster Synchronization Service
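As an aside, the guide's "empty (0 byte) OCR location" precondition is simple to meet for a file or block device location, e.g. (hypothetical path and ownership shown):
# touch /u01/ocr/ocr_new.loc
# chown root:oinstall /u01/ocr/ocr_new.loc
For an ASM location the equivalent precondition is a mounted diskgroup, which is exactly what is unavailable here.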
All of the replace and repair options were useless and failed.
# ocrconfig -restore /opt/app/11.2.0/grid/cdata/hpc-cluster/backup_20100924_142540.ocr
PROT-16: Internal Error
# ocrconfig -replace +clusterdg -replacement +clusterdgbk
PROT-28: Cannot delete or replace the only configured Oracle Cluster Registry location
# ocrconfig -repair -replace +clusterdg -replacement +clusterdgbk
PROT-21: Invalid parameter
# ocrconfig -repair -replace +clusterdg -replacement +clusterdgbk
PROT-21: Invalid parameter
# ocrconfig -repair -add +clusterdgbk
PROT-21: Invalid parameter
# ocrconfig -add +clusterdgbk
PROT-1: Failed to initialize ocrconfig
Even creating a diskgroup with the original name, and another one with a new name, and then trying to repair/replace also failed.
SQL> create diskgroup clusterdg disk 'ORCL:CLUS1' disk 'ORCL:CLUS2' disk 'ORCL:CLUS3' attribute 'compatible.asm'='11.2';
Diskgroup created.
SQL> create diskgroup clusterdg2 disk 'ORCL:RED1' disk 'ORCL:RED2' disk 'ORCL:RED3' attribute 'compatible.asm'='11.2';
Diskgroup created.
# ocrconfig -repair -replace +clusterdg -replacement +clusterdg2
PROT-21: Invalid parameter
7. Finally, created a diskgroup with the same name as the original and did the OCR restore.
SQL> create diskgroup clusterdg disk 'ORCL:RED1' disk 'ORCL:RED2' disk 'ORCL:RED3' attribute 'compatible.asm'='11.2';
ocrconfig -restore /opt/app/11.2.0/grid/cdata/hpc-cluster/backup_20100924_142540.ocr
Unlike the previous scenario, no CRS restart was required to restore the vote disks; it was possible to restore them straight after restoring the OCR.
# crsctl query css votedisk
Located 0 voting disk(s).
# crsctl replace votedisk +clusterdg
Successful addition of voting disk 92814292a2ec4fa7bf2d6fe10960cc55
Successful addition of voting disk 57b152766d5f4f3dbf2935c93556b7f5
Successful addition of voting disk 6120792eb1284f64bf958f13a4947ece
Successfully replaced voting disk group with +clusterdg.
CRS-4266: Voting file(s) successfully replaced
8. Stop the crs on the node and start crs on all nodes.
crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hpc1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hpc1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.asm' on 'hpc1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'hpc1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hpc1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.asm' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'hpc1'
CRS-2677: Stop of 'ora.cssd' on 'hpc1' succeeded
CRS-2673: Attempting to stop 'ora.diskmon' on 'hpc1'
CRS-2673: Attempting to stop 'ora.gipcd' on 'hpc1'
CRS-2677: Stop of 'ora.gipcd' on 'hpc1' succeeded
CRS-2677: Stop of 'ora.diskmon' on 'hpc1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hpc1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
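Once CRS is back up, the stack can be verified cluster-wide; a sketch:
$ crsctl check cluster -all
**************************************************************
hpc1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************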
9. The last step in this scenario is to re-create the ASM spfile. When the ASM instance is started with an spfile or a pfile, the ASM alert log shows the file location. For an spfile:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
Using parameter settings in server-side spfile +CLUSTERDG/hpc-cluster/asmparameterfile/registry.253.730565167
System parameters with non-default values:
large_pool_size = 12M
instance_type = "asm"
remote_login_passwordfile= "EXCLUSIVE"
asm_diskstring = "ORCL:CLUS*"
asm_power_limit = 1
diagnostic_dest = "/opt/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
For a pfile:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
Using parameter settings in client-side pfile /opt/app/11.2.0/grid/dbs/init+ASM1.ora on machine hpc1
System parameters with non-default values:
large_pool_size = 12M
instance_type = "asm"
remote_login_passwordfile= "EXCLUSIVE"
asm_diskstring = "ORCL:CLUS*"
asm_power_limit = 1
diagnostic_dest = "/opt/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
But after the corruption of all the disks in the ASM diskgroup there won't be a server-side spfile, and even when there's no client-side pfile the ASM instance still starts. The ASM alert log shows that it is using default parameter settings without a parameter file.
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
WARNING: using default parameter settings without any parameter file
Cluster communication is configured to use the following interface(s) for this instance
These default parameters cannot be treated as an spfile or a pfile; trying to create one from them would fail.
SQL> create pfile='/home/oracle/pfile.ora' from spfile;
create pfile='/home/oracle/pfile.ora' from spfile
*
ERROR at line 1:
ORA-01565: error in identifying file '?/dbs/spfile@.ora'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
SQL> create spfile from pfile;
create spfile from pfile
*
ERROR at line 1:
ORA-17502: ksfdcre:4 Failed to create file
+CLUSTERDG/hpc-cluster/asmparameterfile/registry.253.730565167
ORA-15177: cannot operate on system aliases
But a pfile could be created from these in-memory values using
SQL> create pfile='/home/oracle/pfile.ora' from memory;
File created.
Or, if a pfile is available from before the diskgroup corruption, it could be used to restore the ASM spfile.
SQL> create spfile='+clusterdg' from pfile='/home/oracle/asmpfile.ora';
File created.
The last command above creates a new spfile with a different alias than the original, and the ASM instance will use that during startup.
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
Using parameter settings in server-side spfile +CLUSTERDG/hpc-cluster/asmparameterfile/registry.253.730908391
System parameters with non-default values:
large_pool_size = 12M
instance_type = "asm"
remote_login_passwordfile= "EXCLUSIVE"
asm_diskstring = "ORCL:CLUS*"
asm_power_limit = 1
diagnostic_dest = "/opt/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
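The spfile location registered in the GPnP profile can also be checked with asmcmd; a sketch (spget is available in 11.2):
$ asmcmd spget
+CLUSTERDG/hpc-cluster/asmparameterfile/registry.253.730908391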
Useful Metalink note:
How to restore ASM based OCR after complete loss of the CRS diskgroup on Linux/Unix systems [ID 1062983.1]