Sunday, December 19, 2010

Diskmon related CRS startup Issue

Running root.sh on after a clusterware upgrade does not start all the cluster service but time out at EVMD. Looking at the ocssd.log following could be observed
[    CSSD]2010-12-17 11:14:39.774 [1223936320] >TRACE:   clssgmExecuteClientRequest: GRKJOIN recvd from client 60 (0x2aaaac1ddbd0)
[ CSSD]2010-12-17 11:14:39.774 [1223936320] >TRACE: clssgmJoinGrock: global grock crs_version new client 0x2aaaac1ddbd0 with con 0x2aaaac1dd8d0, requested num -1
[ CSSD]2010-12-17 11:14:39.774 [1223936320] >TRACE: clssgmJoinGrock: ignoring grock join before fatal for grock (-1/0x800/crs_version)
[ CSSD]2010-12-17 11:14:40.019 [1223936320] >TRACE: clssgmRegisterClient: proc(3/0x2aaaac1d88e0), client(57/0x2aaaac1ddfc0)
[ CSSD]2010-12-17 11:14:40.019 [1223936320] >TRACE: clssgmExecuteClientRequest: GRKJOIN recvd from client 57 (0x2aaaac1ddfc0)
[ CSSD]2010-12-17 11:14:40.019 [1223936320] >TRACE: clssgmJoinGrock: global grock EVMDMAIN new client 0x2aaaac1ddfc0 with con 0x2aaaac1ddcc0, requested num 1
[ CSSD]2010-12-17 11:14:40.019 [1223936320] >TRACE: clssgmJoinGrock: ignoring grock join before fatal for grock (1/0x200/EVMDMAIN)
[ CSSD]2010-12-17 11:14:40.073 [1129527616] >TRACE: clssgmDispatchCMXMSG: msg type(5) src(2) dest(65535) size(420) tag(00000000) incarnation(187985787)
[ CSSD]2010-12-17 11:14:40.073 [1129527616] >TRACE: clssgmHandleJoinUpdate: (src 2/2) grock SRVM.DATABASE.NODEAPPS.racb01, gid 890, gin 1, member 0
[ CSSD]2010-12-17 11:14:40.073 [1129527616] >TRACE: clssgmAddMember: granted member(0) flags(0x1001) node(2) grock (0x15d08ac0/SRVM.DATABASE.NODEAPPS.racb01)
[ CSSD]2010-12-17 11:14:40.073 [1129527616] >TRACE: clssgmCommonAddMember: global lock grock SRVM.DATABASE.NODEAPPS.racb01 member(0/Remote) node(2) flags 0x1001 0x1000001
[ CSSD]2010-12-17 11:14:40.073 [1129527616] >TRACE: clssgmDispatchCMXMSG: msg type(5) src(2) dest(65535) size(420) tag(00000000) incarnation(187985787)
[ CSSD]2010-12-17 11:14:40.073 [1129527616] >TRACE: clssgmHandleJoinUpdate: (src 2/2) grock SRVM.DATABASE.NODEAPPS.racb04, gid 891, gin 1, member 0
[ CSSD]2010-12-17 11:14:40.073 [1129527616] >TRACE: clssgmAddMember: granted member(0) flags(0x1001) node(2) grock (0x15d09050/SRVM.DATABASE.NODEAPPS.racb04)
[ CSSD]2010-12-17 11:14:40.073 [1129527616] >TRACE: clssgmCommonAddMember: global lock grock SRVM.DATABASE.NODEAPPS.racb04 member(0/Remote) node(2) flags 0x1001 0x1000001
[ CSSD]2010-12-17 11:14:40.077 [1129527616] >TRACE: clssgmDispatchCMXMSG: msg type(6) src(2) dest(65535) size(352) tag(00000000) incarnation(187985787)
[ CSSD]2010-12-17 11:14:40.077 [1129527616] >TRACE: clssgmHandleExitUpdate: (src 2) grock SRVM.DATABASE.NODEAPPS.racb01, member 0
[ CSSD]2010-12-17 11:14:40.077 [1129527616] >TRACE: clssgmRemoveMember: grock SRVM.DATABASE.NODEAPPS.racb01, member number 0 (0x15d08e90) node number 2 state 0x0 member refcnt 1 grock type 3
[ CSSD]2010-12-17 11:14:40.077 [1129527616] >TRACE: clssgmDispatchCMXMSG: msg type(6) src(2) dest(65535) size(352) tag(00000000) incarnation(187985787)
[ CSSD]2010-12-17 11:14:40.077 [1129527616] >TRACE: clssgmHandleExitUpdate: (src 2) grock SRVM.DATABASE.NODEAPPS.racb04, member 0
[ CSSD]2010-12-17 11:14:40.077 [1129527616] >TRACE: clssgmRemoveMember: grock SRVM.DATABASE.NODEAPPS.racb04, member number 0 (0x15d09420) node number 2 state 0x0 member refcnt 1 grock type 3
[ CSSD]2010-12-17 11:14:40.533 [1265895744] >TRACE: kgzf_dskm_conn4: unable to connect to master diskmon in 60120 msec
[ CSSD]2010-12-17 11:14:40.533 [1265895744] >TRACE: kgzf_send_main1: connection to master diskmon timed out
[ CSSD]2010-12-17 11:14:40.534 [1286875456] >TRACE: KGZF: Fatal diskmon condition, IO fencing is not available. For additional error info look at the master diskmon log file (diskmon.log)

[ CSSD]2010-12-17 11:14:40.534 [1286875456] >ERROR: ASSERT clsssc.c 2471
[ CSSD]2010-12-17 11:14:40.534 [1286875456] >ERROR: clssscSAGEInitFenceCompl: Fence completion failed, rc 56859
[ CSSD]2010-12-17 11:14:40.534 [1223936320] >TRACE: clssgmClientShutdown: Aborting client (0x2aaaac1d3e60) proc (0x2aaaac1d3390)
[ CSSD]2010-12-17 11:14:40.534 [1223936320] >TRACE: clssgmClientShutdown: waited 0 seconds on 1 IO capable clients
[ CSSD]2010-12-17 11:14:40.534 [1223936320] >TRACE: clssgmClientShutdown: Waiting for I/O capable proc (0x2aaaac1d3390), pid (7826)
[ CSSD]2010-12-17 11:14:40.636 [1223936320] >TRACE: clssgmClientShutdown: Waiting for I/O capable proc (0x2aaaac1d3390), pid (7826)
[ CSSD]2010-12-17 11:14:40.637 [1223936320] >TRACE: clssgmDeadProc: proc 0x2aaaac1d3390
[ CSSD]2010-12-17 11:14:40.637 [1223936320] >TRACE: clssgmFenceClient: fencing client (0x2aaaac1d3e60), member 1 in group #CSS_CLSSOMON, no share, death 1, SAGE 0
[ CSSD]2010-12-17 11:14:40.637 [1223936320] >TRACE: clssgmUnreferenceMember: global grock #CSS_CLSSOMON member 1 refcount is 1
[ CSSD]2010-12-17 11:14:40.637 [1223936320] >TRACE: clssgmFenceProcessDeath: client (0x2aaaac1d3e60) pid 7826 undead
[ CSSD]2010-12-17 11:14:40.637 [1223936320] >TRACE: clssgmQueueFenceForCheck: (0x2aaaac1dd750) Death check for object type 3, pid 7826
[ CSSD]2010-12-17 11:14:40.637 [1223936320] >TRACE: clssgmDestroyProc: cleaning up proc(0x2aaaac1d3390) con(0x2aaaac1ca990) skgpid ospid 7826 with 0 clients, refcount 0
[ CSSD]2010-12-17 11:14:40.637 [1213446464] >TRACE: clssgmFenceCompletion: (0x2aaaac1dd750) process death fence completed for process 7826, object type 3
[ CSSD]2010-12-17 11:14:40.637 [1213446464] >TRACE: clssgmTermMember: Terminating member 1 (0x2aaaac1d39e0) in grock #CSS_CLSSOMON
[ CSSD]2010-12-17 11:14:40.637 [1213446464] >TRACE: clssgmAllocateRPCIndex: allocated rpc 1 (0x2aaaaad00090)
[ CSSD]2010-12-17 11:14:40.637 [1223936320] >WARNING: clssgmClientShutdown: graceful shutdown completed.
[ CSSD]2010-12-17 11:14:40.637 [1213446464] >TRACE: clssgmRPC: rpc 0x2aaaaad00090 (RPC#1) tag(1002a) sent to node 2
[ CSSD]2010-12-17 11:14:40.638 [1129527616] >TRACE: clssgmDispatchCMXMSG: msg type(6) src(2) dest(65535) size(352) tag(0001002a) incarnation(187985787)
[ CSSD]2010-12-17 11:14:40.638 [1129527616] >TRACE: clssgmHandleExitUpdate: (src 2) grock #CSS_CLSSOMON, member 1
[ CSSD]2010-12-17 11:14:40.638 [1129527616] >TRACE: clssgmRPCDone: rpc 0x2aaaaad00090 (RPC#1) state 6, flags 0x100
[ CSSD]2010-12-17 11:14:40.638 [1140017472] >TRACE: clssnmvDoWork: type 7 for disk 0
[ CSSD]2010-12-17 11:14:40.638 [1129527616] >TRACE: clssgmDelMemCmpl: rpc 0x2aaaaad00090, ret 0, client (nil) member 0x2aaaac1d39e0
[ CSSD]2010-12-17 11:14:40.638 [1129527616] >TRACE: clssgmFreeRPCIndex: freeing rpc 1
[ CSSD]2010-12-17 11:14:40.638 [1129527616] >TRACE: clssgmRemoveMember: grock #CSS_CLSSOMON, member number 1 (0x2aaaac1d39e0) node number 1 state 0x10 member refcnt 0 grock type 2
[ CSSD]2010-12-17 11:14:40.638 [1223936320] >TRACE: clssnmSendManualShut: Notifying all nodes that this node has been manually shut down
Looking in the diskmon.log shows
[ DISKMON]

I/O Fencing and SKGXP HA monitoring daemon -- Version 1.0.0.0
Process 20415 started on 12/17/2010 at 11:14:41.125

[ DISKMON] 12/17/2010 11:14:41.130 dskm main: starting up
[ DISKMON] 12/17/2010 11:14:41.131 [20415:1122142528] dskm_rac_thrd_main: running
[ DISKMON] 12/17/2010 11:14:41.131 [20415:1898995808] dskm_rac_thrd_creat2: got the post from the css event handling thread
[ DISKMON] 12/17/2010 11:14:41.131 [20415:1898995808] dskm main10: skgznp_rm_pipe failed with error 56825
[ DISKMON] 12/17/2010 11:14:41.131 [20415:1898995808] dskm_main10: error 56825 at location skgznprmpipe - unlink() - Operation not permitted
[ DISKMON] 12/17/2010 11:14:41.131 [20415:1898995808] dskm main11: skgznp_create(default pipe) failed with error 56810
[ DISKMON] 12/17/2010 11:14:41.131 [20415:1898995808] dskm_main11: error 56810 at location skgznpcre3 - bind() - Address already in use
[ DISKMON]
Process 20415 exiting on 12/17/2010 at 11:14:41.131
Look in /tmp directory for a file named .oracle_master_diskmon and when this error occurs it's ownership is set to owner root and group root. To fix this issue change the ownership to oracle:oinstall as per metalink note 870832.1 CRS Startup Issues Due to DISKMON.

Removing the /tmp/.oracle_master_diskmon file and rebooting the node also works.